Compressing an Index

How to remove obsolete data from an index and optimize the index structure.

Remarks

Compressing an index optimizes the index structure, removing obsolete data and defragmenting search structures for better performance.

Obsolete data and fragmentation

Obsolete data comes from documents that are reindexed and from documents that are removed from the index. In both cases, the data is not removed from the index but is tagged as "obsolete" for removal the next time the index is compressed.

Fragmentation occurs when an index is updated. Each time an index update commits, fragmentation of the index increases. For example, if you build an index and then update it 5 times, fragmentation would be at least 6. It might be more, because some updates might involve more than one commit (for example, if the update involves a lot of data).

The impact of fragmentation also depends on how much of the index data is fragmented. On each update, dtSearch appends data to what is already in the index, without affecting fragmentation of the existing data. For example, if you start with a 50 GB index that is fully compressed, and then update it 10 times, each time adding a one-page document, fragmentation will be 11 but only a tiny percentage of the index data will be fragmented.
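The arithmetic in the two examples above can be sketched in a few lines of Python. This is only an illustrative model, not dtSearch code: the function names are hypothetical, and the figure of about 4 KB of index data per one-page document is an assumption chosen for the example.

```python
def fragmentation_after(updates, commits_per_update=1):
    """Minimum fragmentation: 1 for the initial build plus 1 per commit.

    Hypothetical helper. Real updates may need more than one commit,
    so treat the result as a lower bound.
    """
    return 1 + updates * commits_per_update


def fragmented_fraction(compressed_bytes, appended_bytes):
    """Share of the index data that was appended after the last compress."""
    appended_total = sum(appended_bytes)
    return appended_total / (compressed_bytes + appended_total)


# Build once, then update 5 times: fragmentation is at least 6.
print(fragmentation_after(5))

# 50 GB compressed index, then 10 one-page updates (assumed ~4 KB each):
# fragmentation is 11, but almost none of the data is fragmented.
print(fragmentation_after(10))
print(fragmented_fraction(50 * 1024**3, [4096] * 10))
```

The point of the second function is that the fragmentation count alone does not say how much search performance is at stake; the proportion of appended data matters as well.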

When to compress

The time required to compress an index is proportional to the size of the index, and depends on both CPU and hard disk speed. For a rough estimate, assume a rate of about 150 MB per minute: divide the total size of the index in MB by 150 to get the compression time in minutes.
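As a worked example of the estimate above, the following Python sketch applies the divide-by-150 rule. The function name and the `mb_per_minute` parameter are hypothetical, and actual times will vary with CPU and disk speed.

```python
def estimate_compress_minutes(index_size_mb, mb_per_minute=150.0):
    """Rough compression time in minutes: index size divided by ~150 MB/minute."""
    return index_size_mb / mb_per_minute


# A 1,500 MB index: about 10 minutes.
print(estimate_compress_minutes(1500))

# A 50 GB index (51,200 MB): roughly 341 minutes, or a little under 6 hours.
minutes = estimate_compress_minutes(50 * 1024)
print(round(minutes), "minutes, about", round(minutes / 60, 1), "hours")
```

An estimate of several hours for a large index is one reason compression is typically scheduled for off-hours.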

For frequently updated indexes, compression is usually done on a schedule (once a day during off-hours, or once a week over the weekend). For indexes that are not updated frequently in small batches, a simple rule of thumb is to compress after initial creation and then whenever the amount of data indexed increases by 10-20% or the amount of obsolete data exceeds 20%.
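The rule of thumb above could be expressed as a simple check, sketched here in Python. Everything in this block is hypothetical: the function name, the way sizes and the obsolete fraction are supplied, and the choice of 10% (the low end of the 10-20% range) as the default growth threshold.

```python
def should_compress(size_at_last_compress, current_size, obsolete_fraction,
                    growth_threshold=0.10, obsolete_threshold=0.20):
    """Return True when the rule of thumb suggests compressing.

    size_at_last_compress, current_size: amount of data indexed, in any
    consistent unit. obsolete_fraction: obsolete data as a share of the
    index (0.0 to 1.0). How these values are obtained is left to the caller.
    """
    if size_at_last_compress <= 0:
        # Nothing to compare against: treat as a new index that has never
        # been compressed.
        return True
    growth = (current_size - size_at_last_compress) / size_at_last_compress
    return growth >= growth_threshold or obsolete_fraction > obsolete_threshold


print(should_compress(100, 105, 0.05))  # 5% growth, 5% obsolete
print(should_compress(100, 112, 0.00))  # 12% growth
print(should_compress(100, 100, 0.25))  # 25% obsolete
```

Raising `growth_threshold` toward 0.20 trades less frequent compression for somewhat more fragmentation between runs.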

Document ids

Compressing an index reassigns all document ids in the index to consecutive ids starting with 1, unless the dtsIndexKeepExistingDocIds flag is set in the IndexJob.
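To illustrate why document ids stored outside the index can become stale after compression, here is a small Python sketch of renumbering to consecutive ids. It is not dtSearch code. The function name is hypothetical, and the sketch assumes that the surviving documents keep their relative order, which the text above does not state.

```python
def renumber_doc_ids(surviving_ids):
    """Map old document ids to consecutive new ids starting with 1.

    Assumption for illustration only: documents keep their relative
    order when they are renumbered.
    """
    return {old: new for new, old in enumerate(sorted(surviving_ids), start=1)}


# Documents 2, 5 and 6 were removed, so ids 1, 3, 4 and 7 remain.
mapping = renumber_doc_ids([1, 3, 4, 7])
print(mapping)  # {1: 1, 3: 2, 4: 3, 7: 4}
```

In this model only document 1 keeps its id, so an application that caches document ids would need either to refresh them after each compress or to set the dtsIndexKeepExistingDocIds flag.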

Group

Building and Maintaining Indexes

API

Language	API
C/C++	DIndexJob or dtsIndexJob, set action.compress = true
.NET (C#, VB.NET)	dtSearch::Engine::IndexJob, set ActionCompress = true
Java	com.dtsearch.engine.IndexJob, setActionCompress(true)
COM (Visual Basic, ASP)	IIndexJob (IndexJob) object, set ActionCompress = true