Factors affecting indexing performance.
Indexing speed can vary considerably depending on the type of data being indexed and the hardware being used. A sample benchmark:
- Data: 170 GB, consisting of 5,229,165 mixed-type files (HTML, office documents, text)
- Indexing time: 4 hours
- Unique words: 68,372,337
- Index size: 14% of original document size
For information on factors affecting indexing speed, see Optimizing indexing of large document collections.
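To put the sample benchmark in perspective, the figures above can be turned into rough throughput numbers. The following is a minimal sketch using only the benchmark values; the function and key names are illustrative, and the results describe that one benchmark rather than expected performance on other data or hardware.

```python
# Figures copied from the sample benchmark above.
DATA_GB = 170
FILE_COUNT = 5_229_165
INDEXING_HOURS = 4
INDEX_SIZE_FRACTION = 0.14  # index size as a fraction of original document size


def benchmark_summary():
    """Derive rough throughput figures from the sample benchmark."""
    seconds = INDEXING_HOURS * 3600
    return {
        "gb_per_hour": DATA_GB / INDEXING_HOURS,
        "files_per_second": FILE_COUNT / seconds,
        "index_size_gb": DATA_GB * INDEX_SIZE_FRACTION,
    }


if __name__ == "__main__":
    for name, value in benchmark_summary().items():
        print(f"{name}: {value:.1f}")
```

For this benchmark the arithmetic works out to 42.5 GB per hour, roughly 363 files per second, and an index of about 23.8 GB.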
The dtSearch indexer is designed to operate best when indexing large volumes of text at once. Therefore, it is preferable to index data in batches that are as large as possible. Indexing in small batches makes each update slower and also results in a much more fragmented index structure (see below).
The more often an index is updated, the more fragmented it becomes. Compressing an index, or completely rebuilding it, eliminates fragmentation. When updating an index, it is much more efficient to add many files at once because adding each file individually greatly increases the resulting fragmentation.
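One way to act on this advice is to accumulate pending files and only hand them to the indexer once a batch is large enough. The sketch below shows that grouping step only. It does not call dtSearch, and the name batch_by_size and the min_batch_bytes threshold are hypothetical, chosen for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_by_size(
    files: Iterable[Tuple[str, int]], min_batch_bytes: int
) -> Iterator[List[str]]:
    """Group (path, size_in_bytes) pairs into batches.

    A batch is released once its combined size reaches min_batch_bytes.
    The final batch may be smaller, because it holds whatever is left over.
    """
    batch: List[str] = []
    batch_bytes = 0
    for path, size in files:
        batch.append(path)
        batch_bytes += size
        if batch_bytes >= min_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch


if __name__ == "__main__":
    pending = [("a.html", 40), ("b.docx", 70), ("c.txt", 10), ("d.txt", 5)]
    for group in batch_by_size(pending, min_batch_bytes=100):
        # Each group would be passed to a single index update.
        print(group)
```

Each yielded group corresponds to one index update, so a higher threshold means fewer updates and, per the text above, less fragmentation.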
dtSearch places no specific limit on the number of documents or words in an index, or on the number of words or paragraphs in a document. The only limit on document size is that a single document cannot be longer than 2 GB.
A single index can hold about 1 terabyte of documents. If the capacity limit for an index is exceeded during an index update, the update will halt and the error code dtsErIndexFull will be returned.
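A collection can be checked against these two limits before indexing starts. The sketch below is a rough planning aid, not part of dtSearch: the constants assume binary units (2 GB as 2 x 1024^3 bytes, 1 terabyte as 1024^4 bytes), the 1 terabyte figure is only approximate, and the function names are hypothetical.

```python
import math
from typing import Dict, List

# Assumption: binary units. The documentation does not say which unit is meant.
MAX_DOCUMENT_BYTES = 2 * 1024**3       # single-document limit (2 GB)
APPROX_INDEX_CAPACITY_BYTES = 1024**4  # "about 1 terabyte", an approximation


def oversized_documents(sizes: Dict[str, int]) -> List[str]:
    """Return the paths of documents larger than the single-document limit."""
    return sorted(path for path, size in sizes.items() if size > MAX_DOCUMENT_BYTES)


def indexes_needed(total_bytes: float,
                   capacity: int = APPROX_INDEX_CAPACITY_BYTES) -> int:
    """Estimate how many indexes a collection of total_bytes would require."""
    return max(1, math.ceil(total_bytes / capacity))


if __name__ == "__main__":
    print(indexes_needed(170 * 1024**3))   # the 170 GB benchmark collection
    print(indexes_needed(2.5 * 1024**4))   # a 2.5 TB collection
    print(oversized_documents({"small.txt": 1024, "huge.log": 3 * 1024**3}))
```

Because the capacity figure is approximate, it is safer to leave headroom than to plan an index right up to the estimate.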
Index size in relation to the original documents varies considerably depending on document size and type, and on the number of documents. For more information see: Index Size.
Two IndexJob settings that can significantly affect indexing performance are:
- IndexJob.MaxMemToUseMB
- IndexJob.AutoCommitIntervalMB
If MaxMemToUseMB is zero, dtSearch will decide the amount of memory to use based on the estimated amount of text to be indexed and the amount of system memory available.
If possible, dtSearch will use memory for all sorting operations; otherwise, some disk-based buffers will be used. For large updates, some disk-based sort buffers are always necessary and there is little benefit to MaxMemToUseMB values above 512 (32-bit) or 2048 (64-bit).
MaxMemToUseMB does not affect other memory that may be used during indexing for other purposes such as parsing document formats.
IndexJob.AutoCommitIntervalMB determines how often index updates are forced to commit. Higher values improve indexing performance. Unless the application requires commits before all data is indexed, the recommended value for AutoCommitIntervalMB is zero to avoid extra commits.
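The guidance for these two settings can be summarized as a small helper. The sketch below uses a hypothetical IndexJobSettings dataclass as a stand-in; it is not the dtSearch IndexJob class, and it only mirrors the two setting names discussed above. Clamping the memory value is a design choice for the example, since the text says only that higher values bring little benefit, not that they are rejected.

```python
from dataclasses import dataclass


@dataclass
class IndexJobSettings:
    """Hypothetical stand-in for the two IndexJob settings discussed above."""
    max_mem_to_use_mb: int = 0        # 0 lets dtSearch decide
    auto_commit_interval_mb: int = 0  # 0 avoids extra commits


def recommended_settings(requested_mem_mb: int = 0,
                         is_64bit: bool = True) -> IndexJobSettings:
    """Apply the guidance above to a requested memory value.

    Zero (or a negative value, treated here as zero) is passed through so
    that dtSearch chooses the memory size. Larger requests are clamped to
    the point beyond which the text reports little benefit.
    """
    ceiling_mb = 2048 if is_64bit else 512
    if requested_mem_mb <= 0:
        mem_mb = 0
    else:
        mem_mb = min(requested_mem_mb, ceiling_mb)
    return IndexJobSettings(max_mem_to_use_mb=mem_mb, auto_commit_interval_mb=0)


if __name__ == "__main__":
    print(recommended_settings())                      # defaults: both zero
    print(recommended_settings(8192))                  # clamped for 64-bit
    print(recommended_settings(8192, is_64bit=False))  # clamped for 32-bit
```

An application that needs commits before all data is indexed would set a non-zero commit interval instead, accepting the slower indexing that the extra commits cause.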