dtSearch Text Retrieval Engine Programmer's Reference
Performance and capacity
Building and Maintaining Indexes

Factors affecting indexing performance

Remarks
Indexing speed

Indexing speed varies considerably with the type of data being indexed and the hardware being used. Typical indexing speed is between 30 and 120 MB/minute. A sample benchmark:

Data: 170 GB, consisting of 4,373,004 mixed-type files (HTML, office documents, text) 

Indexing time: 24.7 hours (6.8 GB/hour) 

Unique words: 48,508,831 

Index size: 12% of original document size 

Hardware: Pentium® 4 Processor 550 (3.40 GHz, 800 FSB), 2 GB RAM, internal SATA RAID-0 drives

For more samples, see http://www.dtsearch.com/index7.html
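
As a rough planning aid, the throughput range above can be turned into a time estimate. The Python sketch below is not part of the dtSearch API; the function name and the assumption that 1 GB equals 1024 MB are illustrative only, and real speed depends on the data and hardware.

```python
def estimate_indexing_hours(data_gb, mb_per_minute):
    """Return the estimated hours needed to index data_gb gigabytes of data."""
    if mb_per_minute <= 0:
        raise ValueError("mb_per_minute must be positive")
    total_mb = data_gb * 1024  # assumption: 1 GB = 1024 MB
    return total_mb / mb_per_minute / 60


if __name__ == "__main__":
    # Range quoted in the text: 30 to 120 MB/minute.
    slowest = estimate_indexing_hours(170, 30)
    fastest = estimate_indexing_hours(170, 120)
    print(f"170 GB: between {fastest:.1f} and {slowest:.1f} hours")
```

For the 170 GB benchmark data set, the upper end of the range gives an estimate close to the measured 24.7 hours.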

Update size

The dtSearch indexer is designed to operate best when indexing large volumes of text at once. Therefore, it is preferable to index data in batches that are as large as possible. Indexing in small batches makes each update slower and also results in a much more fragmented index structure (see below).

Fragmentation

The more often an index is updated, the more fragmented it becomes. Compressing an index, or completely rebuilding it, eliminates fragmentation. When updating an index, it is much more efficient to add many files at once because adding each file individually greatly increases the resulting fragmentation.
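
One way to follow this advice in an application is to queue new files and run an update only when enough data has accumulated. The sketch below illustrates that idea in Python. The class name, the `run_update` callback (which would start the actual index update), and the size threshold are all hypothetical and not part of the dtSearch API.

```python
class UpdateBatcher:
    """Collects pending files and hands them to the indexer in large batches."""

    def __init__(self, run_update, min_batch_bytes):
        self.run_update = run_update            # hypothetical callback: indexes a list of paths
        self.min_batch_bytes = min_batch_bytes  # assumed threshold; tune for your data
        self.pending = []
        self.pending_bytes = 0

    def add(self, path, size_bytes):
        """Queue one file; trigger an update only once the batch is large enough."""
        self.pending.append(path)
        self.pending_bytes += size_bytes
        if self.pending_bytes >= self.min_batch_bytes:
            self.flush()

    def flush(self):
        """Index everything queued so far in a single update."""
        if self.pending:
            self.run_update(list(self.pending))
            self.pending.clear()
            self.pending_bytes = 0
```

Calling `flush()` on a schedule (for example, nightly) keeps queued files from waiting indefinitely when the threshold is not reached.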

Capacity and size

There is no specific limit on the number of documents or words in an index, nor does dtSearch place any limits on the size of a single document or the number of words or paragraphs in a document, except that a single document cannot be longer than 2 GB.

The version 7 index format can hold about 1 terabyte of documents in a single index. The version 6 index format can hold about 4-8 GB of documents. If the capacity limit for an index is exceeded during an index update, the index update will halt and the error code dtsErIndexFull will be returned.
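
An application can check a document set against these limits before starting an update. The Python sketch below is only an illustration: the function and constant names are hypothetical, the limits are taken from the text above (the 1 terabyte figure is approximate), and interpreting them in binary units is an assumption.

```python
MAX_DOCUMENT_BYTES = 2 * 1024**3   # "2 GB" per document; binary units are an assumption
APPROX_V7_INDEX_BYTES = 1024**4    # "about 1 terabyte"; an approximate figure


def preflight(file_sizes):
    """Check a {path: size_in_bytes} mapping against the documented limits.

    Returns (oversized_paths, may_exceed_index_capacity).
    """
    oversized = [path for path, size in file_sizes.items()
                 if size > MAX_DOCUMENT_BYTES]
    total = sum(file_sizes.values())
    return oversized, total > APPROX_V7_INDEX_BYTES
```

A check like this does not replace handling dtsErIndexFull, since the capacity figure is approximate.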

Index size in relation to the original documents varies considerably depending on document size and type, and on the number of documents. In the best case (many word processing documents, each more than a few thousand words in length), the index may be 15-25% of the size of the original files. Some factors that can make the ratio worse include:

  • short documents, because the per-document overhead becomes more costly
  • databases, because each row in a database is usually indexed as a separate (and generally very small) document
  • compressed files such as ZIP or PDF files, because they contain a higher density of text-per-kilobyte than uncompressed files
  • numeric data, because the numbers generate more unique words than normal text

To make indexes smaller, (1) use a noise word list optimized for your text to ignore as many common words as possible; (2) use the "treat-hyphens-as-spaces" indexing option (the default); (3) if your application permits, do not index numbers (there is an indexing option to suppress all numbers), or disable numeric range searching using the dtsoTfSkipNumericValues flag in Options.TextFlags; (4) avoid indexing binary files such as executable programs.

IndexJob settings

Two IndexJob settings that can significantly affect indexing performance are:

  • IndexJob.MaxMemToUseMB
  • IndexJob.AutoCommitIntervalMB

MaxMemToUseMB controls the size of the memory buffers that dtSearch can use to sort words. If possible, dtSearch will use memory for all sorting operations; otherwise, some disk-based buffers will be used. For large updates (10 GB or more of text), some disk-based sort buffers are always necessary and there is little benefit to MaxMemToUseMB values above 512. 

IndexJob.AutoCommitIntervalMB determines how often index updates are forced to commit. Higher values improve indexing performance.
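
The guidance on MaxMemToUseMB can be expressed as a simple rule of thumb. The Python sketch below is one possible reading of the text, not an official dtSearch recommendation: the function name and the `memory_budget_mb` parameter are hypothetical, and the result would still have to be assigned to IndexJob.MaxMemToUseMB through whichever API the application uses.

```python
LARGE_UPDATE_GB = 10        # from the text: "10 GB or more of text"
LARGE_UPDATE_CAP_MB = 512   # from the text: little benefit above 512


def suggest_max_mem_mb(update_size_gb, memory_budget_mb):
    """Suggest a MaxMemToUseMB value within the caller's memory budget."""
    if update_size_gb >= LARGE_UPDATE_GB:
        # Disk-based sort buffers are needed anyway, so cap the setting.
        return min(memory_budget_mb, LARGE_UPDATE_CAP_MB)
    # Smaller updates may sort entirely in memory if the budget allows.
    return memory_budget_mb
```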

Data source indexing

When indexing using the Data Source API, the data source implementation may add significant overhead, especially when using the COM interface (because the COM API layer is slow). Using a faster API such as the C++ API or the .NET API can minimize this effect.

Hardware resources and configuration

Memory. The indexer uses memory for sort buffers and for efficient access to index structures, so the amount of memory available can have a major effect on indexing performance. For large indexes (indexes of more than 50 GB of text), 2 GB of installed RAM is recommended.

Disk space. For best performance when building large indexes, ensure that the drive where the index is located has free space of more than 60% of the size of the documents to be indexed. For example, to index 100 GB of text, there should be more than 60 GB of free space.
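
The free-space guideline is easy to check before starting a large build. The Python sketch below is an illustration only: the function names are hypothetical, and the 60% figure is the guideline stated above, not a value read from dtSearch.

```python
import shutil

FREE_SPACE_PERCENT = 60  # guideline from the text: more than 60% of document size


def required_free_bytes(document_bytes):
    """Return the free space the guideline calls for, in bytes."""
    return document_bytes * FREE_SPACE_PERCENT // 100


def has_enough_space(index_dir, document_bytes):
    """Return True if the drive holding index_dir satisfies the guideline."""
    free = shutil.disk_usage(index_dir).free
    return free > required_free_bytes(document_bytes)
```

The check should be run against the drive that will hold the index, not the drive that holds the documents.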

Index location. Indexing is much faster if the index is physically located on the same computer where the dtSearch Engine is running. Building an index across a network connection (with the index located on a remote drive) will be substantially slower. 

Document location. The location of the documents has relatively little effect on performance. dtSearch reads the documents once to index them, so as long as there is a reasonably fast network connection to the documents, they can be anywhere. 

Other. Use of NTFS folder compression or NTFS encryption will substantially slow disk access and will severely impair indexing performance. 

Using multiple computers to index

To minimize indexing time, you can index portions of the data on separate computers. You can then either keep the indexes separate (a single search can span any number of indexes) or merge the resulting indexes into one large index. Merging the data into a single new index produces a fully optimized index structure, which will make searching substantially faster.
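
When dividing the data, it helps to give each computer a similar amount of text so that they finish at about the same time. The Python sketch below shows one simple way to do that, a greedy largest-first split by file size. This is a generic balancing technique chosen for illustration, not something dtSearch provides or prescribes, and the function name is hypothetical.

```python
import heapq


def split_across_machines(file_sizes, machine_count):
    """Split a {path: size_in_bytes} mapping into machine_count balanced lists."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    shards = [[] for _ in range(machine_count)]
    # Min-heap of (bytes assigned so far, shard index).
    heap = [(0, i) for i in range(machine_count)]
    heapq.heapify(heap)
    # Place the largest files first, always on the least-loaded shard.
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        shards[i].append(path)
        heapq.heappush(heap, (load + size, i))
    return shards
```

Each resulting list can then be indexed on its own computer, and the indexes either searched together or merged as described above.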

Copyright (c) 1995-2008 dtSearch Corp. All rights reserved.