Index size and performance

Article: dts0142

Applies to: dtSearch 7 and later

Overview

dtSearch works by building an index of your documents. You can create as many indexes as you want, and in a search you can search any or all of them by clicking on the ones you want to search.

An index is generally about 1/8 to 1/3 the size of the original documents. The ratio of the index size to the size of the original documents depends on the type of documents, the size of the documents, and certain indexing options (discussed below). The ratio is better for word processing documents (which contain less text per kilobyte than plain text files) and large files (which reduce the effect of the per-document overhead).
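As a rough planning aid, the rule of thumb above can be turned into a quick estimate. The sketch below is only arithmetic on the 1/8 to 1/3 range quoted in this article. The function name and the 240 GB figure are made up for illustration, and real results will vary with document type and indexing options.

```python
from fractions import Fraction

# Rule-of-thumb bounds quoted in this article: 1/8 to 1/3 of the source size.
LOW_RATIO = Fraction(1, 8)
HIGH_RATIO = Fraction(1, 3)

def estimate_index_size_gb(source_gb):
    """Return (low, high) estimated index size in GB for a given source size."""
    low = float(source_gb * LOW_RATIO)
    high = float(source_gb * HIGH_RATIO)
    return low, high

low, high = estimate_index_size_gb(240)
print(f"240 GB of documents -> roughly {low:.0f} to {high:.0f} GB of index")
# prints: 240 GB of documents -> roughly 30 to 80 GB of index
```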

A single dtSearch index can hold over 1 terabyte of text. Separately from that limit on the total amount of text, a single index can hold at most 2 billion documents.

For information on factors affecting indexing speed for very large document collections, see "Optimizing indexing of large document collections".

How can the size of an index be minimized?

The following are some option settings that affect the size of a dtSearch index.

Caching Documents and Text

In addition to storing word locations to enable fast searching, dtSearch indexes can also store the text of documents to make them open faster after a search. dtSearch indexes can optionally store documents in either, or both, of two ways: (1) the entire original file can be stored, or (2) just the text of the file can be stored. Option settings in the "Create Index (Advanced)" dialog box enable these features when an index is created. In the developer API, the indexing flags dtsIndexCacheText and dtsIndexCacheOriginalFile enable caching.

Cached data is stored ZIP-compressed in the index. Enabling either type of caching will increase the index size considerably.

dtSearch Desktop: Index > Create (Advanced). The default is for nothing to be cached in the index.

dtSearch Developer API: Set the flags dtsIndexCacheText and dtsIndexCacheOriginalFile in IndexJob to enable caching.

Database files

dtSearch normally indexes database files such as Microsoft Access or CSV with each row indexed as a separate document. As a result, if a CSV file has 500,000 rows, then that file will be indexed as 500,000 documents. If you do not need to be able to search for every row of a database file as a separate document, you can instead tell dtSearch to index databases as plain text. With this option selected, database files such as Microsoft Access and CSV files are indexed without treating each row as a separate document, and without including field attributes. All of the text, including field names, remains searchable, but database content is combined into a single plain text document, which makes indexing and searching faster.

dtSearch Desktop: Options > Preferences > Indexing Options, "Index databases as plain text"

dtSearch Developer API: Set the flag dtsoFfDatabasesAsText in Options.TextFlags
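To see how quickly the document count grows with row-per-document indexing, the following sketch counts how many documents a CSV file would contribute in each mode. It is an illustration only and does not call dtSearch. The function name is hypothetical, and it assumes the first row of the file is a header rather than a record.

```python
import csv
import io

def count_documents(csv_text, as_plain_text=False):
    """Count how many documents a CSV file would contribute to an index.

    Assumes the first row is a header and, following the article, that each
    data row becomes its own document unless the file is indexed as plain text.
    """
    if as_plain_text:
        return 1  # the whole file is one plain text document
    rows = list(csv.reader(io.StringIO(csv_text)))
    return max(len(rows) - 1, 0)  # exclude the header row

sample = "id,name\n1,Ada\n2,Grace\n3,Linus\n"
print(count_documents(sample))                      # 3
print(count_documents(sample, as_plain_text=True))  # 1
```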

XML Files

dtSearch normally indexes the complex, nested field structure of XML files completely, enabling search requests to consider all or part of the XML hierarchy. (For more information on field searching in XML, see "Field Searching".) This additional field information adds complexity to the index structure, affecting both indexing and searching performance, so it is best to turn it off when the field structure is not needed for searching. When dtSearch indexes XML as plain text using the options below, it still parses the XML and interprets elements such as entities so that the text is correctly extracted. All of the words in the XML, including field names, remain searchable.

dtSearch Desktop: Options > Preferences > Indexing Options, "Index XML files as plain text"

dtSearch Developer API: Set the flag dtsoFfXmlAsText in Options.TextFlags
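The idea of treating XML as plain text while still parsing it can be sketched with Python's standard XML parser. This is not dtSearch's implementation. It is a minimal illustration, with a hypothetical function name, of how parsed XML can be flattened into one stream of words in which entities are decoded and field names remain searchable, while the nesting itself is discarded.

```python
import xml.etree.ElementTree as ET

def xml_to_plain_text(xml_source):
    """Flatten XML into one list of words: element names plus decoded text."""
    words = []

    def walk(element):
        words.append(element.tag)               # field name stays searchable
        if element.text:
            words.extend(element.text.split())  # parser has decoded entities
        for child in element:
            walk(child)
            if child.tail:                      # text following a child element
                words.extend(child.tail.split())

    walk(ET.fromstring(xml_source))
    return words

doc = "<book><title>Tom &amp; Jerry</title><year>1940</year></book>"
print(xml_to_plain_text(doc))
# ['book', 'title', 'Tom', '&', 'Jerry', 'year', '1940']
```

Note that the output carries no record of which word belonged to which field, which is the information a field-aware index would have to store in addition.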

Binary Files

The binary files option setting controls the way dtSearch treats files in a format that it does not recognize as a document. There are three options: (1) index the files completely, (2) filter out only the text of the files, and (3) skip the files entirely.

Filtering or skipping binary files can greatly reduce index size and improve indexing speed. Indexing binary files completely can have a large effect on index size because binary files do not contain a normal mix of words. If treated as text, they produce a large number of unique, random text sequences, which then bloat the word list portion of the index disproportionately.

Filtering is also more effective at extracting text than simply indexing binary files entirely. This is because text may be present in a variety of formats (for example, blocks of Unicode text are often mixed with blocks of single-byte text), and the filtering algorithm can identify and decode these segments.

Exclude filters are another good way to minimize the number of binary files included in an index.

dtSearch Desktop: Options > Preferences > Filtering Options

dtSearch Developer API: Set the Options.BinaryFiles flag.
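The difference between filtering and indexing a binary file completely can be illustrated with a small text extractor. The sketch below is not dtSearch's filtering algorithm. It is a simplified stand-in, similar in spirit to the Unix strings utility, that keeps only runs of at least four printable characters and recognizes both single-byte text and UTF-16LE text. The pattern, the minimum run length, and the sample bytes are all assumptions chosen for illustration.

```python
import re

TEXT_RUN = re.compile(
    rb"(?P<wide>(?:[\x20-\x7e]\x00){4,})"  # UTF-16LE: printable char + zero byte
    rb"|(?P<narrow>[\x20-\x7e]{4,})"       # single-byte printable ASCII
)

def filter_text(data):
    """Return the text fragments found in binary data, in order of appearance."""
    fragments = []
    for match in TEXT_RUN.finditer(data):
        if match.group("wide") is not None:
            fragments.append(match.group("wide").decode("utf-16-le"))
        else:
            fragments.append(match.group("narrow").decode("ascii"))
    return fragments

# A made-up binary blob mixing single-byte text, Unicode text, and raw bytes.
blob = (b"\x00\x01report.doc\xff\xfe"
        + "Quarterly".encode("utf-16-le")
        + b"\x00\x9c\x03")
print(filter_text(blob))  # ['report.doc', 'Quarterly']
```

Only the two recognizable fragments would reach the index. The surrounding bytes, which would otherwise turn into unique random "words", are dropped.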

Title Size

If the documents being indexed are small, the per-document overhead in an index may be a relatively large portion of the total index size. For each document, dtSearch stores the filename and location, the modification date, size, and other properties, as well as a "title" which is usually the first 80 characters of text from the file. To reduce the size of the index, the title size can be changed to a smaller value. Additionally, reductions in the size of the filenames in the index, including the folder name, will save space.

See "How to change the size of the 'title' field" for information on changing this setting.
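For a sense of scale, the potential saving from a shorter title can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that each title character costs one byte in the index and that every document fills its title completely. Neither assumption is confirmed by this article, so treat the result as an upper-bound guess rather than a measurement.

```python
def title_bytes_saved(document_count, old_title_chars=80, new_title_chars=20,
                      bytes_per_char=1):
    """Estimate bytes saved by shortening the stored title of every document.

    bytes_per_char=1 is an assumption, not a documented dtSearch figure.
    """
    return document_count * (old_title_chars - new_title_chars) * bytes_per_char

saved = title_bytes_saved(10_000_000)
print(f"{saved / 1_000_000:.0f} MB saved across 10 million documents")
# prints: 600 MB saved across 10 million documents
```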

Noise Words

A noise word list can reduce the size of an index by eliminating common words like "the" or "if". By default, dtSearch will index documents using a noise word list for the English language.

dtSearch Desktop: Options > Preferences > Letters and Words

dtSearch Developer API: Set Options.NoiseWordFile to the name of the noise word list to use.
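The effect of a noise word list can be shown with a toy tokenizer. The word list and sentence below are hypothetical samples, not dtSearch's default English list, and the whitespace split is far simpler than a real indexer's word breaking.

```python
# Hypothetical sample list; the real default noise word list is much longer.
NOISE_WORDS = {"the", "if", "a", "of", "and"}

def indexable_words(text, noise_words=NOISE_WORDS):
    """Return the words that would be indexed after dropping noise words."""
    return [word for word in text.lower().split() if word not in noise_words]

text = "If the index is small the search of the index is fast"
kept = indexable_words(text)
print(len(text.split()), "words in,", len(kept), "words indexed")
# prints: 12 words in, 7 words indexed
```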

Numbers

dtSearch has an indexing option to skip indexing numbers. If the documents being indexed contain many numbers, and if these numbers do not have to be searchable, this setting can reduce index size considerably.

dtSearch Desktop: Options > Preferences > Indexing Options > "Index numbers"

dtSearch Developer API: Set Options.IndexNumbers to zero to disable indexing of numbers.

Numeric Values

By default, dtSearch indexes numbers both as text and as numeric values, which is necessary for numeric range searching. If you do not need to search for numbers as numeric ranges, you can disable indexing numeric values. Numbers will still be searchable as text if indexing of numbers is enabled (see above). This setting can reduce the size of your indexes by about 20%.

dtSearch Desktop: Options > Preferences > Indexing Options > "Enable numeric range searching"

dtSearch Developer API: Set the flag dtsoTfSkipNumericValues in Options.TextFlags to disable indexing of numeric values.
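The two number-related settings can be pictured as switches on what gets recorded for each word. The sketch below is a conceptual model only. The entry format, function name, and parameter names are invented for illustration, and it assumes that skipping numbers entirely also skips their numeric values.

```python
def index_entries(words, index_numbers=True, index_numeric_values=True):
    """Model the entries recorded for a list of words under each setting."""
    entries = []
    for word in words:
        is_number = word.replace(".", "", 1).isdigit()
        if is_number and not index_numbers:
            continue                                 # number not indexed at all
        entries.append(("text", word))               # searchable as text
        if is_number and index_numeric_values:
            entries.append(("value", float(word)))   # needed for range searches
    return entries

words = "invoice 1042 total 99.50 due 30 days".split()
print(len(index_entries(words)))                              # 10
print(len(index_entries(words, index_numeric_values=False)))  # 7
print(len(index_entries(words, index_numbers=False)))         # 4
```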

Hyphens

Of the available hyphen-handling options, treating hyphens as spaces produces the smallest indexes.

See "Hyphenation options" for information on changing this setting.

What is the effect of index size on searching performance?

A dtSearch search essentially consists of two steps: (1) looking up the words in the search request, and (2) enumerating the documents that match that request.

The word lookup step is usually very quick and takes a small fraction of the total time required for the search. A wildcard at the beginning of a search term (for example, a search for "*abc") can be slow because dtSearch uses letters at the start of a word to implement fast word searches, and dtSearch cannot do this if the start of the search term is unknown.
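The reason a leading wildcard is slower can be illustrated with a sorted word list. The sketch below is not dtSearch's internal word list format. It assumes a plain sorted Python list and uses binary search as a stand-in for any structure that is organized by the first letters of each word.

```python
import bisect

WORDS = sorted(["abacus", "abc", "abcde", "cabc", "index", "xabc", "zebra"])

def prefix_search(words, prefix):
    """Find words starting with prefix by binary search on the sorted list."""
    start = bisect.bisect_left(words, prefix)
    end = bisect.bisect_left(words, prefix + "\uffff")  # just past the prefix range
    return words[start:end]

def suffix_search(words, suffix):
    """Find words ending with suffix; every word in the list must be examined."""
    return [word for word in words if word.endswith(suffix)]

print(prefix_search(WORDS, "abc"))  # ['abc', 'abcde']
print(suffix_search(WORDS, "abc"))  # ['abc', 'cabc', 'xabc']
```

The prefix search touches only a handful of entries no matter how long the list is, while the suffix search grows linearly with the number of unique words in the index.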

The time required for the second step, enumerating the documents, depends on the number of documents found rather than the size of the index. (For developers, using the dtsSearchDelayDocInfo flag can minimize the time required for this step, making searches that retrieve many files much faster. For more information, see "Optimizing search performance with the dtSearch Engine".)

What is the effect of index size on indexing performance?

The dtSearch indexer is designed to operate best when indexing large volumes of text at once. Therefore, it is preferable to index data in batches that are as large as possible. Indexing in small batches makes each update slower and also results in a much more fragmented index structure.

What is the effect of document type on index size?

The size of the index as a fraction of the original document size depends on how much text the document contains per kilobyte of data. The more text the document contains, the larger the index. For example, if you index a 20k Microsoft Word document and a 20k text file, the 20k text file will add much more data to the index than the Word document. A 20k Word document will consist mostly of formatting information, leaving only 10k or less of text, so it adds less than half as much data to the index as the text file.
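The comparison above comes down to how much of each file is actual text. A minimal sketch of that arithmetic follows. The text fractions are illustrative assumptions based on the example in the preceding paragraph, and the function name is hypothetical.

```python
def indexable_text_kb(file_kb, text_fraction):
    """Return the amount of text (in KB) a file contributes to the index."""
    return file_kb * text_fraction

word_doc = indexable_text_kb(20, 0.5)   # assumed: half of the file is formatting
text_file = indexable_text_kb(20, 1.0)  # a plain text file is all text
print(f"Word document adds {word_doc / text_file:.0%} as much text as the text file")
# prints: Word document adds 50% as much text as the text file
```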

In some cases, it is even possible for the index to be larger than the original documents. This can happen if the documents are in a compressed format, such as ZIP archives or PDF files (PDF files store text in a compressed stream).

Indexes of database files, such as MDB (Microsoft Access) or DBF (XBase) files, are also usually a large fraction of the original document size. dtSearch indexes each record of a database file as a separate document. As a result, the per-document overhead in the index becomes a much larger factor in the index size. Indexing databases as text, as described above, can eliminate this extra overhead.