Optimizing indexing of large document collections

Article: dts0206

For general information on dtSearch indexing, please see Indexing Overview.

See also: How to index Outlook and Exchange messages with dtSearch

General Indexing Strategy

• Index in larger batches.

• Use the index compress function after multiple index updates.

• For very large indexing jobs, index on multiple machines running simultaneously, and then merge the indexes.

• When merging, merge indexes into a new empty index, rather than merging into an index that already contains data.

• Do not require the indexer to “commit” index updates too often (dtSearch Engine users only).

Index in larger batches. The dtSearch indexer is optimized for indexing large volumes of text at once. Indexing in small batches makes each update relatively slower and fragments the index structure.

Use the compress function after multiple index updates. For optimal search speed, after many index updates, use the compress function to defragment the index.

Index on multiple machines running simultaneously. For very large indexing jobs, using multiple machines to simultaneously build indexes on different portions of a data collection is generally much faster than indexing on a single machine. Splitting up the indexing job is also a good strategy if disk space is insufficient to index all data at once. Multiple index updates can also run concurrently on the same machine, in separate processes or on multiple threads in the same process. For information on multithreaded use of the dtSearch Engine API and indexing using multiple threads, see Multithreaded operations.
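For example, here is a sketch (in Python, purely illustrative and independent of the dtSearch API) of splitting a collection across machines by file size, so that each machine indexes a similar volume of data:

```python
def partition_for_indexing(files_with_sizes, num_machines):
    """Assign (path, size) pairs to machines, largest first,
    always giving the next file to the least-loaded shard."""
    shards = [[] for _ in range(num_machines)]
    loads = [0] * num_machines
    for path, size in sorted(files_with_sizes, key=lambda f: -f[1]):
        target = loads.index(min(loads))   # least-loaded shard so far
        shards[target].append(path)
        loads[target] += size
    return shards

# Example: six documents split across three machines
docs = [("a.doc", 500), ("b.pdf", 300), ("c.msg", 300),
        ("d.txt", 200), ("e.xls", 100), ("f.csv", 100)]
shards = partition_for_indexing(docs, 3)
```

Each machine then indexes only its shard, and the resulting indexes are merged (into a new, empty index) or searched together.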

Merge multiple indexes into a new, empty index. After creation of multiple individual indexes, you can run searches across all indexes at once. (A single dtSearch query can search any number of indexes.) Or, for optimal index structure and search efficiency, merge the multiple indexes into a single index.

Merging indexes into a new, empty index—rather than merging into an index that already contains data—results in a substantially faster and more efficient merge process. Make sure, however, that the final index holds no more than about a terabyte of text.

“Commit” index updates infrequently. The dtSearch Engine API provides a setting, IndexJob.AutoCommitIntervalMB, that determines how often dtSearch must commit index updates. Higher values improve indexing performance. For best performance, set AutoCommitIntervalMB to a value greater than 64,000. Alternatively, you can set AutoCommitIntervalMB to zero, which requires dtSearch to commit only once at the end of an indexing job.

dtSearch Desktop: This setting is not currently available in dtSearch Desktop.
dtSearch Developer API:
Set AutoCommitIntervalMB to a value of either 0 or greater than 64,000.

Index and Document Location

• Use SSD storage for the index and, if possible, for the documents as well.

• Keep the indexes as close as possible to the machine where the indexer is executing—even if the data is remote.

• Avoid generating an index on an external drive (SAN, NAS, USB).

• Do not generate or store an index on a compressed or encrypted NTFS folder. Other more efficient forms of encryption such as BitLocker-encrypted drives affect indexing speed much less.

• If you are accessing data across potentially unreliable network connections (for example, crawling a large variety of web sites), download the data prior to indexing.

• Consider using the dtSearch document caching feature, particularly for web-based data that changes frequently or may become unavailable in the future, and for PST files.

Use SSD storage for the index. Generating the index requires a high volume of read/write activity to and from the index, and SSD storage is much faster than non-SSD drives. Document access speed is less important but can become significant if the multithreaded indexer is being used.

Keep the indexes close to the indexer. It is better to build the index on an internal drive on the machine where the indexer is running, rather than generating an index on a remote drive or external drive. This is more efficient even if the indexes must be copied to a network drive after they are created. If the indexes must be generated on a network drive, please see this article for suggestions to improve performance and minimize potential issues with network I/O errors: Using dtSearch with network storage devices.

While the index files should remain close to the indexing engine, it does not matter as much where the target data resides. Unlike the index building process, which requires a large amount of read/write activity, the indexer must read the target data just once, making the location of the target data far less critical.

Avoid generating an index on an external drive such as USB or network storage devices. Generating an index on external drives can cause a substantial reduction in indexing performance due to slower I/O. If you want to ultimately store an index on an external drive, build it on an internal drive and use a copy program such as Robocopy to copy it over to the external drive after completing the index.

If you do copy an index to an external drive, and if the documents are located on the same drive as the index, ensure that the documents move in tandem with the indexes so relative path references from the index to the documents remain valid in the new index location. Alternatively, you can use the dtSearch caching feature to accomplish a similar result (see discussion below).

If it is necessary to build an index on a network storage device, please see this article for additional information: Using dtSearch with network storage devices.

Do not use compressed or encrypted NTFS folders to store or build indexes. Compressed or encrypted NTFS folders impose a severe performance penalty on both indexing and searching.

Avoid accessing data through unreliable network connections. If a web-based or other network connection to the data is unreliable, download the data first. Downloading the data instead of using an unreliable network connection results in greater efficiency both in the initial indexing and in the display of retrieved data with highlighted hits.

Products such as WinHTTrack or Offline Explorer Pro can download web sites to local folders. The download approach also has the advantage that it separates the web site crawl from the indexing work. Separating these two tasks can result in efficiencies in the performance of both.

Consider the caching feature for use with rapidly changing or sporadically available web-based or other remote data. When dtSearch displays a retrieved document or web page, it refers back to the original document or web page to display highlighted hits using hit offset information in the index. If the document or web page has changed since the last update, then the hit highlights will not be in the correct place.

Since dtSearch can use the cached text to display hit highlights correctly, hit highlighting in cached pages will always be consistent, even if the original page has changed since the last index. The caching feature also ensures that dtSearch can display retrieved pages, even if the original pages are removed, offline, or otherwise inaccessible through an erratic connection.

PST files, even if local, can be time-consuming for dtSearch to open after a search, so if you are indexing a large amount of PST data, enabling caching can make reviewing retrieved files much faster.

dtSearch Desktop/Network: Specify the index location in the Create Index or Create Index (Advanced) dialog box. To enable caching, use the Create Index (Advanced) dialog box.

dtSearch Developer API: Specify the index location in IndexJob.IndexPath. To enable caching, set the caching flags in IndexJob.IndexingFlags.

Other Software

• Consider disabling on-access virus scanning of the folder containing the index.

• If possible, avoid indexing with IFilters, as some can result in speed and stability problems.

Consider disabling virus scanning of the folder containing the index. On-access virus scanners are generally smart enough not to impair indexing performance, but this depends on the specific product, so in some cases it may be beneficial to configure your antivirus software not to scan files in the folder containing your index. On-access scans of document folders have a much smaller effect on indexing performance and provide an important security benefit, so we do not recommend disabling on-access scans of document folders.

Avoid IFilters, if possible. While dtSearch supports using IFilters, they may be slower and less stable than dtSearch’s built-in file parsers. We recommend that you do not use IFilters for large indexing jobs unless for some reason a particular IFilter is absolutely necessary.

dtSearch Desktop: dtSearch does not use IFilters by default; IFilter integration is disabled unless you enable IFilter support in the Options > Preferences > File Types dialog box.
dtSearch Developer API:
The dtSearch Engine does not use IFilters by default. IFilter integration is controlled using Options.FileTypeTableFile, which specifies the location of a file type table file in the format generated by dtSearch Desktop (filetype.xml).

Indexing Resources

For small index updates (less than 20 GB of data), dtSearch can work efficiently with limited memory and with free disk space of at least 60% of the size of the data to be indexed. For larger index updates:

• The machine building the index should have 16 GB RAM or more.

• For multithreaded index updates, the machine should have at least 4 GB per indexing thread, and the number of indexing threads should be less than the number of CPU cores.

• Use the 64-bit version of the dtSearch indexer.

• The drive where the index will reside should have free space of at least 15% of the size of the original data, plus 16-32 GB for temporary workspace.

• Avoid running many indexers at the same time on the same computer.

• If possible, redirect temporary sort buffer creation to an internal drive other than the one containing the index.

Have 16 GB RAM or more available for larger indexing jobs. While you can limit the amount of memory the indexer will use for in-memory sort buffers, for best indexing performance, let the dtSearch indexer decide how much memory to use based on available system resources, rather than specifying a limit. The dtSearch indexer does this by default in all dtSearch products except the dtSearch Engine.

dtSearch Desktop: Click Options > Preferences > Indexing Resources to control the amount of memory dtSearch uses during indexing.
dtSearch Developer API:
Set IndexJob.MaxMemToUseMB.

Multithreaded index updates. Multithreaded index updates can be several times faster than single-threaded indexing, but the hardware requirements are correspondingly greater. The machine should have at least 4 GB per indexing thread, and the number of indexing threads should be less than the number of CPU cores. Additionally, read access to the original documents can become a bottleneck if the documents are on relatively slow storage or accessed across a potentially slow network connection.

Use the 64-bit version of the dtSearch indexer. The 64-bit version of the dtSearch indexer is faster. Also, it can use much more memory, which allows it to operate more efficiently for very large indexing operations. The 64-bit version of the dtSearch indexer is dtindexer64.exe, and is installed into the dtSearch\bin64 folder. For information on using the 64-bit version of the dtSearch Engine, please see How to use the 64-bit version of the dtSearch Engine.

Ensure sufficient disk space. The final index will be about 15% of the size of the original documents (for smaller indexes, the ratio will usually be higher). In addition, for large index updates, at least 16 GB, and preferably 32 GB, of disk space should be available for temporary workspace during indexing.
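The sizing rules above (index of roughly 15% of the data size, 16–32 GB of temporary workspace, at least 4 GB of RAM per indexing thread, fewer threads than CPU cores) combine into a quick back-of-the-envelope calculation. The Python sketch below is purely illustrative, and the sample inputs are hypothetical:

```python
def estimate_resources(data_gb, cpu_cores, ram_gb):
    """Estimate disk needs and a thread count using the guidelines above."""
    index_disk_gb = data_gb * 0.15           # final index ~15% of data size
    workspace_gb = 32                        # prefer 32 GB of temp workspace
    max_threads_by_ram = ram_gb // 4         # at least 4 GB RAM per thread
    max_threads = min(cpu_cores - 1, max_threads_by_ram)  # threads < cores
    return {
        "index_disk_gb": index_disk_gb + workspace_gb,
        "recommended_threads": max(1, max_threads),
    }

# Example: 2000 GB of documents, 8 cores, 32 GB RAM
est = estimate_resources(2000, cpu_cores=8, ram_gb=32)
```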

Avoid running many indexers at the same time on the same computer. Indexing uses system resources -- CPU, memory, and disk -- very heavily. As a result, whether running multiple indexers concurrently provides any performance benefit will depend on the specific capabilities of the hardware. Additionally, when indexes are located on a network share, such as a SAN, running multiple indexers at the same time can cause unpredictable spikes in network I/O (because all of the indexers may be writing to the index folders on the network at the same time), which can lead to network write errors and corrupt indexes.

Redirect temporary sort buffer files to a different internal drive. By default, dtSearch creates temporary sort buffers in the index folder. If you have multiple internal drives, you can redirect such files to a different internal drive from the one holding the index.

If you have to build an index on a network drive, redirect temporary files to a local folder to minimize network traffic. This location should have free space of 32 GB or more.

dtSearch Desktop: Use Options > Preferences > Indexing Resources > Temporary files to specify the folder to use for temporary file buffers.
dtSearch Developer API:
Set IndexJob.TempFileDir.

Efficient Text Processing

• Do not use case- and accent-sensitive indexing.

• Enable Unicode filtering for binary files.

• Index database files as text.

• Decide whether you want to enable automatic recognition of dates, email addresses, and credit card numbers.

• Disable numeric range searching, if your application does not require searches for numeric ranges.

• If possible, set the hyphenation option to treat hyphens as spaces.

• Use the noise word list to skip common words such as the.

It may seem that increasing the number of unique words in an index will increase the accuracy of a search. In many cases, however, it can reduce search accuracy by defining each text occurrence too narrowly. In addition, increasing the number of unique words can dramatically increase the index size and the index building time.

Avoid case- and accent-sensitive indexing. With case-sensitive indexing on, the indexer considers World, world, and WORLD to be completely different words. Storing each of these words separately increases the size of the index and makes indexing slower. Even more importantly, it also increases the chance that a user searching for world would miss World and WORLD.

Accordingly, we do not recommend using case- or accent-sensitive indexing except in highly unusual situations where case- or accent-sensitive searching is absolutely necessary. By default, dtSearch indexes are not case- or accent-sensitive.

Enable Unicode filtering for binary files. Another text-related indexing setting affecting both search accuracy and index efficiency is the dtSearch Unicode filter for binary files. Without this filter, massive amounts of useless random data will clog indexes of binary files, and the text indexing process may miss critical data that does not appear in consecutive form in the binary file. For more details on why filtering improves both efficiency and accuracy, see "Why filtering improves accuracy when searching forensic data" at the end of this article.

dtSearch Desktop: Click Options > Preferences > Filtering Options, and check the “Filter text” option under “Binary files” to enable filtering of binary files.
dtSearch Developer API:
Set Options.BinaryFiles = dtsoFilterBinaryUnicode.

Index database files as text

dtSearch normally indexes database files such as Microsoft Access or CSV with each row indexed as a separate document. As a result, if a CSV file has 500,000 rows, then that file will be indexed as 500,000 documents. If you do not need to be able to search for every row of a database file as a separate document, you can instead tell dtSearch to index databases as plain text. With this option selected, database files such as Microsoft Access and CSV files are indexed without treating each row as a separate document, and without including field attributes. All of the text, including field names, remains searchable, but database content is combined into a single plain text document, which makes indexing and searching faster.

dtSearch Desktop: Options > Preferences > Indexing Options, "Index databases as plain text"

dtSearch Developer API
Set the flag dtsoFfDatabasesAsText in Options.TextFlags
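To see the difference, here is a small illustration in standard-library Python (not dtSearch code) of how a CSV becomes one searchable "document" per row by default, but only a single plain-text document with the database-as-text option:

```python
import csv
import io

sample = "name,city\nAlice,Paris\nBob,Oslo\n"

# Default behavior: each data row is indexed as a separate document
rows = list(csv.reader(io.StringIO(sample)))
per_row_docs = len(rows) - 1     # the header row holds field names, not data

# "Index databases as plain text": the whole file is one document;
# all text, including the field names, remains searchable
plain_text_docs = 1
searchable_text = sample.replace(",", " ")
```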

Decide whether you want to enable automatic recognition of dates, email addresses, and credit card numbers. Automatic recognition of dates, email addresses, and credit card numbers will make indexing about 25% slower. However, it is a very powerful feature if you will need to search for these types of data. For more information on how these features work, please see:

Automatic recognition of dates, email addresses, and credit card numbers

Disable numeric range searching, if possible. Numeric range searching requires dtSearch to index each number twice, once in its text form and once in its numeric value form. This feature adds about 10-20% to the size of a typical index. Accordingly, if an application does not require numeric range searching, disabling indexing of numeric values will result in better indexing efficiency. Note that disabling numeric range searching continues to allow searching of numbers as text.

dtSearch Desktop: Click Options > Preferences > Indexing Options, and un-check the box to “Index numeric values” to disable numeric range searches.
dtSearch Developer API:
Set Options.TextFlags = dtsoTfSkipNumericValues.
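A toy inverted index (purely illustrative, not dtSearch's internal format) shows why the setting matters: with numeric values enabled, each number is stored both as text and under a numeric key, and the numeric entries are what make range queries possible:

```python
def build_postings(tokens, index_numeric_values=True):
    """Toy posting list: every token is stored as text; numbers are
    additionally stored under a numeric key to support range queries."""
    postings = {}
    for pos, tok in enumerate(tokens):
        postings.setdefault(("text", tok), []).append(pos)
        if index_numeric_values and tok.isdigit():
            postings.setdefault(("num", int(tok)), []).append(pos)
    return postings

tokens = "invoice 4502 total 1200".split()
with_nums = build_postings(tokens, index_numeric_values=True)
without = build_postings(tokens, index_numeric_values=False)

# A numeric range query only works against the numeric entries
in_range = [key[1] for key in with_nums if key[0] == "num" and 1000 <= key[1] <= 2000]
```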

Keep the default treatment of hyphens as spaces. Through alphabet customization, dtSearch can index hyphenated words in multiple permutations. (For example, dtSearch can index world-class as world class, worldclass and world-class to ensure retrieval no matter which of these variants a user types in.) Treating hyphens as spaces, however, results in more efficient indexing.

dtSearch Desktop: Treating hyphens as spaces is now the default. To change the hyphens setting, click Options > Preferences > Letters and Words.
dtSearch Developer API:
Set Options.Hyphens = dtsoHyphenAsSpace.
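A sketch of the permutations involved (illustrative Python, not dtSearch internals): indexing all variants of a hyphenated word doubles its entries compared with treating hyphens as spaces, which is why the hyphens-as-spaces setting is cheaper:

```python
def hyphen_variants(word, treat_as_space=False):
    """Return the index entries generated for a hyphenated word."""
    if "-" not in word:
        return [word]
    if treat_as_space:
        return word.split("-")          # "world-class" -> two plain words
    # Index every permutation so any variant a user types will match
    return word.split("-") + [word.replace("-", ""), word]

all_variants = hyphen_variants("world-class")
as_space = hyphen_variants("world-class", treat_as_space=True)
```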

Use the noise word list. A search engine can reduce index size and make searching faster by ignoring a few dozen words that are so common as to be, for purposes of searching, mere “noise.” For example, the, of, and for are all in the dtSearch noise word list. For information on non-English noise word lists, see https://www.dtsearch.co.uk/.

dtSearch Desktop: To edit the noise word list, click Options > Preferences > Letters and Words.
dtSearch Developer API:
Set Options.NoiseWordFile to the name of a text file to use as the noise word list before you create an index.
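Conceptually (a simplified sketch, not the engine's implementation), noise word removal just drops the common words before they reach the index:

```python
# A few entries in the spirit of the dtSearch noise word list
NOISE_WORDS = {"the", "of", "and", "for", "a", "to"}

def index_terms(text):
    """Lowercase, tokenize, and drop noise words before indexing."""
    return [w for w in text.lower().split() if w not in NOISE_WORDS]

terms = index_terms("The history of the index and the art of searching")
```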

Why filtering improves accuracy when searching forensic data

Binary files are files that dtSearch does not recognize as documents. Examples of binary files include executable programs, fragments of documents recovered through an “undelete” process, or blocks of unallocated or recovered data obtained through computer forensics. Content in these files may appear in a variety of formats, such as plain text, Unicode text, or fragments of .doc or .xls files. Many different fragments with different encodings may be present in the same binary file.

Indexing such a file as if it were a simple text file would miss most of the content. In contrast to a simple text scan, the dtSearch filtering algorithm scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm can detect sequences of text with different encodings or formats in the same file, so as to better extract text from recovered or corrupt data.

In forensic applications, when complete and accurate results are critical, investigators may be reluctant to enable a “filtering” feature out of concern that they will miss something, even if disabling filtering makes indexing slower. In reality, filtering improves completeness and accuracy, and without it investigators will probably miss much of the useful data in the files they are searching.

For example, this is a hex view of how some text from this article might appear in a fragment of a recovered Word document:

Offset 0 1 2 3 4 5 6 7 8 9 A B C D E F

00009C00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

00009C10 FF FF FF FF 73 65 63 72 65 74 31 FF FF FF FF FF ÿÿÿÿsecret1ÿÿÿÿÿ

00009C20 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 7F ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

00009C30 FF FF FF 7F EC 37 93 00 00 00 00 00 B2 00 00 00 ÿÿÿì7“.....²...

00009C40 00 00 FF FF FF 4A 6F 68 6E 53 6D 69 74 68 FF FF ..ÿÿÿJohnSmithÿÿ

00009C50 FF FF 00 00 00 00 00 00 28 00 4D 00 61 00 6E 00 ÿÿ......(.M.a.n.

00009C60 61 00 67 00 69 00 6E 00 67 00 20 00 61 00 6E 00 a.g.i.n.g. .a.n.

00009C70 64 00 20 00 53 00 65 00 61 00 72 00 63 00 68 00 d. .S.e.a.r.c.h.

00009C80 69 00 6E 00 67 00 20 00 54 00 65 00 72 00 61 00 i.n.g. .T.e.r.a.

00009C90 62 00 79 00 74 00 65 00 73 00 20 00 6F 00 66 00 b.y.t.e.s. .o.f.

00009CA0 20 00 54 00 65 00 78 00 74 00 00 00 00 00 00 00 .T.e.x.t.......

All of the useful text actually present is broken up or embedded in garbage data, effectively making it unsearchable. A naive, unfiltered attempt to index this data would find the following words:

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿsecret1ÿÿÿÿÿÿÿÿ, ì7, ÿÿÿJohnSmithÿÿÿÿ, M, a, n, a, …

The dtSearch filtering algorithm would analyze the data more intelligently, enabling it to

• extract the word secret1 embedded in a long sequence of non-text characters,

• extract and separate the names John and Smith, and

• recognize that the data starting at offset 9C58 looks like Unicode, enabling it to identify the words Managing, Search, etc.
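A simplified sketch of this kind of text-run detection (illustrative Python only; dtSearch's actual algorithm is considerably more sophisticated) scans the bytes for runs of printable ASCII and for runs of UTF-16LE text (a printable byte followed by a zero byte), keeping only runs long enough to be words:

```python
import re

def extract_text_runs(data, min_len=4):
    """Find likely text fragments in binary data: runs of printable
    ASCII, plus runs of UTF-16LE (printable byte followed by a zero byte)."""
    found = []
    # Plain ASCII runs of at least min_len printable characters
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        found.append(m.group().decode("ascii"))
    # UTF-16LE runs: printable byte, zero byte, repeated
    for m in re.finditer(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data):
        found.append(m.group().decode("utf-16-le"))
    return found

# A fragment resembling the hex dump above: text embedded in garbage bytes
blob = (b"\xff" * 4 + b"secret1" + b"\xff" * 5 +
        "Managing and Searching".encode("utf-16-le"))
runs = extract_text_runs(blob)
```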

The dtSearch filtering algorithm works by analyzing the patterns of characters in the data. It makes no attempt to analyze the meaning of the language present, so it works as well with Arabic or Russian text, for example, as with English.

Therefore, to retrieve as much as possible of the text present in fragments of recovered word processing files, spreadsheets, database data, and the like, enable the dtSearch filtering algorithm.

dtSearch Desktop: Click Options > Preferences > Filtering Options, and check the “Filter text” option under “Binary files” to enable filtering of binary files.
dtSearch Developer API:
Set Options.BinaryFiles = dtsoFilterBinaryUnicode.

Document Storage and the NTFS File System

• Distribute large numbers of files in a folder tree, so individual folders do not have more than a few thousand files.

• Disable Microsoft “8.3” short filename creation on NTFS partitions that contain a very large number of files.

• Use ZIP files to aggregate large numbers of files into a smaller number of archives.

Use a folder tree. In our experience and that of some of our customers, NTFS can become slow or unstable when storing very large numbers of files in a single folder. To avoid this problem, we recommend distributing documents in a folder tree, or aggregating documents into ZIP files, to reduce the number of files in individual NTFS folders.
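As an illustration (standard-library Python, with hypothetical paths), hashing each filename into a two-level folder tree bounds the number of files that can accumulate in any single folder:

```python
import hashlib
from pathlib import PurePath

def tree_path(root, filename):
    """Map a filename into a two-level folder tree (256 x 256 buckets),
    so even millions of files leave each folder with only a few thousand."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return PurePath(root, digest[:2], digest[2:4], filename)

# Hypothetical document name under a hypothetical root folder
p = tree_path("docs", "report-2023.pdf")
```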

Disable “8.3” short filenames. Changing a file system setting to disable creation of short “8.3” filenames can improve NTFS performance with large numbers of files in a single folder. For information on disabling short path names in Windows, please see https://support.microsoft.com/kb/121007. While making this change can improve NTFS performance, very old programs that rely on 8.3 filenames will not be able to access the data in these partitions.

Use ZIP archives. Aggregating documents into ZIP archives greatly reduces the number of files that NTFS must manage and will also reduce storage requirements. dtSearch can automatically index, search and display documents inside ZIP archives, and the effect on indexing speed is generally minor.
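A minimal sketch with Python's standard zipfile module (the document names here are hypothetical): many small documents become a single archive file for NTFS to manage, while the contents remain individually readable:

```python
import io
import zipfile

def aggregate_to_zip(documents):
    """Pack {name: text} documents into a single in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in documents.items():
            zf.writestr(name, text)
    return buf.getvalue()

docs = {"a.txt": "first document", "b.txt": "second document"}
archive = aggregate_to_zip(docs)

# The individual documents are still accessible inside the archive
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```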