dtSearch Text Retrieval Engine Programmer's Reference
Noise Words

The NoiseWordFile option setting is the name of a file with a list of words to skip when indexing documents.

The noise word list is a file containing a list of words, one per line, that dtSearch will ignore when indexing and searching. These are typically words such as "the" and "because" that are too common to be useful in search requests. 

If the noise word list includes non-English text, it should be saved in Unicode text format with a byte order mark (BOM), so that its encoding is not ambiguous.
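
As an illustration of saving a list with a byte order mark, the following Python sketch writes one word per line. It is not part of the dtSearch API. It assumes that "Unicode text format" means UTF-16, as in common Windows editors; the function names and the file name noise_example.dat are hypothetical.

```python
import codecs
import os
import tempfile


def write_noise_list(path, words):
    """Write one word per line as UTF-16 text with a leading byte order mark."""
    # Assumption: "Unicode text format" is taken to mean UTF-16.
    # Python's "utf-16" codec writes a BOM at the start of the file.
    with open(path, "w", encoding="utf-16", newline="\r\n") as f:
        for word in words:
            f.write(word + "\n")


def has_utf16_bom(path):
    """Return True if the file starts with a UTF-16 byte order mark."""
    with open(path, "rb") as f:
        head = f.read(2)
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


if __name__ == "__main__":
    # Hypothetical file name, written to a temporary folder.
    target = os.path.join(tempfile.mkdtemp(), "noise_example.dat")
    write_noise_list(target, ["the", "because", "über"])
    print(has_utf16_bom(target))
```

If UTF-8 is preferred, Python's "utf-8-sig" encoding also writes a byte order mark.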

The words in the noise word list (for example, noise.dat) do not have to be in any particular order, and can include wildcard characters such as * and ?. However, noise words may not begin with a wildcard character.
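
The rules above can be sketched in Python as follows. This is only an illustration, not the dtSearch implementation. It assumes that * matches any number of characters, that ? matches exactly one character, and that matching ignores case. The function names are hypothetical.

```python
import re


def compile_noise_patterns(lines):
    """Turn noise word entries (one per line) into compiled patterns."""
    patterns = []
    for line in lines:
        word = line.strip().lower()
        if not word:
            continue  # ignore blank lines
        if word[0] in "*?":
            # Noise words may not begin with a wildcard character.
            raise ValueError("noise word begins with a wildcard: " + word)
        # Assumption: * matches any run of characters, ? matches one character.
        regex = "".join(
            ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
            for ch in word
        )
        patterns.append(re.compile(regex))
    return patterns


def is_noise(word, patterns):
    """Return True if the whole word matches any noise word entry."""
    lowered = word.lower()
    return any(p.fullmatch(lowered) for p in patterns)


if __name__ == "__main__":
    # The order of entries does not matter.
    patterns = compile_noise_patterns(["becaus*", "the", "a?"])
    print(is_noise("because", patterns), is_noise("car", patterns))
```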

Effect on indexing

The dtSearch indexer will not index any words listed in the noise word list. However, noise words do affect word counting, so if "The Statue of Liberty" is indexed, the word offset of "statue" will be 2 and the word offset of "liberty" will be 4. 
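
The word counting described above can be sketched in Python as follows. This is a simplified illustration, not the dtSearch indexer: it splits words on whitespace only, and the function name is hypothetical.

```python
def index_positions(text, noise_words):
    """Map each indexed word to its 1-based word offsets in the text."""
    positions = {}
    for offset, word in enumerate(text.lower().split(), start=1):
        if word in noise_words:
            # A noise word is not indexed, but it still uses up an offset.
            continue
        positions.setdefault(word, []).append(offset)
    return positions


if __name__ == "__main__":
    print(index_positions("The Statue of Liberty", {"the", "of"}))
    # prints {'statue': [2], 'liberty': [4]}
```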

When an index is created, a private copy of the noise word list is stored in the index, in the index_n.ix file. Therefore, changes to the noise word list will not affect existing indexes. The index_n.ix file is a text file, so you can view it in a text editor, but it is only a reference copy and should not be edited.

Effect on searches

When a search request includes noise words, dtSearch will skip them and process the rest of the request as if each noise word matched every word in the index. For example, a search for "the car" is processed as a search for "* car".
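
The substitution in this example can be sketched in Python as follows. This only illustrates the idea and is not how dtSearch parses search requests: it handles a plain phrase with no operators, and the function name is hypothetical.

```python
def rewrite_request(request, noise_words):
    """Replace each noise word in a simple phrase with the * wildcard."""
    return " ".join(
        "*" if word.lower() in noise_words else word
        for word in request.split()
    )


if __name__ == "__main__":
    print(rewrite_request("the car", {"the", "of"}))
    # prints * car
```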

The noise word list has no effect on unindexed searches.