Hyphenation options

Last Reviewed: April 10, 2019

Article: DTS0154

 

Applies to: dtSearch (all versions)

The dtSearch Engine supports four options for the treatment of hyphens when indexing documents: spaces, searchable text, ignored, and "all three".

For most applications, treating hyphens as spaces is the best option. Hyphens are translated to spaces during indexing and during searches. For example, if you index "first-class mail" and search for "first class mail", "first-class-mail", or "first-class mail", you will find the phrase correctly.

Effect on Indexes

When an index is created, the hyphenation option currently in effect is stored in the index, and cannot be changed without re-creating that index. Therefore, the hyphenation option you select affects any indexes you create in the future, but it does not affect indexes that already exist.

When a user searches an index, the hyphenation option for that index applies to the user's search request.

How the options apply during indexing

During indexing, dtSearch extracts a stream of words from each document, and each word is assigned a number that represents that word's position in the file. The first word is assigned the position "1", the second word is assigned the position "2", and so forth. Consider a document that starts with the sentence, "I sent it by first-class mail". The following describes how the document would be treated under each of the hyphenation options:

1.   Hyphens treated as spaces:
I (1), sent (2), it (3), by (4), first (5), class (6), mail (7)

2.   Hyphens treated as searchable characters:
I (1), sent (2), it (3), by (4), first-class (5), mail (6)

3.   Hyphens ignored:
I (1), sent (2), it (3), by (4), firstclass (5), mail (6)

4.   All three:
I (1), sent (2), it (3), by (4), first-class (5), first-class (6), first (5), class (6), firstclass (5), firstclass (6), mail (7)
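
The four listings above can be reproduced with a short Python sketch. It is an illustration only, not dtSearch's implementation: the function name index_words, the option names, and the whitespace tokenization are assumptions made for this example, and the "all three" rule is inferred from the examples in this article.

```python
def index_words(text, option):
    """Return (word, position) pairs for one hyphenation option.

    Hypothetical helper: option is one of "spaces", "searchable",
    "ignored", or "all_three" (names chosen for this sketch).
    """
    entries = []
    pos = 0  # position of the last word emitted so far
    for token in text.split():
        parts = [p for p in token.split("-") if p]
        if not parts:
            continue
        if len(parts) == 1:
            pos += 1
            entries.append((parts[0], pos))
        elif option == "spaces":
            # Each part becomes its own word with its own position.
            for part in parts:
                pos += 1
                entries.append((part, pos))
        elif option == "searchable":
            # The hyphenated token is kept as a single word.
            pos += 1
            entries.append(("-".join(parts), pos))
        elif option == "ignored":
            # The hyphens are dropped and the parts run together.
            pos += 1
            entries.append(("".join(parts), pos))
        elif option == "all_three":
            first = pos + 1
            # The individual parts, as in the "spaces" option.
            for offset, part in enumerate(parts):
                entries.append((part, first + offset))
            # Each adjacent pair, hyphenated and joined, at both positions
            # (assumption: only one hyphen is permuted at a time).
            for i in range(len(parts) - 1):
                for joined in (parts[i] + "-" + parts[i + 1],
                               parts[i] + parts[i + 1]):
                    entries.append((joined, first + i))
                    entries.append((joined, first + i + 1))
            pos = first + len(parts) - 1
            # With more than one hyphen, the whole token is also stored
            # at the last position (inferred from the "a-b-c" example).
            if len(parts) > 2:
                entries.append(("-".join(parts), pos))
        else:
            raise ValueError("unknown option: " + option)
    return entries


for opt in ("spaces", "searchable", "ignored", "all_three"):
    print(opt, index_words("I sent it by first-class mail", opt))
```

Under the "all_three" option the sketch emits six entries for "first-class", which is the source of the index growth that option causes.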

How the options apply during searching

During a search, dtSearch translates the search request according to the hyphenation option for the index being searched. For example, if you search for "first-class" in an index created with hyphens treated as spaces, the search request is translated into "first class".

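During a search of an index created with the "all three" option, the treatment of hyphens in the search request depends on the dtSearch version used to search and on whether hyphens are treated as significant at search time, as described under "Treating hyphens as significant" below.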

Effects of the "all three" option

The "all three" option has one advantage over treating hyphens as spaces: it will return a document containing "first-class" in a search for "firstclass". Otherwise, it provides no benefit over treating hyphens as spaces, and it has some significant disadvantages:

1.   The "all three" option generates many extra words during indexing. For each pair of words separated by a hyphen, six words are generated in the index.

2.   If hyphens are treated as significant at search time, the "all three" option can produce unexpected results in searches involving longer phrases or words with multiple hyphens.

For example, with the "all three" option enabled, the sequence "a-b-c" would be indexed as: a (1), ab (1), a-b (1), ab (2), a-b (2), b (2), b-c (2), bc (2), b-c (3), bc (3), c (3), a-b-c (3). Thus, "a b c" would be found, as would "a bc" or "ab c", but not "a-bc" or "ab-c". (To prevent the number of permutations from becoming excessive, dtSearch only permutes one hyphen at a time.) A search for "x a-b-c" would not match a document containing "x a-b-c", because x is in word position 1 and a-b-c is in word position 4.
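
The matches and misses described above can be checked with a small Python sketch. The index entries are copied from this article. The matching rule (each word of the phrase must sit at consecutive positions, with hyphens treated as significant so a hyphenated term is looked up as one word) is a simplifying assumption for illustration, not dtSearch's documented algorithm. The names build_postings and phrase_matches are hypothetical.

```python
from collections import defaultdict

# (word, position) entries for "a-b-c" as listed in this article.
ENTRIES = [
    ("a", 1), ("ab", 1), ("a-b", 1),
    ("ab", 2), ("a-b", 2), ("b", 2), ("b-c", 2), ("bc", 2),
    ("b-c", 3), ("bc", 3), ("c", 3), ("a-b-c", 3),
]


def build_postings(entries):
    """Map each word to the set of positions where it was indexed."""
    postings = defaultdict(set)
    for word, position in entries:
        postings[word].add(position)
    return postings


def phrase_matches(postings, phrase):
    """True if the words of the phrase occur at consecutive positions.

    Assumption: hyphens are significant, so the phrase is split on
    whitespace only and "a-bc" is looked up as a single word.
    """
    words = phrase.split()
    if not words:
        return False
    return any(
        all(start + i in postings.get(word, ()) for i, word in enumerate(words))
        for start in postings.get(words[0], ())
    )


doc = build_postings(ENTRIES)
for phrase in ("a b c", "a bc", "ab c", "a-bc", "ab-c"):
    print(phrase, phrase_matches(doc, phrase))

# A document "x a-b-c": x takes position 1, so every entry shifts by one
# and the full token "a-b-c" lands at position 4.
doc_x = build_postings([("x", 1)] + [(w, p + 1) for w, p in ENTRIES])
print("x a-b-c", phrase_matches(doc_x, "x a-b-c"))
```

In this model the phrase "x a-b-c" fails because "a-b-c" is stored only at the last position of the hyphenated sequence, not at the position immediately after "x".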

Treating hyphens as significant

At search time, dtSearch can treat hyphens in an "all three" index either as spaces or as significant. If hyphens are treated as significant, a search for "abc-def" will not find "abc def", which provides a way to prevent false positives in a search for a hyphenated expression. Treating hyphens as significant is also needed in cases where hyphens must not cause a word break. For example, if hyphens are treated as spaces, a search for the range "abc-001~~abc-099", which should match abc-032, would become a search for "abc 001~~abc 099", effectively breaking the range expression.
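
The effect on a search request can be sketched in a few lines of Python. This is a simplified model, not the dtSearch API: the function search_terms and its hyphens_significant parameter are hypothetical, and the request is split on whitespace only.

```python
def search_terms(request, hyphens_significant):
    """Split a search request into terms.

    Hypothetical model: when hyphens are not significant they are
    replaced by spaces before the request is split into terms.
    """
    if not hyphens_significant:
        request = request.replace("-", " ")
    return request.split()


# Hyphens significant: the hyphenated expression stays one term.
print(search_terms("abc-def", True))
# Hyphens as spaces: the request becomes two terms, so it can also
# match a document containing "abc def".
print(search_terms("abc-def", False))

# The range expression survives only when hyphens are significant.
print(search_terms("abc-001~~abc-099", True))
print(search_terms("abc-001~~abc-099", False))
```

In the last call the range operator ends up in the middle term ("001~~abc"), which shows how the word break destroys the original range.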

In dtSearch 7.94 and later, hyphens are treated as spaces by default. To treat them as significant in dtSearch Desktop, check the option to treat hyphens as significant in Options > Preferences > Letters and Words. In the API, set the flag dtsSearchHyphenSignificant in SearchJob.