Hyphenation options

Last Reviewed: August 4, 2013

Article: DTS0154

 

Applies to: dtSearch (all versions)

The dtSearch Engine supports four options for the treatment of hyphens when indexing documents: treat hyphens as spaces, treat them as searchable text, ignore them, or apply "all three" treatments at once.

For most applications, treating hyphens as spaces is the best option. Hyphens are translated to spaces during indexing and during searches. For example, if you index "first-class mail" and search for "first class mail", "first-class-mail", or "first-class mail", the phrase will be found in every case, because each request is translated to "first class mail" before the search is run.

Effect on Indexes

When an index is created, the hyphenation option currently in effect is stored in the index, and cannot be changed without re-creating that index. Therefore, the hyphenation option you select affects any indexes you create in the future, but it does not affect indexes that already exist.

When a user searches an index, the hyphenation option for that index applies to the user's search request.

How the options apply during indexing

During indexing, dtSearch extracts a stream of words from each document, and each word is assigned a number that represents that word's position in the file. The first word is assigned the position "1", the second word is assigned the position "2", and so forth. Consider a document that starts with the sentence, "I sent it by first-class mail". The following describes how the document would be treated under each of the hyphenation options:

1.   Hyphens treated as spaces:
I (1), sent (2), it (3), by (4), first (5), class (6), mail (7)

2.   Hyphens treated as searchable characters:
I (1), sent (2), it (3), by (4), first-class (5), mail (6)

3.   Hyphens ignored:
I (1), sent (2), it (3), by (4), firstclass (5), mail (6)

4.   All three:
I (1), sent (2), it (3), by (4), first-class (5), first-class (6), first (5), class (6), firstclass (5), firstclass (6), mail (7)
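
The four listings above can be reproduced with a small Python sketch. It is only an illustrative model of the behavior described in this article, not dtSearch's implementation or API: the function name index_words and the option names "spaces", "searchable", "ignored" and "all_three" are hypothetical, and words are assumed to be separated by whitespace only.

```python
def index_words(text, option):
    """Return (word, position) pairs for text under one hyphenation option.

    Illustrative model only. The option names are hypothetical labels
    for the four treatments described in the article.
    """
    entries = []
    position = 1
    for token in text.split():
        parts = [p for p in token.split("-") if p]
        if not parts:
            continue  # token consisted only of hyphens
        if option == "spaces":
            # Each hyphen-separated part is a separate word.
            for offset, part in enumerate(parts):
                entries.append((part, position + offset))
            position += len(parts)
        elif option == "searchable":
            # The hyphen stays in the word.
            entries.append((token, position))
            position += 1
        elif option == "ignored":
            # The hyphen is dropped and the parts are joined.
            entries.append(("".join(parts), position))
            position += 1
        elif option == "all_three":
            # The separate parts, as in the "spaces" treatment...
            for offset, part in enumerate(parts):
                entries.append((part, position + offset))
            # ...plus, for each adjacent pair, the hyphenated and joined
            # forms at both positions (one hyphen permuted at a time).
            for offset in range(len(parts) - 1):
                left, right = parts[offset], parts[offset + 1]
                for variant in (left + "-" + right, left + right):
                    entries.append((variant, position + offset))
                    entries.append((variant, position + offset + 1))
            position += len(parts)
        else:
            raise ValueError("unknown option: " + option)
    return entries


sentence = "I sent it by first-class mail"
for name in ("spaces", "searchable", "ignored", "all_three"):
    print(name, index_words(sentence, name))
```

Under "all_three" the sketch emits the same eleven entries as listing 4, although not in the same order.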

How the options apply during searching

During a search, dtSearch translates the search request according to the hyphenation option for the index being searched. For example, if you search for "first-class" in an index created with hyphens treated as spaces, the search request is translated into "first class".

During a search of an index created with the "all three" option, the search request is not modified. For example, if you search for "first-class", dtSearch will not search for "firstclass" or "first class".
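
As a rough sketch of this translation step, the Python function below rewrites a search request according to an index's hyphenation option. The function and option names are hypothetical. The article states only the "spaces" and "all three" behavior explicitly, so the handling of the "ignored" and "searchable" options here is an assumption inferred from how those options treat hyphens during indexing.

```python
def translate_query(query, option):
    """Rewrite a search request for an index built with the given option.

    Illustrative model only, not the dtSearch API.
    """
    if option == "spaces":
        # Stated in the article: "first-class" becomes "first class".
        return " ".join(query.replace("-", " ").split())
    if option == "ignored":
        # Assumption: hyphens are dropped, mirroring indexing.
        return query.replace("-", "")
    if option == "searchable":
        # Assumption: the hyphen is an ordinary character, so nothing changes.
        return query
    if option == "all_three":
        # Stated in the article: the request is not modified.
        return query
    raise ValueError("unknown option: " + option)


print(translate_query("first-class", "spaces"))     # first class
print(translate_query("first-class", "all_three"))  # first-class
```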

Effects of the "all three" option

The "all three" option has one advantage over treating hyphens as spaces: it will return a document containing "first-class" in a search for "firstclass". Otherwise, it provides no benefit over treating hyphens as spaces, and it has some significant disadvantages:

1.   The "all three" option generates many extra words during indexing. For each pair of words separated by a hyphen, six words are generated in the index.

2.   It can produce unexpected results in searches involving longer phrases or words with multiple hyphens. With the "all three" option enabled, the sequence "a-b-c" would be indexed as: a (1), ab (1), a-b (1), ab (2), a-b (2), b (2), b-c (2), bc (2), b-c (3), bc (3), c (3). Thus, "a b c" would be found, as would "a bc" or "ab c", but not "a-b-c", "a-bc", or "ab-c". (To prevent the number of permutations from becoming excessive, dtSearch only permutes one hyphen at a time.)
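
The "a-b-c" results can be checked with a short Python sketch. It takes the eleven index entries listed above as given and assumes a simple phrase search that requires the words of the request to occur at consecutive positions. The names ENTRIES, build_postings and phrase_found are hypothetical, and this is a model of the behavior described here rather than dtSearch's actual search code. Because requests are not modified under "all three", a hyphenated request such as "a-b-c" is looked up as a single word.

```python
from collections import defaultdict

# Index entries for "a-b-c" under "all three", copied from the list above.
ENTRIES = [
    ("a", 1), ("ab", 1), ("a-b", 1),
    ("ab", 2), ("a-b", 2), ("b", 2), ("b-c", 2), ("bc", 2),
    ("b-c", 3), ("bc", 3), ("c", 3),
]


def build_postings(entries):
    """Map each indexed word to the set of positions where it occurs."""
    postings = defaultdict(set)
    for word, position in entries:
        postings[word].add(position)
    return postings


def phrase_found(postings, phrase):
    """Return True if the words of phrase occur at consecutive positions."""
    words = phrase.split()
    if not words:
        return False
    return any(
        all(start + i in postings.get(word, ()) for i, word in enumerate(words))
        for start in postings.get(words[0], ())
    )


postings = build_postings(ENTRIES)
for request in ("a b c", "a bc", "ab c", "a-b-c", "a-bc", "ab-c"):
    print(request, phrase_found(postings, request))
```

The first three requests are found and the last three are not, matching the outcome described in item 2.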