dtSearch Text Retrieval Engine Programmer's Reference
Hyphens

The Options.Hyphens setting is a HyphenSettings value that determines how hyphen characters are indexed.

Remarks

The dtSearch Engine supports four options for the treatment of hyphens when indexing documents: treat hyphens as spaces, treat them as searchable text, ignore them, or "all three" (index each hyphenated word all three ways).

For most applications, treating hyphens as spaces is the best option. Hyphens are translated to spaces during indexing and during searches. For example, if you index "first-class mail" and search for "first class mail", "first-class-mail", or "first-class mail", each of these searches finds the phrase.

Value                 Meaning
dtsoHyphenAsIgnore    index "first-class" as "firstclass"
dtsoHyphenAsHyphen    index "first-class" as "first-class"
dtsoHyphenAsSpace     index "first-class" as "first" and "class"
dtsoHyphenAll         index "first-class" all three ways

Effect on Indexes 

When an index is created, the hyphenation option currently in effect is stored in the index, and cannot be changed without re-creating that index. Therefore, the hyphenation option you select affects any indexes you create in the future, but it does not affect indexes that already exist. 

When a user searches an index, the hyphenation option for that index applies to the user's search request. 

How the hyphens option applies during indexing 

During indexing, dtSearch extracts a stream of words from each document, and each word is assigned a number that represents that word's position in the file. The first word is assigned the position "1", the second word is assigned the position "2", and so forth. Consider a document that starts with the sentence, "I sent it by first-class mail". The following describes how the document would be treated under each of the hyphenation options:

  1. Hyphens treated as spaces:
     I (1), sent (2), it (3), by (4), first (5), class (6), mail (7)
  2. Hyphens treated as searchable characters:
     I (1), sent (2), it (3), by (4), first-class (5), mail (6)
  3. Hyphens ignored:
     I (1), sent (2), it (3), by (4), firstclass (5), mail (6)
  4. All three:
     I (1), sent (2), it (3), by (4), first-class (5), first-class (6), first (5), class (6), firstclass (5), firstclass (6), mail (7)
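
The position assignment described above can be modeled with a short Python sketch. This is an illustrative model only, not dtSearch code or the dtSearch API: the function name index_positions and the option strings "space", "hyphen", "ignore", and "all" are hypothetical, words are split on whitespace only, and the "all" branch covers just the documented case of two words joined by one hyphen.

```python
def index_positions(text, option):
    """Model the (word, position) stream for one hyphen option.

    Hypothetical option names: 'space', 'hyphen', 'ignore', 'all'.
    """
    if option not in ("space", "hyphen", "ignore", "all"):
        raise ValueError(f"unknown option: {option!r}")

    entries = []
    pos = 0
    for token in text.split():
        parts = token.split("-")
        if len(parts) == 1 or option == "hyphen":
            # No hyphen, or hyphens are searchable: one word, one position.
            pos += 1
            entries.append((token, pos))
        elif option == "ignore":
            # Hyphens dropped: the parts are joined into one word.
            pos += 1
            entries.append(("".join(parts), pos))
        elif option == "space":
            # Hyphens become spaces: each part takes its own position.
            for part in parts:
                pos += 1
                entries.append((part, pos))
        else:  # option == "all"
            if len(parts) != 2:
                raise ValueError("sketch only models two-part hyphenated words")
            first, second = pos + 1, pos + 2
            joined = "".join(parts)
            # Six entries per hyphenated pair, spread over two positions.
            entries.extend([
                (token, first), (token, second),
                (parts[0], first), (parts[1], second),
                (joined, first), (joined, second),
            ])
            pos = second
    return entries


if __name__ == "__main__":
    sentence = "I sent it by first-class mail"
    for option in ("space", "hyphen", "ignore", "all"):
        stream = ", ".join(f"{w} ({p})" for w, p in index_positions(sentence, option))
        print(f"{option}: {stream}")
```

Running the sketch on the sample sentence prints one stream per option in the same "word (position)" form used in the list above, which makes it easy to see that the "all" model produces six entries for the single hyphenated pair.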

How the hyphens option applies during searching 

During a search, dtSearch translates the search request according to the hyphenation option for the index being searched. For example, if you search for "first-class" in an index created with hyphens treated as spaces, the search request is translated into "first class". 

During a search of an index created with the "all three" option, the search request is not modified. For example, if you search for "first-class", dtSearch will not search for "firstclass" or "first class". 
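
As a rough illustration of the translation step, the following Python sketch rewrites a search request according to the option stored in the index. It is a simplified model, not dtSearch code: translate_query and the option strings are hypothetical names. The "space" and "all" behavior follows the text above; the "ignore" and "hyphen" branches are assumptions inferred from how those options index words.

```python
def translate_query(query, option):
    """Model how a search request is rewritten for an index's hyphen option.

    Hypothetical option names: 'space', 'hyphen', 'ignore', 'all'.
    """
    if option == "space":
        # Documented: "first-class" becomes "first class".
        return query.replace("-", " ")
    if option == "ignore":
        # Assumption: mirrors indexing, so "first-class" becomes "firstclass".
        return query.replace("-", "")
    if option in ("hyphen", "all"):
        # Documented for "all": the request is not modified.
        # Assumption for "hyphen": hyphens are searchable, so nothing changes.
        return query
    raise ValueError(f"unknown option: {option!r}")


if __name__ == "__main__":
    for option in ("space", "hyphen", "ignore", "all"):
        print(option, "->", repr(translate_query("first-class", option)))
```

Under this model, "first-class mail", "first-class-mail", and "first class mail" all translate to the same request in a hyphens-as-spaces index, which is why each of them finds the indexed phrase.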

Effects of the "all three" option 

The "all three" option has one advantage over treating hyphens as spaces: it will return a document containing "first-class" in a search for "firstclass". Otherwise, it provides no benefit over treating hyphens as spaces, and it has some significant disadvantages:

  1. The "all three" option generates many extra words during indexing. For each pair of words separated by a hyphen, six words are generated in the index.
  2. It can produce unexpected results in searches involving longer phrases or words with multiple hyphens. With the "all three" option enabled, the sequence "x a-b-c" would be indexed as: x (1), a (2), ab (2), a-b (2), ab (3), a-b (3), b (3), b-c (3), bc (3), b-c (4), bc (4), c (4), a-b-c (4). Thus, "a b c" would be found, as would "a bc" or "ab c". However, a search for "x a-b-c" would not match, because x is in word position 1 and a-b-c is in word position 4.
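
The phrase results in item 2 can be checked with a small Python sketch. The (word, position) entries are copied from the example above; everything else is an assumption for illustration, not dtSearch code: build_positions and phrase_matches are hypothetical helpers, and a phrase is taken to match only when its words occupy consecutive positions.

```python
from collections import defaultdict

# Entries for "x a-b-c" under the "all three" option, as listed in item 2.
ENTRIES = [
    ("x", 1), ("a", 2), ("ab", 2), ("a-b", 2), ("ab", 3), ("a-b", 3),
    ("b", 3), ("b-c", 3), ("bc", 3), ("b-c", 4), ("bc", 4), ("c", 4),
    ("a-b-c", 4),
]


def build_positions(entries):
    """Map each word to the set of positions where it was indexed."""
    positions = defaultdict(set)
    for word, pos in entries:
        positions[word].add(pos)
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase's words appear at consecutive positions.

    The phrase is split on whitespace only; hyphens are left untouched,
    as in a search of an "all three" index.
    """
    words = phrase.split()
    if not words:
        return False
    return any(
        all(start + i in positions.get(word, ()) for i, word in enumerate(words))
        for start in positions.get(words[0], ())
    )


if __name__ == "__main__":
    positions = build_positions(ENTRIES)
    for phrase in ("a b c", "a bc", "ab c", "x a-b-c"):
        print(f"{phrase!r}: {phrase_matches(positions, phrase)}")
```

In this model the first three phrases match and "x a-b-c" does not, because the only entry for a-b-c is at position 4 while a word following x would need to be at position 2.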
Copyright (c) 1995-2008 dtSearch Corp. All rights reserved.