dtSearch Text Retrieval Engine Programmer's Reference 7.86
Alphabet Settings

The alphabet file determines which characters are treated as text, which cause a word break, and which are ignored.

Remarks
Character categories

dtSearch classifies every character into one of four categories:

Category | Meaning
letter | A searchable character. All of the characters in the alphabet (a-z and A-Z) and all of the digits (0-9) should be classified as letters.
space | A character that causes a word break. For example, if you classify the period (".") as a space character, then dtSearch would process U.S.A. as three separate words: U, S and A.
hyphen | A character that can receive special processing in dtSearch. By default, only '-' is defined as a hyphen. The Options.Hyphens setting controls how hyphens are treated.
ignore | A character that is disregarded in processing text. For example, if you classify the period as ignore instead of space, then dtSearch would process U.S.A. as one word: USA.
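
The effect of the space and ignore categories on the U.S.A. example can be modeled with a short Python sketch. This is an illustrative model only, not dtSearch code: the function name split_words, the category constants, and the rule that unlisted non-alphanumeric characters and hyphens act as word breaks are all assumptions made for the example.

```python
# Illustrative model of the four character categories; not the dtSearch API.
LETTER, SPACE, HYPHEN, IGNORE = "letter", "space", "hyphen", "ignore"


def split_words(text, categories, default=SPACE):
    """Split text into words using a character -> category mapping.

    Characters missing from the mapping count as letters when they are
    alphanumeric and fall back to `default` otherwise (an assumption).
    """
    words, current = [], []
    for ch in text:
        cat = categories.get(ch)
        if cat is None:
            cat = LETTER if ch.isalnum() else default
        if cat == LETTER:
            current.append(ch)
        elif cat == IGNORE:
            continue  # dropped without ending the current word
        else:
            # SPACE ends the word; HYPHEN is simply treated as a break here,
            # whereas real hyphen handling depends on Options.Hyphens.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


period_as_space = split_words("U.S.A.", {".": SPACE})
period_as_ignore = split_words("U.S.A.", {".": IGNORE})
print(period_as_space)   # ['U', 'S', 'A']
print(period_as_ignore)  # ['USA']
```

The only difference between the two calls is the category assigned to the period, which is what changes the result from three words to one.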

For characters that are letters, you can specify whether the character is a lowercase or an uppercase letter. For each uppercase letter, a lowercase equivalent can be designated.

The alphabet settings only affect characters in the range 33-127. The Unicode specification controls the classification of other characters. See www.unicode.org for more information about Unicode.

Alphabet files

The character categories and case mapping rules are specified in an alphabet file, which is a text file with a format similar to a Windows .ini file. To modify an alphabet file, you can use the "Edit Alphabet" dialog box in dtSearch Desktop (Options > Preferences > Letters and Words > Edit). 

When you create an index, dtSearch copies the current alphabet file into a file named index_a.ix in the index folder. Therefore, changes to the alphabet file will not affect existing indexes. In addition to the character settings described above, the index_a.ix file also contains the hyphen setting and the flag to enable CJK word breaking (see below).

CJK Word Breaking

Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. 

To make this type of text searchable, you can enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each of these characters is treated as a single word. The flag that enables this feature is dtsoTfAutoBreakCJK in Options.TextFlags.

You can specify which Unicode character ranges should have this treatment in the CJKRanges line at the end of an alphabet file. Example:

CJKRanges = 2e80-ac00 ac00-d7af f900-faff fe30-fe4f

This example designates the four ranges listed after "CJKRanges =" as characters that should be treated as separate words when automatic CJK word breaking is enabled.
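
To show how a CJKRanges line maps onto word breaking, the following Python sketch parses the example line and treats every character inside a listed range as its own word. It is a model for illustration, not dtSearch's implementation: the helper names parse_cjk_ranges and break_cjk are hypothetical, and the sketch assumes that range values are hexadecimal code points with inclusive end points.

```python
# Illustrative model of CJK word breaking; not the dtSearch implementation.

def parse_cjk_ranges(line):
    """Parse 'CJKRanges = 2e80-ac00 ...' into (start, end) code point pairs."""
    name, _, value = line.partition("=")
    if name.strip() != "CJKRanges":
        raise ValueError("not a CJKRanges line")
    ranges = []
    for item in value.split():
        start, end = item.split("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


def break_cjk(text, ranges):
    """Split text so that each in-range character becomes a separate word."""
    words, current = [], []

    def flush():
        # Emit any word collected so far and start a new one.
        if current:
            words.append("".join(current))
            current.clear()

    for ch in text:
        if any(start <= ord(ch) <= end for start, end in ranges):
            flush()
            words.append(ch)  # the CJK character stands alone as a word
        elif ch.isspace():
            flush()
        else:
            current.append(ch)
    flush()
    return words


ranges = parse_cjk_ranges("CJKRanges = 2e80-ac00 ac00-d7af f900-faff fe30-fe4f")
# Four CJK ideographs (U+5168 U+6587 U+691C U+7D22) after a Latin word.
sample = "dtSearch \u5168\u6587\u691c\u7d22"
words = break_cjk(sample, ranges)
print(len(words))  # 5
```

Without the range check, the four ideographs would come out as one long word, which is the behavior described above for text that has no spaces.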

Adding searchable Unicode characters

By default, all Unicode characters that are defined as letters in the Unicode specification are searchable. To make other characters such as Unicode currency characters searchable, you can add a line to the end of the alphabet file listing the Unicode characters to make searchable. Example:

AdditionalLetters = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac

This example makes the listed Unicode currency characters, such as the Euro, Pound, and Lira signs, searchable.
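
As a quick way to check what an AdditionalLetters line contains, the Python sketch below decodes the example line into characters and looks up each one's Unicode general category. The helper name parse_additional_letters is hypothetical, and the sketch assumes each entry is a hexadecimal code point; it does not read or modify a real alphabet file.

```python
import unicodedata

# Illustrative helper; not part of the dtSearch API.

def parse_additional_letters(line):
    """Decode 'AdditionalLetters = 00a2 00a3 ...' into a list of characters."""
    name, _, value = line.partition("=")
    if name.strip() != "AdditionalLetters":
        raise ValueError("not an AdditionalLetters line")
    return [chr(int(code, 16)) for code in value.split()]


line = ("AdditionalLetters = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 "
        "20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac")
letters = parse_additional_letters(line)

# Unicode classifies these as currency symbols ("Sc"), not letters ("L*"),
# which is why they are not searchable unless listed explicitly.
categories = {unicodedata.category(ch) for ch in letters}
print(len(letters), categories)
```

Every character in the example falls in the Sc (currency symbol) category rather than a letter category, which is consistent with the need to list them explicitly.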

Copyright (c) 1995-2016 dtSearch Corp. All rights reserved.