Alphabet Settings

The alphabet file determines which characters are treated as text, which cause a word break, and which are ignored.

Remarks

Character categories

dtSearch classifies every character into one of four categories:

Category	Meaning
letter	A searchable character. All of the characters in the alphabet (a-z and A-Z) and all of the digits (0-9) should be classified as letters.
space	A character that causes a word break. For example, if you classify the period (".") as a space character, then dtSearch would process U.S.A. as three separate words: U, S and A.
hyphen	A character that can receive special processing in dtSearch. By default, only '-' is defined as a hyphen. The Options.Hyphens setting controls how hyphens are treated.
ignore	A character that is disregarded in processing text. For example, if you classify the period as ignore instead of space then dtSearch would process U.S.A. as one word: USA.
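
The difference between the space and ignore categories can be sketched with a toy tokenizer. This is only an illustration of the U.S.A. examples in the table, not dtSearch's implementation; the function name and the category map are hypothetical, and hyphen handling is left out because it depends on Options.Hyphens.

```python
def tokenize(text, categories):
    """Split text into words using a per-character category map.

    Hypothetical helper: characters missing from the map are treated as
    letters, and whitespace always causes a word break.
    """
    words, current = [], []
    for ch in text:
        kind = "space" if ch.isspace() else categories.get(ch, "letter")
        if kind == "ignore":
            continue  # drop the character and keep building the current word
        if kind == "space":
            if current:
                words.append("".join(current))
                current = []
            continue
        current.append(ch)
    if current:
        words.append("".join(current))
    return words


period_as_space = tokenize("U.S.A.", {".": "space"})
period_ignored = tokenize("U.S.A.", {".": "ignore"})
print(period_as_space)  # ['U', 'S', 'A']
print(period_ignored)   # ['USA']
```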

For characters that are letters, you can specify whether the character is a lowercase or uppercase letter. For uppercase letters, a lowercase equivalent can be designated.

The alphabet settings only affect characters in the range 33-127. The Unicode specification controls the classification of other characters. See www.unicode.org for more information about Unicode. Generally, Unicode characters that are classified in Unicode as letters or digits will be searchable, and characters that are classified as punctuation will be treated as not searchable and will cause a word break.
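
To get a feel for the general rule for characters outside the 33-127 range, the sketch below looks up Unicode general categories with Python's standard unicodedata module. The mapping from categories to "searchable" is an assumption made for illustration (letter categories plus decimal digits); the exact set of categories dtSearch uses is not stated here.

```python
import unicodedata


def searchable_by_default(ch):
    """Approximate the general rule: letters and digits are searchable.

    Assumption: "letters" means the L* general categories and "digits"
    means Nd. dtSearch's real rule may differ in detail.
    """
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"


# e-acute (letter), Arabic-Indic digit three, euro sign, ideographic comma
for ch in ["\u00e9", "\u0663", "\u20ac", "\u3001"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), searchable_by_default(ch))
```

Under this approximation the euro sign (category Sc) is not searchable, which is the kind of character the AdditionalLetters32 setting is meant for.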

Alphabet files

The character categories and case mapping rules are specified in an alphabet file, which is a text file with a format similar to a Windows .ini file. To modify an alphabet file, you can use the "Edit Alphabet" dialog box in dtSearch Desktop (Options > Preferences > Letters and Words > Edit).

When you create an index, dtSearch copies the current alphabet file into a file named index_a.ix in the index folder. Therefore, changes to the alphabet file will not affect existing indexes. In addition to the character settings described above, the index_a.ix file also contains the hyphen setting and the flag to enable CJK word breaking (see below).

Searching for punctuation

When making a punctuation character searchable, any associated search operator for that character should be redefined. For example, if you make the % character searchable, the fuzzy searching character should be redefined to something other than %. See Redefining Search Operators.

CJK Word Breaking

Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word.

To make this type of text searchable, you can enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each character will be treated as a single word. The flag to enable this feature is dtsoTfAutoBreakCJK in Options.TextFlags.

You can specify which Unicode character ranges should have this treatment in the CJKRanges line at the end of an alphabet file. Example:

    CJKRanges32 = 2e80-2fff 3021-ac00 ac00-d7af f900-faff fe30-fe4f, ff60-ff9f, 1f200-1f2ff, 20000-2ffff

This example designates the ranges listed after "CJKRanges32 =" as characters that should be treated as separate words when automatic CJK word breaking is enabled.

The CJKRanges32 setting controls word breaking only. It does not affect the classification of characters as letters (searchable) or punctuation (not searchable).
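
The effect of the CJKRanges32 line can be sketched as follows. This is a simplified model, not dtSearch's code: the helper names are hypothetical, and the parser assumes that ranges are hexadecimal start-end pairs separated by spaces and/or commas, as in the example line.

```python
import re


def parse_ranges(value):
    """Parse hex 'start-end' ranges separated by spaces and/or commas (assumed format)."""
    ranges = []
    for item in re.split(r"[,\s]+", value.strip()):
        if item:
            start, end = item.split("-")
            ranges.append((int(start, 16), int(end, 16)))
    return ranges


def break_cjk(text, ranges):
    """Treat every character inside the ranges as its own word."""
    words, current = [], []

    def flush():
        if current:
            words.append("".join(current))
            current.clear()

    for ch in text:
        if any(lo <= ord(ch) <= hi for lo, hi in ranges):
            flush()
            words.append(ch)  # each CJK character becomes a separate word
        elif ch.isspace():
            flush()
        else:
            current.append(ch)
    flush()
    return words


line = ("CJKRanges32 = 2e80-2fff 3021-ac00 ac00-d7af f900-faff "
        "fe30-fe4f, ff60-ff9f, 1f200-1f2ff, 20000-2ffff")
cjk_ranges = parse_ranges(line.split("=", 1)[1])

# U+4E2D and U+6587 both fall inside the 3021-ac00 range.
print(break_cjk("dtSearch\u4e2d\u6587 test", cjk_ranges))
```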

Adding searchable Unicode characters

By default, all Unicode characters that are defined as letters in the Unicode specification are searchable. To make other characters such as Unicode currency characters searchable, you can add a line to the end of the alphabet file listing the Unicode characters to make searchable. Example:

    AdditionalLetters32 = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac

This example makes currency characters such as the Euro, Pound, and Lira searchable characters.
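
To see which characters the example line covers, the hexadecimal code points can be decoded with Python's standard library. The parsing below assumes space-separated hex values, as shown in the example; it only inspects the line and does not interact with dtSearch.

```python
import unicodedata

line = ("AdditionalLetters32 = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 "
        "20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac")

# Everything after "=" is a list of hexadecimal code points.
code_points = [int(cp, 16) for cp in line.split("=", 1)[1].split()]
additional_letters = [chr(cp) for cp in code_points]

for ch in additional_letters:
    # unicodedata.name returns the official Unicode character name.
    print(f"U+{ord(ch):04X}", ch, unicodedata.name(ch, "<unnamed>"))
```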

TokenCharRanges32

List ranges of characters that are always indexed as separate words, regardless of the dtsoTfAutoBreakCJK setting. The default setting, enabling searches for emoji characters as separate tokens, is:

    TokenCharRanges32 = 1f000-1f0ff 1f300-1f6ff 1f700-1f77f 1f900-1f9ff

This feature requires that ICU integration be enabled. If TokenCharRanges32 is specified without ICU integration, the specified characters will be tokenized only if the dtsoTfAutoBreakCJK flag is set.
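
A quick way to check whether a given character falls inside the default TokenCharRanges32 ranges is sketched below. The helper name is hypothetical, and the check only models range membership; it says nothing about whether ICU integration is enabled.

```python
# Default ranges copied from the TokenCharRanges32 line above.
DEFAULT_TOKEN_RANGES = "1f000-1f0ff 1f300-1f6ff 1f700-1f77f 1f900-1f9ff"


def in_token_ranges(ch, ranges_text=DEFAULT_TOKEN_RANGES):
    """Return True if ch lies in one of the space-separated hex ranges."""
    code_point = ord(ch)
    for item in ranges_text.split():
        start, end = (int(part, 16) for part in item.split("-"))
        if start <= code_point <= end:
            return True
    return False


print(in_token_ranges("\U0001F600"))  # U+1F600 lies in 1f300-1f6ff
print(in_token_ranges("A"))           # ordinary letters are outside the ranges
```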

Older alphabet files from pre-7.93 dtSearch versions

Older alphabet files had a CJKRanges setting instead of CJKRanges32. CJKRanges supported 16-bit ranges only. If an alphabet file has only CJKRanges and no CJKRanges32, dtSearch will supply the default 32-bit values (1f200-1f2ff, 20000-2ffff) when creating an index. To prevent this default, add a CJKRanges32 line to the alphabet file specifying the ranges to be covered.

Older alphabet files did not have TokenCharRanges32. If the TokenCharRanges32 setting is not present in an alphabet file, the default ranges listed above will be used.
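
The fallback rules for older alphabet files can be summarized in a small check. This is a sketch under stated assumptions: it treats the alphabet file as "name = value" lines and matches setting names case-sensitively, neither of which is confirmed by this page, and the function name is hypothetical.

```python
def legacy_report(alphabet_text):
    """Report which defaults would apply, based on the rules described above."""
    keys = set()
    for line in alphabet_text.splitlines():
        if "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return {
        # Only CJKRanges present: the default 32-bit ranges are supplied.
        "default_32bit_cjk_ranges_added": "CJKRanges" in keys and "CJKRanges32" not in keys,
        # No TokenCharRanges32 line: the default token ranges are used.
        "default_token_ranges_used": "TokenCharRanges32" not in keys,
    }


old_file = "CJKRanges = 2e80-2fff 3021-ac00\n"
new_file = "CJKRanges32 = 2e80-2fff\nTokenCharRanges32 = 1f300-1f6ff\n"
print(legacy_report(old_file))
print(legacy_report(new_file))
```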

Group

Options