dtSearch Text Retrieval Engine Programmer's Reference
Alphabet Settings

The alphabet file determines which characters are treated as text, which cause a word break, and which are ignored.

Character categories

dtSearch classifies every character into one of four categories:

Letter: A searchable character. All of the characters in the alphabet (a-z and A-Z) and all of the digits (0-9) should be classified as letters.
Space: A character that causes a word break. For example, if you classify the period (".") as a space character, then dtSearch would process U.S.A. as three separate words: U, S and A.
Hyphen: Hyphen characters can receive special processing in dtSearch. By default, only the '-' is defined as a hyphen. The Options.Hyphens setting controls how hyphens are treated.
Ignore: A character that is disregarded in processing text. For example, if you classify the period as ignore instead of space then dtSearch would process U.S.A. as one word: USA.
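
The effect of the space and ignore categories on the U.S.A. example can be illustrated with a minimal Python sketch. This is not dtSearch code: the names Category, default_categories, and split_words are hypothetical, the starting table is an assumption rather than the shipped default alphabet, and hyphens are simply treated as word breaks here instead of following Options.Hyphens.

```python
from enum import Enum


class Category(Enum):
    LETTER = "letter"
    SPACE = "space"
    HYPHEN = "hyphen"
    IGNORE = "ignore"


def default_categories():
    """Build an assumed starting table for the characters 33-127."""
    table = {}
    for code in range(33, 128):
        ch = chr(code)
        if ch.isalnum():
            table[ch] = Category.LETTER   # a-z, A-Z, 0-9
        elif ch == "-":
            table[ch] = Category.HYPHEN
        else:
            table[ch] = Category.SPACE    # assumption: other punctuation breaks words
    return table


def split_words(text, table):
    """Split text into words according to the category table.

    Simplification: hyphen characters, whitespace, and any character
    missing from the table are all treated as word breaks.
    """
    words, current = [], []
    for ch in text:
        category = table.get(ch, Category.SPACE)
        if category is Category.LETTER:
            current.append(ch)
        elif category is Category.IGNORE:
            continue                      # dropped without ending the word
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


period_as_space = default_categories()
period_as_space["."] = Category.SPACE
print(split_words("U.S.A.", period_as_space))    # ['U', 'S', 'A']

period_as_ignore = default_categories()
period_as_ignore["."] = Category.IGNORE
print(split_words("U.S.A.", period_as_ignore))   # ['USA']
```

The only difference between the two runs is the category assigned to the period, which is the same change you would make in the alphabet file.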

For characters that are letters, you can specify whether the character is a lower-case or upper-case letter. For upper-case letters, you can also designate a lower-case equivalent.

The alphabet settings only affect characters in the range 33-127. The classification of all other characters is controlled by the Unicode specification; see that specification for more information about Unicode.

Alphabet files

The character categories and case mapping rules are specified in an alphabet file, which is a text file with a format similar to a Windows .ini file. To modify an alphabet file, you can use the "Edit Alphabet" dialog box in dtSearch Desktop (Options > Preferences > Letters and Words > Edit). 

When you create an index, dtSearch copies the current alphabet file into a file named index_a.ix in the index folder. Therefore, changes to the alphabet file will not affect existing indexes. In addition to the character settings described above, the index_a.ix file also contains the hyphen setting and the flag to enable CJK word breaking (see below).

Searching for punctuation

If you make a punctuation character searchable, you should also redefine any search operator that uses that character. For example, if you make the % character searchable, the fuzzy searching character should be redefined to something other than %. See Redefining Search Operators.

CJK Word Breaking

Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. 

To make this type of text searchable, you can enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each character will be treated as a single word. The flag to enable this feature is dtsoTfAutoBreakCJK in Options.TextFlags.

You can specify which Unicode character ranges should have this treatment in the CJKRanges line at the end of an alphabet file. Example:

CJKRanges32 = 2e80-2fff 3021-ac00 ac00-d7af f900-faff fe30-fe4f, ff60-ff9f, 1f200-1f2ff, 20000-2ffff

This example designates the ranges listed after "CJKRanges32 =" as characters that should be treated as separate words when automatic CJK word breaking is enabled. Each range is a pair of hexadecimal Unicode code points.
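
To show how a range line like this maps onto word breaking, here is a rough Python sketch. It is an illustration only, not dtSearch's parser or tokenizer: parse_ranges and break_cjk are hypothetical names, and accepting both spaces and commas as separators is an assumption made because the sample line mixes the two.

```python
import re


def parse_ranges(line):
    """Parse 'Name = lo-hi lo-hi, ...' into (name, [(lo, hi), ...])."""
    name, _, value = line.partition("=")
    ranges = []
    # Assumption: entries may be separated by spaces and/or commas.
    for item in re.split(r"[\s,]+", value.strip()):
        if not item:
            continue
        lo, hi = item.split("-")
        ranges.append((int(lo, 16), int(hi, 16)))
    return name.strip(), ranges


def in_ranges(ch, ranges):
    """True if the character's code point lies in any (lo, hi) range."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in ranges)


def break_cjk(text, ranges):
    """Split on whitespace, and make every in-range character its own word."""
    tokens, current = [], []

    def flush():
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if in_ranges(ch, ranges):
            flush()
            tokens.append(ch)   # each CJK character becomes a separate word
        elif ch.isspace():
            flush()
        else:
            current.append(ch)
    flush()
    return tokens


line = ("CJKRanges32 = 2e80-2fff 3021-ac00 ac00-d7af f900-faff "
        "fe30-fe4f, ff60-ff9f, 1f200-1f2ff, 20000-2ffff")
name, ranges = parse_ranges(line)
print(name, len(ranges))                          # CJKRanges32 8
print(break_cjk("dtSearch 全文検索 engine", ranges))
# ['dtSearch', '全', '文', '検', '索', 'engine']
```

Without the per-character breaks, the four CJK characters in the sample string would form one long word, which is the problem described above.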

Adding searchable Unicode characters

By default, all Unicode characters that are defined as letters in the Unicode specification are searchable. To make other characters such as Unicode currency characters searchable, you can add a line to the end of the alphabet file listing the Unicode characters to make searchable. Example:

AdditionalLetters32 = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac

This example makes the listed Unicode currency characters, including the Euro, Pound, and Lira signs, searchable.
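
When editing a line like this, it can help to check which characters the hexadecimal values refer to. The following Python sketch decodes the sample line using the standard unicodedata module. The helper name decode_additional_letters is hypothetical, and nothing here calls dtSearch.

```python
import unicodedata


def decode_additional_letters(line):
    """Return the characters named by the hex code points after '='."""
    _, _, value = line.partition("=")
    return [chr(int(code_hex, 16)) for code_hex in value.split()]


line = ("AdditionalLetters32 = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 "
        "20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac")

chars = decode_additional_letters(line)
for ch in chars:
    # Unicode general category 'Sc' means "Symbol, currency", not a letter.
    print(f"U+{ord(ch):04X}", ch, unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))
```

Every character in the sample has the Unicode general category Sc (currency symbol) rather than a letter category, which is why these characters are not searchable by default and need to be listed explicitly.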

TokenCharRanges32 specifies ranges of characters that are always indexed as separate words, regardless of the dtsoTfAutoBreakCJK setting. The default setting, which enables searches for emoji characters, is:

TokenCharRanges32 = 1f000-1f0ff 1f300-1f6ff 1f700-1f77f 1f900-1f9ff

This setting requires that ICU integration be enabled.
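
As a quick way to see which characters the default TokenCharRanges32 value covers, the Python sketch below tests individual code points against the ranges. The helper names parse_ranges and is_token_char are hypothetical, and the check only mirrors the range arithmetic; it does not reproduce dtSearch or ICU behavior.

```python
# Default value quoted in the text above.
DEFAULT_TOKEN_CHAR_RANGES = "1f000-1f0ff 1f300-1f6ff 1f700-1f77f 1f900-1f9ff"


def parse_ranges(value):
    """Turn 'lo-hi lo-hi ...' (hexadecimal) into a list of (lo, hi) integers."""
    ranges = []
    for item in value.split():
        lo, hi = item.split("-")
        ranges.append((int(lo, 16), int(hi, 16)))
    return ranges


def is_token_char(ch, ranges):
    """True if the character's code point falls inside any listed range."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


ranges = parse_ranges(DEFAULT_TOKEN_CHAR_RANGES)
for ch in ["\U0001F680", "\U0001F916", "\u2764", "a"]:
    print(f"U+{ord(ch):04X}", is_token_char(ch, ranges))
# U+1F680 True
# U+1F916 True
# U+2764 False
# U+0061 False
```

Characters outside the listed ranges, such as U+2764 in this run, are not covered by the default value; covering them would require adding a range to the line.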

Copyright (c) 1995-2023 dtSearch Corp. All rights reserved.