Unicode support

Article: dts0140

Applies to: dtSearch 7.93 and later

dtSearch supports indexing and searching Unicode text. This article will describe what is and is not covered in this support, and will provide additional information about how dtSearch Unicode support works with different operating systems and document types.

See also: International Language Features in dtSearch

Background

Unicode. Unicode is a specification that allows text in any language to be encoded in a consistent way. Detailed information on the Unicode specification is available at www.unicode.org.

UTF-8. UTF-8 is a widely used, compact encoding of Unicode text that preserves all information in a Unicode string. For example, XML uses UTF-8 as its default encoding. In UTF-8, characters 0 through 127 are encoded as single bytes with the same values as ASCII characters 0 through 127. All other characters are encoded as sequences of two to four bytes, each with a value of 128 or greater. As a result, UTF-8 never produces a NULL (zero) byte except when encoding the NULL character itself, so UTF-8 encoded strings do not contain embedded NULL characters. Additional information on UTF-8 is available at www.unicode.org.
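The byte-level behavior described above can be checked with a short Python sketch. It uses only Python's built-in UTF-8 codec and has nothing to do with dtSearch itself; the helper name utf8_bytes is made up for this illustration.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values of a string as a list of integers."""
    return list(text.encode("utf-8"))

# An ASCII letter stays a single byte with its ASCII value.
ascii_bytes = utf8_bytes("J")

# Non-ASCII characters become multi-byte sequences; every byte is >= 128.
accented_bytes = utf8_bytes("\u00e7")   # c with cedilla
cjk_bytes = utf8_bytes("\u6c34")        # cjk unified ideograph-6c34

print(ascii_bytes, accented_bytes, cjk_bytes)

# Encoding is lossless: decoding returns the original string.
sample = "J\u00e7\u6c34"
round_trip = sample.encode("utf-8").decode("utf-8")
print(round_trip == sample)
```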

Fonts. If characters appear as small rectangles, your system font may not support display of characters in the language you are searching. Changing your display font to one that covers more of the Unicode character set, such as Microsoft's "Arial Unicode MS", will usually fix this. Use Windows "Display Options" to select this font for use in menus and message boxes. In dtSearch Desktop, use Options > Preferences > Display Options to select the font used to display documents in the dtSearch viewer window.

Keyboard and Character Sets. To add support for additional languages and keyboards to your Windows system, use the Regional Options tool in Control Panel.

The Windows charmap.exe program provides another way to enter non-English text. To access it, click Start > Programs > Accessories > System Tools > Character Map.

Unicode Code Points. Each Unicode character is identified by a numerical value, called a code point. The notation U+ followed by a hexadecimal number indicates the code point with that numerical value, so U+004a refers to the character with hexadecimal value 4a (decimal 74), which is the letter J.
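As a small illustration of the U+ notation, the following Python sketch converts between characters and code point strings. The function names are hypothetical helpers for this example only.

```python
def code_point(ch):
    """Format a single character in U+ notation with lowercase hex digits."""
    return f"U+{ord(ch):04x}"

def from_code_point(notation):
    """Return the character named by a string such as 'U+004a'."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected U+ followed by hex digits")
    return chr(int(notation[2:], 16))

print(code_point("J"))
print(from_code_point("U+00e7"))
```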

dtSearch Support for Unicode

dtSearch Unicode support means that dtSearch can index and search documents containing Unicode-encoded data. dtSearch Unicode support is built into the dtSearch Engine (based on tables from www.unicode.org) and works on all 32-bit and 64-bit versions of Windows.

dtSearch versions 7.93 and later use the International Components for Unicode (ICU) library version 63.1 to implement Unicode-related features. This article describes the behavior of dtSearch with ICU, as is the case with dtSearch Desktop. For information on using ICU with applications built using the dtSearch Engine, please see ICU Integration (dtsearch.com).

File Formats

Microsoft Office

dtSearch can automatically recognize Unicode data in supported Microsoft Office formats.

HTML and XML

An HTML or XML file can include Unicode data if the file uses the UTF-8 encoding. HTML files that are stored with the UTF-8 encoding contain a META tag near the beginning of the file that looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the file uses a different encoding, the META tag will contain a different charset= value, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

dtSearch uses this tag to determine the encoding used by an HTML or XML file, and then applies the encoding to convert the data to Unicode for indexing and searching. If no tag is present, dtSearch will attempt to infer the encoding based on the contents of the file.
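The general idea of reading the charset from a META tag can be sketched in Python as follows. This is only an illustration of the technique, not dtSearch's implementation: the regular expression, the 4096-byte scan window, and the simple UTF-8 fallback (used here in place of real content-based inference) are all assumptions made for the example.

```python
import re

# Matches charset=... inside a <meta ...> tag (simplified; an assumption).
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes using the declared charset, else a fallback."""
    match = META_CHARSET.search(raw[:4096])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode leniently with the fallback.
        return raw.decode(fallback, errors="replace")

latin1_page = (
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1"><p>caf\xe9</p>'
)
utf8_page = b'<meta charset="utf-8"><p>\xe6\xb0\xb4</p>'

print(decode_html(latin1_page))
print(decode_html(utf8_page))
```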

Text Files

Text files do not contain any encoding information, so dtSearch has to infer the encoding based on the contents of the file. In the Options > File Types dialog box in dtSearch Desktop, you can set up rules to tell dtSearch which default encoding to use for text files.
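To give a feel for what content-based inference with a configurable default involves, here is a minimal Python sketch. It is not the detection logic dtSearch uses: it only checks for a byte order mark, tests whether the bytes are valid UTF-8, and otherwise returns a caller-supplied default (cp1252 is an arbitrary choice for the example). UTF-32 and other encodings are ignored for brevity.

```python
import codecs

def guess_text_encoding(raw, default="cp1252"):
    """Guess a codec name for raw text bytes (simplified illustration)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"          # UTF-8 with a byte order mark
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"             # the codec reads the BOM itself
    try:
        raw.decode("utf-8")         # strict decode: fails on invalid bytes
        return "utf-8"
    except UnicodeDecodeError:
        return default              # fall back to the configured default

utf8_guess = guess_text_encoding(b"caf\xc3\xa9")
legacy_guess = guess_text_encoding(b"caf\xe9")
print(utf8_guess, legacy_guess)
```

Note that pure ASCII text is valid UTF-8, so it is reported as UTF-8 here; decoding it with the default encoding would give the same result.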

PDF

dtSearch can index and search Unicode characters in some, but not all, PDF files. Unlike other document formats, which usually contain text in some form, PDF files are essentially drawing instructions that provide information necessary to print a document on a printer or to draw it on the screen. Many PDF files contain character encoding information in addition to the drawing instructions, so the content of the PDF file can be converted back to text. In these types of PDF files, you can use the Text Select tool in Adobe Reader to select a block of text, copy the text to the clipboard, and paste it into another program like Notepad or Microsoft Word. If you can use the Text Select tool in Adobe Reader to copy and paste text from a PDF file, the file contains meaningful character encoding information, and dtSearch will probably be able to index and search the file correctly.

In some PDF files, however, only the drawing instructions are present, and the encoding information is either absent or random. As a result, there is no way to convert the file back to text. In these types of PDF files, Adobe Reader's Text Select tool will either (a) fail to work entirely, or (b) copy meaningless text to the clipboard. dtSearch cannot index or search this type of PDF file, because the file is really just a picture of text and does not contain any words.

WordPerfect

WordPerfect files use the WordPerfect Character Set to express non-English text. dtSearch converts WordPerfect Character Set data to Unicode for indexing, so non-English text in WordPerfect files is supported.

Language Issues

Chinese, Japanese, Korean

Text in Chinese, Japanese, and Korean can be stored in, or converted to, Unicode, so dtSearch can search for words in these languages just as it can search for words in other languages. However, while dtSearch can search for literal word matches (or wildcard or fuzzy matches), there are some limitations on the support in dtSearch for Chinese, Japanese, and Korean text, described below.

(1) Dictionary-based word breaking

Some documents store text in a way that does not separate the words with spaces. Instead, all of the text in a document is run together and a language-specific dictionary is needed to find word breaks. dtSearch does not have the ability to identify word breaks in these documents, because it does not include any language-specific dictionaries. To make this type of text searchable, you can enable an option in dtSearch to automatically insert word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each character will be treated as a single word. In dtSearch Desktop, this option setting is in Options > Preferences > Letters and Words. In the developer API, the flag to enable this feature is dtsoTfAutoBreakCJK in Options.TextFlags.
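The effect of treating each Chinese, Japanese, or Korean character as a single word can be illustrated with the Python sketch below. It is not how dtSearch implements the option; the code point ranges are a deliberately small, assumed subset of the CJK blocks, and the function names are hypothetical.

```python
# Simplified, assumed subset of CJK code point ranges (not exhaustive).
CJK_RANGES = (
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0x3400, 0x4DBF),   # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3),   # Hangul syllables
)

def is_cjk(ch):
    """True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def auto_break_cjk(text):
    """Surround each CJK character with spaces, then collapse whitespace."""
    pieces = [f" {ch} " if is_cjk(ch) else ch for ch in text]
    return " ".join("".join(pieces).split())

# The two ideographs are split into separate words; Latin words are kept.
words = auto_break_cjk("dtSearch\u624b\u6c34test").split(" ")
print(words)
```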

(2) Variations in character forms and scripts

In these languages, the same text can be presented in different ways depending on the context. dtSearch will search for a word as it is provided in the search request and does not generate additional grammatical or script variations for words in Chinese, Japanese, and Korean.

For background information on handling text in these languages, and resources for software developers, see the CJK Institute site at www.cjk.org.

The dtSearch Engine has an API that can be used to integrate with dictionary-based language analyzers from companies such as Basis Technologies. For more information, see How to integrate the dtSearch Engine with a language analyzer.

Word Prefixes and Suffixes (Arabic)

In some languages such as Arabic, the surrounding context for a word (my, your, the, a, masculine/feminine, etc.) can be expressed as characters added in front of or behind the word. For example, "the apple" or "my apple" would not be two words but would be different prefixes or suffixes added to "apple". To search for text in these languages, adding a * in the front and back of the word will pick up most of the variants, like this: *apple*.
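The following Python sketch shows why a pattern with * on both sides picks up prefixed and suffixed forms. It uses Python's fnmatch module as a stand-in for wildcard matching, which is an assumption for illustration and not dtSearch's search engine; the word list is a made-up example built around the Arabic stem for "book".

```python
from fnmatch import fnmatchcase

# Stem meaning "book"; the definite article is a prefix and the
# possessive "my" is a suffix attached directly to the word.
stem = "\u0643\u062a\u0627\u0628"
the_book = "\u0627\u0644" + stem
my_book = stem + "\u064a"
unrelated = "\u0642\u0644\u0645"     # "pen"

indexed_words = [the_book, my_book, stem, unrelated]

def wildcard_matches(pattern, words):
    """Return the words that match a pattern where * means any characters."""
    return [w for w in words if fnmatchcase(w, pattern)]

hits = wildcard_matches("*" + stem + "*", indexed_words)
print(len(hits))
```

Because * also matches zero characters, the bare stem is found along with its prefixed and suffixed forms.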

Arabic and Hebrew PDF Files

Some PDF files store Arabic and Hebrew text in reversed order, from left to right, instead of the logical order in which the characters occur in the text (right to left). In these files, every word is stored spelled backward, and every line of text has its words in reversed order. dtSearch checks for this condition when it indexes PDF files and inverts the order of the characters within reversed Hebrew and Arabic words, so these words will still be searchable. However, to enable hit highlighting to work, dtSearch does not reverse the order of words on each line, so words within a line will be indexed in the actual order they occur in the PDF file.
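The difference between reversing characters within words and reversing the words on a line can be seen in this Python sketch. It is an illustration only, not dtSearch's code: it assumes the line is already known to be stored in reversed order, splits words on single spaces, and ignores combining marks such as vowel points.

```python
import unicodedata

def is_rtl_word(word):
    """True if the word contains a Hebrew or Arabic (right-to-left) letter."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)

def fix_reversed_line(stored_line):
    """Reverse the characters inside each RTL word; keep the word order."""
    return " ".join(
        w[::-1] if is_rtl_word(w) else w for w in stored_line.split(" ")
    )

# Two Hebrew words in logical order, then the same line as a PDF
# might store it: the entire line reversed character by character.
logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
stored = logical[::-1]

fixed = fix_reversed_line(stored)

# Each word is spelled correctly again, but the words remain in the
# order in which they were stored, which is the reverse of logical order.
print(fixed.split(" ") == logical.split(" ")[::-1])
```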

Accent-insensitive indexing

dtSearch can create indexes that are either "accent-sensitive" or "accent-insensitive." An accent-insensitive index converts characters, wherever possible, to a "base" character mapping which is either one of the letters A-Z or one of the digits 0-9. Diacriticals are removed from letters where the Unicode Standard defines a marking as a diacritical.

Accent-insensitive indexes are generally easier to use because they ensure that a document will be found even if the author omitted an accent, or if the user entering a search request omitted an accent, in typing a word. The following are examples of the character conversions done in an accent-insensitive index:

Character | Unicode value | Mapping
Å | U+00c5 (a with ring above) | A
ç | U+00e7 (c with cedilla) | c
⁸ | U+2078 (superscript 8) | 8
٨ | U+0668 (arabic-indic digit 8) | 8
ΰ | U+03b0 (greek small letter upsilon with dialytika and tonos) | υ U+03c5 (greek small letter upsilon)
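The conversions in the table above can be approximated with Python's unicodedata module, as in the sketch below. This is an approximation under stated assumptions, not the dtSearch or ICU implementation: it replaces any character with a Unicode digit value by the corresponding ASCII digit, and otherwise decomposes the character and drops nonspacing marks. The function name fold_accents is hypothetical.

```python
import unicodedata

def fold_accents(text):
    """Map characters to unaccented base characters where Unicode allows."""
    out = []
    for ch in text:
        # Digits in any script, and superscript digits, carry a digit value.
        digit = unicodedata.digit(ch, None)
        if digit is not None:
            out.append(str(digit))
            continue
        # Canonical decomposition separates base letters from diacritics,
        # which are then removed (category Mn = nonspacing mark).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append(
            "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        )
    return "".join(out)

for ch in ["\u00c5", "\u00e7", "\u2078", "\u0668", "\u03b0"]:
    print(f"U+{ord(ch):04x}", "->", fold_accents(ch))
```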

dtSearch will also map characters according to the compatibility equivalence mappings in the Unicode Standard, as defined in ICU. Examples:

Character | Unicode value | Mapping
⼏ | U+2f0f (kangxi radical table) | 几 U+51e0 (cjk unified ideograph-51e0)
が | U+304c (hiragana letter ga) | か U+304b (hiragana letter ka)
ガ | U+30ac (katakana letter ga) | カ U+30ab (katakana letter ka)
ㄱ | U+3131 (hangul letter kiyeok) | ᄀ U+1100 (hangul choseong kiyeok)
㊌ | U+328c (circled ideograph water) | 水 U+6c34 (cjk unified ideograph-6c34)
論 | U+f941 (cjk compatibility ideograph-f941) | 論 U+8ad6 (cjk unified ideograph-8ad6)
🈐 | U+1f210 (squared cjk unified ideograph-624b) | 手 U+624b (cjk unified ideograph-624b)
ﮎ | U+fb8e (arabic letter keheh isolated form) | ک U+06a9 (arabic letter keheh)

Mappings are done based on definitions in the Unicode Standard, so Ľ (U+013d L with caron) is mapped to L, but Ł (U+0142 L with stroke) is not mapped to L because the Unicode Standard does not define this mapping.
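The compatibility mappings, and the difference between Ľ and Ł, can be reproduced approximately with Unicode normalization in Python. This sketch is an assumption-laden stand-in for what ICU does inside dtSearch: it applies compatibility decomposition (NFKD), removes nonspacing marks, and recomposes the result. The function name fold_compat is hypothetical.

```python
import unicodedata

def fold_compat(text):
    """Apply compatibility decomposition and strip diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    # Recompose so that unaffected text returns to its usual form.
    return unicodedata.normalize("NFC", stripped)

# L with caron has a decomposition (L + combining caron), so it folds to L.
caron = fold_compat("\u013d")

# L with stroke has no decomposition in the Unicode Standard, so it is
# left unchanged.
stroke = fold_compat("\u0142")

print(caron, stroke == "\u0142")
```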

In an accent-sensitive index, each letter is converted to lower case where possible but otherwise characters are indexed using their Unicode values. In an accent-sensitive index, ç and c would be considered different letters, and a search for one would not find the other.

Alphabet Customization

dtSearch versions 5 and earlier used "alphabet" files with a .ABC extension to provide for customization of the handling of 8-bit characters. This made it possible to define, for each character in the range from 33 to 255, whether it was a letter or not and the rules for capitalization and accents. dtSearch still uses .ABC files, but only for characters in the range from 33 to 127. All other characters are handled according to the definitions in the Unicode character tables.

In dtSearch Desktop, you can use the Options > Preferences > Letters and Words dialog box to make certain punctuation characters searchable. Additionally, it is possible to make other technical changes in Unicode handling by editing the alphabet files directly in a text editor. For more information see Alphabet Settings (dtsearch.com).

Troubleshooting Encoding Problems

No accented letters appear in the indexed word list

Indexes are created accent-insensitive by default. This means that all letters are converted to a-z whenever possible, and a search for é is considered equivalent to a search for e. Therefore, no accented letters will appear in the indexed word list. To make an accent-sensitive index, check the "accent sensitive" option in the Create Index dialog box when you create the index.

Text files appear incorrectly in dtSearch, and the words in the indexed word list have missing or scrambled accented characters

Please see this article for troubleshooting steps: Troubleshooting encoding detection