Unicode support

Last Reviewed: August 4, 2013

Article: DTS0140

Applies to: dtSearch 7

dtSearch supports indexing and searching Unicode text. This article will describe what is and is not covered in this support, and will provide additional information about how dtSearch Unicode support works with different operating systems and document types.

Contents

     Background

     dtSearch Support for Unicode

     File Formats

     Language Issues

     Alphabet Customization

     Troubleshooting Encoding Problems 

See also:  International Language Features in dtSearch

Background

Unicode. Unicode is a specification that allows text in any language to be encoded in a consistent way.  Detailed information on the Unicode specification is available at www.unicode.org.

UTF-8. UTF-8 is a widely used, compact encoding of Unicode text that preserves all information in a Unicode string. For example, XML uses UTF-8 as its default encoding. In UTF-8, characters 0 through 127 are encoded as the corresponding single-byte ASCII values. All other characters are encoded as sequences of two to four bytes, each with a value of 128 or greater. UTF-8 encoded strings do not contain embedded NULL bytes unless the text itself contains the NULL character (U+0000). Additional information on UTF-8 is available at www.unicode.org.
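
The byte-level behavior of UTF-8 can be checked with a few lines of Python. This is only an illustrative sketch using the standard library; the helper name utf8_bytes is made up for this example and is not part of dtSearch.

```python
# Encode a few sample characters to UTF-8 and inspect the resulting bytes.
# utf8_bytes is a hypothetical helper used only for this illustration.

def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 byte values for a string."""
    return list(ch.encode("utf-8"))

samples = ["A", "é", "中", "\U0001F600"]
for ch in samples:
    # ASCII characters stay one byte; everything else becomes 2 to 4 bytes,
    # all with values of 128 or greater.
    print(f"U+{ord(ch):04X} -> {utf8_bytes(ch)}")

# Ordinary text never produces a zero byte in UTF-8.
has_zero_byte = 0 in "naïve café".encode("utf-8")
print("contains zero byte:", has_zero_byte)
```

Because ASCII text is unchanged by the encoding, a pure-ASCII file is already valid UTF-8.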
Fonts. If characters are appearing as small rectangles, your system font may not support display of characters in the language you are searching. Microsoft Office contains a useful "Arial Unicode MS" font with coverage of nearly every character in every language included in the Unicode standard. Use Windows "Display Options" to select this font for use in menus and message boxes. In dtSearch Desktop, use Options > Preferences > Display Options to select the font used to display documents in the dtSearch viewer window.

Keyboard and Character Sets. To add support for additional languages and keyboards to your Windows system, use the Regional Options tool in Control Panel.

The Windows charmap.exe program provides another way to enter non-English text. To access it, click Start > Programs > Accessories > System Tools > Character Map.   

dtSearch Support for Unicode

dtSearch Unicode support means that dtSearch can index and search documents containing Unicode-encoded data. dtSearch Unicode support is built into the dtSearch Engine and works on all 32-bit and 64-bit versions of Windows. dtSearch can support Unicode even under non-Unicode versions of Windows because the necessary data, based on tables from www.unicode.org, is built into the dtSearch Text Retrieval Engine.

dtSearch supports 8-bit (UTF-8) and 16-bit (UCS-2) encodings of Unicode. UCS-4 (also known as UTF-32), a 32-bit encoding of Unicode that can express characters beyond the original 65,536-character limit of 16-bit Unicode, is not yet supported.
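
To see which text stays within the 16-bit range, the following Python sketch checks whether every character fits in a single 16-bit value (U+0000 through U+FFFF). The function names are hypothetical and the check is a general Unicode illustration, not a description of how dtSearch tests its input.

```python
# Check whether text fits in a single 16-bit value per character.
# Function names here are assumptions made for this example.

def fits_in_16_bits(text: str) -> bool:
    """True if every character is in the range U+0000..U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store the text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

print(fits_in_16_bits("日本語"))          # common CJK text fits
print(fits_in_16_bits("\U0001D11E"))      # musical G clef, U+1D11E, does not
print(utf16_code_units("\U0001D11E"))     # needs two 16-bit units
```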

File Formats

Microsoft Office

dtSearch can automatically recognize Unicode data in Microsoft Word, Excel and PowerPoint files.

HTML and XML

An HTML or XML file can include Unicode data if the file uses the UTF-8 encoding. HTML files that are stored with the UTF-8 encoding contain a META tag near the beginning of the file that looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the file uses a different encoding, the META tag will contain a different charset= value, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

dtSearch uses this tag to determine the encoding used by an HTML or XML file, and then applies the encoding to convert the data to Unicode for indexing and searching.  If no tag is present, dtSearch will attempt to infer the encoding based on the contents of the file.
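
The general idea of reading the charset= value and then decoding the file can be sketched in Python as follows. This is a simplified illustration, not dtSearch's actual detection logic: the function names, the 4096-byte scan limit, and the cp1252 fallback are all assumptions made for the example.

```python
import re

# Matches charset=... inside a <meta> tag (hypothetical, simplified pattern).
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)

def declared_charset(raw: bytes, scan_limit: int = 4096):
    """Return the charset named in a META tag near the start of the file, or None."""
    match = _META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None

def decode_html(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode HTML bytes to a Unicode string using the declared charset if any."""
    charset = declared_charset(raw) or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # charset name not known to Python
        return raw.decode(fallback, errors="replace")

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head>'
        b"<body>caf\xe9</body></html>")
print(declared_charset(page))
print("café" in decode_html(page))
```

When no tag is found, this sketch simply falls back to a fixed default; a real implementation would need to examine the file contents instead.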

Text Files

Text files do not contain any encoding information, so dtSearch has to infer the encoding based on the contents of the file.  In the Options > File Types dialog box in dtSearch Desktop, you can set up rules to tell dtSearch which default encoding to use for text files.
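
One common way to infer the encoding of a text file is to look for a byte order mark, then test whether the bytes are valid UTF-8, and otherwise fall back to a default. The Python sketch below shows that heuristic. It is an assumption-laden illustration, not dtSearch's algorithm: the function name and the cp1252 default are invented for the example, and UTF-32 is deliberately ignored.

```python
import codecs

# Byte order marks and the Python codec to use when one is found.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_text_encoding(raw: bytes, default: str = "cp1252") -> str:
    """Guess an encoding name for the raw bytes of a text file."""
    # 1. A byte order mark identifies the encoding directly.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Bytes that decode cleanly as UTF-8 are treated as UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise use the configured default (an assumed 8-bit code page).
        return default

print(guess_text_encoding("café".encode("utf-8")))
print(guess_text_encoding("café".encode("cp1252")))
```

The default in step 3 plays the same role as a user-configured default encoding: it is used only when the contents do not settle the question.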

WordPerfect

WordPerfect files use the WordPerfect Character Set to express non-English text. dtSearch converts WordPerfect Character Set data to Unicode for indexing, so non-English text in WordPerfect files is supported.

PDF

dtSearch can index and search Unicode characters in some, but not all, PDF files. Unlike other document formats, which usually contain text in some form, PDF files are essentially drawing instructions that provide information necessary to print a document on a printer or to draw it on the screen. Many PDF files contain character encoding information in addition to the drawing instructions, so the content of the PDF file can be converted back to text. In these types of PDF files, you can use the Text Select tool in Adobe Reader to select a block of text, copy the text to the clipboard, and paste it into another program like Notepad or Microsoft Word. If you can use the Text Select tool in Adobe Reader to copy and paste text from a PDF file, the file contains meaningful character encoding information, and dtSearch will probably be able to index and search the file correctly.

In some PDF files, however, only the drawing instructions are present, and the encoding information is either absent or random. As a result, there is no way to convert the file back to text. In these types of PDF files, Adobe Reader's Text Select tool will either (a) fail to work entirely, or (b) copy meaningless text to the clipboard. dtSearch cannot index or search this type of PDF file, because the file is really just a picture of text and does not contain any words.

Language Issues

Chinese, Japanese, Korean

Text in Chinese, Japanese, and Korean can be stored in, or converted to, Unicode, so dtSearch can search for words in these languages just as it can search for words in other languages. However, while dtSearch can search for literal word matches (or wildcard or fuzzy matches), there are some limitations on the support in dtSearch for Chinese, Japanese, and Korean text, described below.

(1) Dictionary-Based word breaking

Some documents store text in a way that does not separate the words with spaces. Instead, all of the text in a document is run together and a language-specific dictionary is needed to find word breaks. dtSearch does not have the ability to identify word breaks in these documents, because it does not include any language-specific dictionaries.  To make this type of text searchable, you can enable an option in dtSearch to automatically insert word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each character will be treated as a single word.  In dtSearch Desktop, this option setting is in Options > Preferences > Letters and Words.  In the developer API, the flag to enable this feature is dtsoTfAutoBreakCJK in Options.TextFlags.

(2) Variations in character forms and scripts

In these languages, the same text can be presented in different ways depending on the context. dtSearch will search for a word as it is provided in the search request and does not generate additional grammatical or script variations for words in Chinese, Japanese, and Korean.

For background information on handling text in these languages, and resources for software developers, see the CJK Institute site at www.cjk.org.

The dtSearch Engine has an API that can be used to integrate with dictionary-based language analyzers from companies such as Basis Technologies.  For more information, see How to integrate the dtSearch Engine with a language analyzer.

Word Prefixes and Suffixes (Arabic)

In some languages such as Arabic, the surrounding context for a word (my, your, the, a, masculine/feminine, etc.) can be expressed as characters added in front of or behind the word. For example, "the apple" or "my apple" would not be two words but would be different prefixes or suffixes added to "apple". To search for text in these languages, adding a * in the front and back of the word will pick up most of the variants, like this: *apple*.
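
The following Python sketch illustrates why the leading and trailing * help. It uses the standard fnmatch module as a stand-in for wildcard matching, which is an assumption made for the example and not how dtSearch evaluates search requests. The sample word list uses the Arabic word for "book" and is likewise invented for the illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed words (transliterations are given in the comments).
indexed_words = [
    "كتاب",    # kitab: "book"
    "الكتاب",  # al-kitab: "the book" (prefix added)
    "كتابي",   # kitabi: "my book" (suffix added)
    "مكتبة",   # maktaba: "library" (a different word)
]

def wildcard_hits(pattern: str, words: list[str]) -> list[str]:
    """Return the words matching a pattern where * stands for any characters."""
    return [w for w in words if fnmatchcase(w, pattern)]

# An exact search finds only the bare word.
print(wildcard_hits("كتاب", indexed_words))
# Wildcards on both sides also find the prefixed and suffixed forms.
print(wildcard_hits("*كتاب*", indexed_words))
```

Some variants change the spelling of the base word itself, which is why the wildcard form finds most, rather than all, of the variants.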

Arabic and Hebrew PDF Files

Some PDF files store Arabic and Hebrew text in reversed order, from left to right, instead of the logical order in which the characters occur in the text (right to left).  In these files, every word is stored spelled backward, and every line of text has its words in reversed order.  dtSearch checks for this condition when it indexes PDF files and inverts the order of the characters within reversed Hebrew and Arabic words, so these words will still be searchable.  However, to enable hit highlighting to work, dtSearch does not reverse the order of words on each line, so words within a line will be indexed in the actual order they occur in the PDF file.
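
The per-word reversal described above can be sketched in Python. This is a minimal illustration that assumes the line has already been identified as reversed; the function names are hypothetical, the check covers only the basic Hebrew and Arabic blocks, and the detection step that dtSearch performs is not shown.

```python
def is_rtl_char(ch: str) -> bool:
    """True for characters in the Hebrew or Arabic blocks (U+0590..U+06FF)."""
    return "\u0590" <= ch <= "\u06ff"

def fix_reversed_line(line: str) -> str:
    """Reverse the characters inside each Hebrew or Arabic word.

    The order of the words on the line is left unchanged.
    """
    fixed = []
    for word in line.split(" "):
        if word and all(is_rtl_char(c) for c in word):
            fixed.append(word[::-1])  # restore logical character order
        else:
            fixed.append(word)        # leave other text untouched
    return " ".join(fixed)

# "shalom olam" as stored in a reversed PDF: both words spelled backward,
# and the two words in reversed order.
stored = "\u05dd\u05dc\u05d5\u05e2 \u05dd\u05d5\u05dc\u05e9"
print(fix_reversed_line(stored))
```

After the fix each word is spelled correctly, so a search for either word succeeds, but the words still appear in the order in which the PDF stores them.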

Accent-insensitive indexing

dtSearch can create indexes that are either "accent-sensitive" or "accent-insensitive." An accent-insensitive index converts characters, wherever possible, to a "base" character which is either one of the letters A-Z or one of the digits 0-9. Accent-insensitive indexes are generally easier to use because they ensure that a document will be found even if the author omitted an accent, or if the user entering a search request omitted an accent, in typing a word. The following are examples of the character conversions done in an accent-insensitive index: 

 

Character               Unicode value    "Base" Character
Å                       U+00C5           A
ç                       U+00E7           c
Superscript 8           U+2078           8
Arabic-Indic digit 8    U+0668           8

In an accent-sensitive index, each letter is converted to lower case where possible but otherwise characters are indexed using their Unicode values. In an accent-sensitive index, ç and c would be considered different letters, and a search for one would not find the other.
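
The difference between the two kinds of index can be approximated in Python with the standard unicodedata module. This sketch is an assumption about how such folding can be done in general, not a description of the conversion tables built into dtSearch, and the function names are invented for the example.

```python
import unicodedata

def fold_char(ch: str) -> str:
    """Map a character to a base letter or digit where possible."""
    # Characters with a digit value (superscripts, Arabic-Indic digits) become 0-9.
    digit = unicodedata.digit(ch, None)
    if digit is not None:
        return str(digit)
    # Otherwise decompose the character and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base or ch

def accent_insensitive_key(word: str) -> str:
    """Key for an accent-insensitive index: fold accents, then lower-case."""
    return "".join(fold_char(c) for c in word).lower()

def accent_sensitive_key(word: str) -> str:
    """Key for an accent-sensitive index: lower-case only."""
    return word.lower()

for ch in ["Å", "ç", "\u2078", "\u0668"]:
    print(f"U+{ord(ch):04X} -> {fold_char(ch)}")

print(accent_insensitive_key("Français") == accent_insensitive_key("francais"))
print(accent_sensitive_key("Français") == accent_sensitive_key("francais"))
```

The last two lines show the practical consequence: the two spellings share a key only in the accent-insensitive case.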

Alphabet Customization

dtSearch versions 5 and earlier used "alphabet" files with a .ABC extension to provide for customization of the handling of 8-bit characters. This made it possible to define, for each character in the range from 33 to 255, whether it was a letter or not and the rules for capitalization and accents. dtSearch still uses .ABC files, but only for characters in the range from 33 to 127. All other characters are handled according to the definitions in the Unicode character tables.

Troubleshooting Encoding Problems

No accented letters appear in the indexed word list

Indexes are created accent-insensitive by default. This means that all letters are converted to a-z whenever possible, and a search for é is considered equivalent to a search for e. Therefore, no accented letters will appear in the indexed word list. To make an accent-sensitive index, check the "accent sensitive" option in the Create Index dialog box when you create the index.

Text files appear incorrectly in dtSearch, and the words in the indexed word list have missing or scrambled accented characters

Please see this article for troubleshooting steps:  Troubleshooting encoding detection