Troubleshooting encoding detection

Article: dts0203

Symptom

Random characters appear in place of non-English text in documents.

Cause

The encoding of the input data is not detected correctly.

Some file types, such as single-byte text files and some HTML files, carry no encoding information to indicate which character set should be used to interpret the bytes in the file. When dtSearch indexes these files, it analyzes the document text to infer the language so that it can apply the appropriate encoding. This process usually selects the correct encoding, but it is inherently inexact, so in some cases the text is decoded with the wrong character set and appears as random characters.
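The effect of a wrong guess can be reproduced outside dtSearch. The following Python sketch is only an illustration and is not part of dtSearch; the helper name decode_as is hypothetical. It decodes the same single-byte data under three different Windows code pages, which is what happens when an encoding detector picks the wrong one.

```python
# Illustration only: the same bytes decoded under different assumed code pages.
# "decode_as" is a hypothetical helper, not a dtSearch function.

def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes under an assumed encoding, as a detector's guess would."""
    return data.decode(encoding)

# "café" saved as Windows Latin-1 (CP-1252); the accented letter is byte 0xE9.
raw = "caf\u00e9".encode("cp1252")

for guess in ("cp1252", "cp1251", "cp1253"):
    # cp1252 gives the intended text; cp1251 (Cyrillic) and cp1253 (Greek)
    # map byte 0xE9 to a different letter.
    print(guess, decode_as(raw, guess))
```

Only the CP-1252 guess reproduces the original word. Nothing in the file itself says which interpretation is the intended one, which is why the resolution below is either to choose a default encoding or to store the encoding in the document.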

Resolution

To tell dtSearch to use a specific character set instead of attempting to infer the encoding, click Options > Preferences > File Types, and select the character set to use under "Default character encoding".  For Western European languages, the "CP-1252 Windows Latin-1" option will work best.

Alternatively, you can remove the ambiguity from the documents themselves so that they contain the encoding information needed to interpret them accurately. How to do this depends on the file type.

Word 95 documents:  Open the document in a current version of Word and save it.  

HTML files:  In your HTML editor, save the document using the UTF-8 encoding with a META tag near the top of the file indicating this.  The META tag will look like this:  

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
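If many HTML files need this change, the same step can be scripted. The following Python sketch is one possible approach, not a dtSearch feature. It assumes that you already know the encoding the file was originally saved in and that the file has a head element with no charset declaration yet. The function name to_utf8_html is hypothetical.

```python
import re

# The META tag shown above, inserted verbatim.
META = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">'

def to_utf8_html(raw: bytes, source_encoding: str) -> bytes:
    """Return the HTML re-encoded as UTF-8 with the META tag after <head>.

    Hypothetical helper. Assumes source_encoding is the encoding the file
    was really saved in; a wrong value here bakes the damage into the file.
    """
    text = raw.decode(source_encoding)
    # Refuse to touch files that already declare a charset, to avoid
    # leaving two conflicting declarations in the same document.
    if re.search(r"charset\s*=", text, flags=re.IGNORECASE):
        raise ValueError("document already declares a charset")
    # Match <head> or <head ...attributes>, but not <header>.
    head = re.search(r"<head(\s[^>]*)?>", text, flags=re.IGNORECASE)
    if head is None:
        raise ValueError("no <head> element found")
    pos = head.end()
    text = text[:pos] + "\n" + META + text[pos:]
    return text.encode("utf-8")
```

To use it on a file, read the file in binary mode, pass the bytes and the known source encoding to the function, and write the result back in binary mode. Keep a backup of the originals until the converted files have been checked.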

Plain text files:  Open the file in Notepad, click File > Save As, and select "Unicode text" as the format.
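For a large number of text files, the conversion can be scripted instead of using Notepad. The following Python sketch is an illustration under two assumptions: that you know the encoding each file was originally saved in, and that a UTF-16 little-endian file with a byte order mark, which is what Notepad's "Unicode" option has traditionally written, is an acceptable target. The function name to_utf16_with_bom is hypothetical.

```python
import codecs

def to_utf16_with_bom(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode single-byte text as UTF-16 LE, prefixed with a byte order mark.

    Hypothetical helper. The byte order mark at the start of the output is
    what lets a reader identify the encoding without guessing.
    """
    text = raw.decode(source_encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Example: "café" stored as CP-1252 becomes BOM + two bytes per character.
converted = to_utf16_with_bom(b"caf\xe9", "cp1252")
```

As with the HTML case, the source encoding has to be supplied correctly. The script removes the ambiguity for later readers of the file, but it cannot detect the original encoding for you.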

dtSearch Engine Applications

In an application that uses the dtSearch Engine, follow these steps to set the default character encoding for documents that lack a specified encoding:

(1) Start dtSearch Desktop, click Options > Preferences > File Types, and select the character set to use under "Default character encoding".  This setting will be saved in a file named filetype.xml in your dtSearch user data folder.

(2) Locate the filetype.xml file in your dtSearch UserData folder and copy this file into one of your application's folders.

(3) Each time your application starts, use Options.FileTypeTableFile to tell the dtSearch Engine to use the copy of filetype.xml that you placed in your application's folder in step (2).