The File Type Table

How to use the file type table to control file type detection.

Remarks

For nearly all of the file formats that dtSearch supports, dtSearch can detect the format automatically from the binary contents of the file. For these formats, you do not need to specify the format.

However, some file formats do not contain sufficient information to identify the file type reliably, and some file types can be ambiguous. For example, when indexing a file that contains HTML as well as extensive script data, in some cases you may want to index the whole file as text (to make the script source code searchable), while in other cases you may want only the visible text to be searchable (when implementing a search function for a web site). In these situations, the file type table provides a way to control how dtSearch indexes your data.

The file type table can also be used to specify that some files should be indexed using installed IFilters instead of the file parsers included with dtSearch.

filetype.xml

The file type table is an XML file, named filetype.xml by default, containing a list of rules. Each rule has a list of filename filters identifying the files covered by the rule, and a numerical identifier (from the dtsInputType enumeration) specifying the file format.

Additionally, the file type table contains a string value identifying the default character encoding to use for single-byte files.

The easiest way to create a filetype.xml file is to use the Options > Preferences > File Types dialog box in dtSearch Desktop. The rules will be stored in a filetype.xml file in your dtSearch UserData folder.

Once you have created a filetype.xml file, use Options.FileTypeTableFile to tell the dtSearch Engine to use it.

Example:

<?xml version="1.0" encoding="UTF-8" ?>
<dtSearchFileTypeRules>
    <DefaultEncoding>Auto-detect (Recommended)</DefaultEncoding>
    <Item>
        <Name>IFilter</Name>
        <TypeId>265</TypeId>
        <Filters>*.vsd</Filters>
        <Flags>1</Flags>
    </Item>
</dtSearchFileTypeRules>

Item

Each rule is enclosed in <Item>...</Item> tags.

Name

The name identifies each rule in the dtSearch Desktop Options > Preferences > File Types dialog box. It has no effect on how files are indexed.

TypeId

A member of the dtsInputType enumeration, specifying the file parser to use for documents matching this rule.

Filters

Filename filters specifying the documents matching this rule.

Flags

The following values for Flags are supported:

1 - Override the file parser detection in the dtSearch Engine. Without this flag, dtSearch first tries to match the file against known, identifiable formats such as Word, and applies the rule only to documents that do not match a known format. This flag only affects choices between potentially ambiguous formats (such as text vs. CSV or HTML vs. XML). It cannot, for example, cause the Word file parser to open a PDF file, because these file formats have unambiguous binary signatures.

2 - Disable the file parser. For example, if you set TypeId to 261 (it_CSV) and Flags to 2, this will disable the CSV file parser, so CSV files will be interpreted as plain text. The filename filter is ignored when this flag is set.

4 - Enable the file parser. Use this to turn on file parsers that are disabled by default. For example, if you set TypeId to 335 (it_MicrosoftAccessAsDocument) and Flags to 4, this will enable the file parser that handles each Microsoft Access database as a single long document instead of many shorter records.
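
As an illustration of the three Flags values, the following sketch shows a filetype.xml with one rule for each. The TypeId and Flags values come from the descriptions above; the Name values and the filters in the second and third rules are hypothetical placeholders. It is an assumption that the Filters element should still be present when Flags is 2 (where the filter is ignored), and that XML comments are accepted in this file; remove the comments if in doubt.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dtSearchFileTypeRules>
    <DefaultEncoding>Auto-detect (Recommended)</DefaultEncoding>
    <!-- Flags 1: override detection for *.vsd files (rule taken from the example above) -->
    <Item>
        <Name>IFilter</Name>
        <TypeId>265</TypeId>
        <Filters>*.vsd</Filters>
        <Flags>1</Flags>
    </Item>
    <!-- Flags 2: disable the CSV parser (261 = it_CSV); the filter is ignored -->
    <Item>
        <Name>Disable CSV parser</Name>
        <TypeId>261</TypeId>
        <Filters>*.csv</Filters>
        <Flags>2</Flags>
    </Item>
    <!-- Flags 4: enable the parser that treats each Access database as one document -->
    <Item>
        <Name>Access as single document</Name>
        <TypeId>335</TypeId>
        <Filters>*.mdb</Filters>
        <Flags>4</Flags>
    </Item>
</dtSearchFileTypeRules>
```

With the second rule in place, CSV files would be interpreted as plain text, as described for flag 2.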

DefaultEncoding

Single-byte text files and HTML files that do not include a meta tag specifying the encoding are inherently ambiguous. By default, dtSearch will try to infer the encoding used in these files. To override this behavior, you can use the DefaultEncoding value in filetype.xml to specify a particular encoding to use. Supported values for this field are:

Auto-detect (Recommended)
CP1250 Windows Eastern European
CP1251 Windows Cyrillic
CP1252 Windows Latin-1
CP1253 Windows Greek
CP1254 Windows Turkish
CP1255 Windows Hebrew
CP1256 Windows Arabic
CP1257 Windows Baltic
CP1258 Windows Vietnamese
ISO8859-1 Latin-1
ISO8859-2 Latin-2
ISO8859-3 Latin-3
ISO8859-4 Latin-4
ISO8859-5 Cyrillic
ISO8859-6 Arabic
ISO8859-7 Greek
ISO8859-8 Hebrew
ISO8859-9 Latin-5
CP437 MS-DOS
CP737 PC Greek
CP775 PC Baltic
CP850 MS-DOS Latin-1
CP852 MS-DOS Latin-2
CP855 IBM Cyrillic
CP856 IBM Hebrew
CP857 IBM Turkish
CP860 MS-DOS Portuguese
CP861 MS-DOS Icelandic
CP862 PC Hebrew
CP863 MS-DOS Canadian French
CP864 PC Arabic
CP865 MS-DOS Nordic
CP866 MS-DOS Russian
CP869 IBM Modern Greek
CP874 IBM Thai
CP875 IBM Greek
GB2312
Shift-JIS
Big5
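
For example, to treat ambiguous single-byte files as Windows Latin-1 instead of relying on auto-detection, the DefaultEncoding element inside dtSearchFileTypeRules might look like the fragment below. This is a sketch that assumes the value must match one of the entries in the list above exactly, including the descriptive text, as the "Auto-detect (Recommended)" value in the earlier example suggests.

```xml
<DefaultEncoding>CP1252 Windows Latin-1</DefaultEncoding>
```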
