dtSearch Text Retrieval Engine Programmer's Reference
The File Type Table

How to use the File Type Table file to control file type detection.

For nearly all of the file formats that dtSearch supports, it is possible to detect the format automatically based on the binary contents of the file. For these formats, it is not necessary to specify the format for dtSearch. 

However, some file formats do not contain sufficient information to identify the file type reliably, and some file types can be ambiguous. For example, when indexing a file that contains HTML as well as extensive script data, in some cases you may want to index the whole file as text (to make the script source code searchable), while in other cases you may want only the visible text to be searchable (when implementing a search function for a web site). In these situations, the file type table provides a way to control how dtSearch indexes your data. 

The file type table can also be used to specify that some files should be indexed using installed IFilters instead of the file parsers included with dtSearch.

filetype.xml

The file type table is an XML file, named filetype.xml by default, containing a list of rules. For each rule, there will be a list of filename filters identifying the files covered by the rule, and a numerical identifier (from the dtsInputType enumeration) specifying the file format. 

Additionally, the file type table contains a string value identifying the default character encoding to use for single-byte files. 
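As a rough illustration of this structure, the following Python sketch reads a file type table and lists its rules and default encoding. It uses only the standard library and does not call the dtSearch Engine. The element names are taken from the example in this topic; the function name `read_rules` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample table using the element names shown in this topic's example.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<dtSearchFileTypeRules>
  <DefaultEncoding>Auto-detect (Recommended)</DefaultEncoding>
  <Item>
    <Name>IFilter</Name>
    <TypeId>265</TypeId>
    <Filters>*.vsd</Filters>
    <Flags>1</Flags>
  </Item>
</dtSearchFileTypeRules>"""


def read_rules(xml_bytes):
    """Return (default_encoding, rules) parsed from a file type table.

    Hypothetical helper for illustration; not part of the dtSearch API.
    """
    root = ET.fromstring(xml_bytes)
    default_encoding = root.findtext("DefaultEncoding", default="")
    rules = []
    for item in root.findall("Item"):
        rules.append({
            "name": item.findtext("Name", default=""),
            # TypeId is a numeric member of the dtsInputType enumeration.
            "type_id": int(item.findtext("TypeId", default="0")),
            "filters": item.findtext("Filters", default=""),
            "flags": int(item.findtext("Flags", default="0")),
        })
    return default_encoding, rules


default_encoding, rules = read_rules(SAMPLE)
print("Default encoding:", default_encoding)
for rule in rules:
    print(rule)
```

To inspect a table saved on disk, the file contents can be read in binary mode and passed to the same function.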

The easiest way to create a filetype.xml file is to use the Options > Preferences > File Types dialog box in dtSearch Desktop. The rules will be stored in a filetype.xml file in your dtSearch UserData folder. 

Once you have created a filetype.xml file, use Options.FileTypeTableFile to tell the dtSearch Engine to use it. 

Example:

<?xml version="1.0" encoding="UTF-8" ?>
<dtSearchFileTypeRules>
    <DefaultEncoding>Auto-detect (Recommended)</DefaultEncoding>
    <Item>
        <Name>IFilter</Name>
        <TypeId>265</TypeId>
        <Filters>*.vsd</Filters>
        <Flags>1</Flags>
    </Item>
</dtSearchFileTypeRules>

Item

Each rule is enclosed in <Item>...</Item> tags.

Name

The name identifies each rule in the dtSearch Desktop Options > Preferences > File Types dialog box. It has no other effect.

TypeId

A member of the dtsInputType enumeration, specifying the file parser to use for documents matching this rule.

Filters

Filename filters specifying the documents matching this rule.

Flags

The following values for Flags are supported: 

1 - Override the file parser detection in the dtSearch Engine (otherwise, dtSearch will try to match the file against known, identifiable formats such as Word and will only apply the rule to documents that do not match a known format). This flag will only affect choices between potentially ambiguous formats (such as text vs. CSV or HTML vs XML). It cannot cause the Word file parser, for example, to open a PDF file, because these file formats have unambiguous binary signatures. 

2 - Disable the file parser. For example, if you set TypeId to 261 (it_CSV) and Flags to 2, this will disable the CSV file parser, so CSV files will be interpreted as plain text. The filename filter is ignored when this flag is set. 

4 - Enable the file parser. Use this to turn on file parsers that are disabled by default. For example, if you set TypeId to 335 (it_MicrosoftAccessAsDocument) and Flags to 4, this will enable the file parser that handles each Microsoft Access database as a single long document instead of many shorter records. 
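The three flag values can be sketched as rules in a generated table. The Python example below is a minimal illustration using only the standard library; it does not call the dtSearch Engine. The TypeId values 265, 261, and 335 and the Flags values 1, 2, and 4 come from this topic. The enumeration `RuleFlag`, the helper `make_item`, the rule names, and the `*.csv` and `*.mdb` filters are hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET
from enum import IntEnum


class RuleFlag(IntEnum):
    """Hypothetical names for the Flags values documented above."""
    OVERRIDE_DETECTION = 1  # apply the rule even if detection disagrees
    DISABLE_PARSER = 2      # turn off the parser named by TypeId
    ENABLE_PARSER = 4       # turn on a parser that is off by default


def make_item(name, type_id, filters, flag):
    """Build one <Item> rule element (illustrative helper)."""
    item = ET.Element("Item")
    ET.SubElement(item, "Name").text = name
    ET.SubElement(item, "TypeId").text = str(type_id)
    ET.SubElement(item, "Filters").text = filters
    ET.SubElement(item, "Flags").text = str(int(flag))
    return item


root = ET.Element("dtSearchFileTypeRules")
ET.SubElement(root, "DefaultEncoding").text = "Auto-detect (Recommended)"

# Flag 1: rule from the example in this topic.
root.append(make_item("IFilter", 265, "*.vsd", RuleFlag.OVERRIDE_DETECTION))
# Flag 2: disable the CSV parser (it_CSV). The filter is ignored for this
# flag, so "*.csv" is only a placeholder.
root.append(make_item("Disable CSV", 261, "*.csv", RuleFlag.DISABLE_PARSER))
# Flag 4: enable it_MicrosoftAccessAsDocument. The "*.mdb" filter is an
# assumption; this topic does not say whether a filter is needed here.
root.append(make_item("Access as document", 335, "*.mdb", RuleFlag.ENABLE_PARSER))

ET.indent(root)  # pretty-print; requires Python 3.9 or later
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Each rule here carries a single flag value. This topic does not state whether values may be combined, so the sketch does not attempt it.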

 

DefaultEncoding

Single-byte text files and HTML files that do not include a meta tag specifying the encoding are inherently ambiguous. By default, dtSearch will try to infer the encoding used in these files. To override this behavior, you can use the DefaultEncoding value in filetype.xml to specify a particular encoding to use. Supported values for this field are:

Auto-detect (Recommended)
CP1250 Windows Eastern European
CP1251 Windows Cyrillic
CP1252 Windows Latin-1
CP1253 Windows Greek
CP1254 Windows Turkish
CP1255 Windows Hebrew
CP1256 Windows Arabic
CP1257 Windows Baltic
CP1258 Windows Vietnamese
ISO8859-1 Latin-1
ISO8859-2 Latin-2
ISO8859-3 Latin-3
ISO8859-4 Latin-4
ISO8859-5 Cyrillic
ISO8859-6 Arabic
ISO8859-7 Greek
ISO8859-8 Hebrew
ISO8859-9 Latin-5
CP437 MS-DOS
CP737 PC Greek
CP775 PC Baltic
CP850 MS-DOS Latin-1
CP852 MS-DOS Latin-2
CP855 IBM Cyrillic
CP856 IBM Hebrew
CP857 IBM Turkish
CP860 MS-DOS Portuguese
CP861 MS-DOS Icelandic
CP862 PC Hebrew
CP863 MS-DOS Canadian French
CP864 PC Arabic
CP865 MS-DOS Nordic
CP866 MS-DOS Russian
CP869 IBM Modern Greek
CP874 IBM Thai
CP875 IBM Greek
GB2312
Shift-JIS
Big5
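A script that maintains filetype.xml might update DefaultEncoding along the following lines. This Python sketch uses only the standard library and does not call the dtSearch Engine. It assumes that the element text must match an entry from the list above verbatim, as "Auto-detect (Recommended)" does in the example; that assumption is not confirmed by this topic. The function name `set_default_encoding` is hypothetical, and the set of allowed values is abbreviated.

```python
import xml.etree.ElementTree as ET

# Abbreviated for the example; extend with the remaining supported values.
SUPPORTED_ENCODINGS = {
    "Auto-detect (Recommended)",
    "CP1252 Windows Latin-1",
    "ISO8859-1 Latin-1",
    "Shift-JIS",
}

TABLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<dtSearchFileTypeRules>
  <DefaultEncoding>Auto-detect (Recommended)</DefaultEncoding>
</dtSearchFileTypeRules>"""


def set_default_encoding(xml_bytes, encoding_name, supported=SUPPORTED_ENCODINGS):
    """Return the table as text with DefaultEncoding set to encoding_name.

    Hypothetical helper; assumes the value is stored exactly as listed.
    """
    if encoding_name not in supported:
        raise ValueError(f"Unsupported DefaultEncoding value: {encoding_name!r}")
    root = ET.fromstring(xml_bytes)
    node = root.find("DefaultEncoding")
    if node is None:
        # Add the element if the table does not have one yet.
        node = ET.Element("DefaultEncoding")
        root.insert(0, node)
    node.text = encoding_name
    return ET.tostring(root, encoding="unicode")


updated = set_default_encoding(TABLE, "CP1252 Windows Latin-1")
print(updated)
```

Rejecting values outside the list keeps a typo from silently producing a table with an unrecognized encoding name.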
Copyright (c) 1995-2021 dtSearch Corp. All rights reserved.