How to use the File Type Table file to control file type detection.
For nearly all of the file formats that dtSearch supports, it is possible to detect the format automatically based on the binary contents of the file. For these formats, it is not necessary to specify the format for dtSearch.
However, some file formats do not contain sufficient information to identify the file type reliably, and some file types can be ambiguous. For example, when indexing a file that contains HTML as well as extensive script data, in some cases you may want to index the whole file as text (to make the script source code searchable), while in other cases you may want only the visible text to be searchable (when implementing a search function for a web site). In these situations, the file type table provides a way to control how dtSearch indexes your data.
The file type table can also be used to specify that some files should be indexed using installed IFilters instead of the file parsers included with dtSearch.
The file type table is an XML file, named filetype.xml by default, containing a list of rules. Each rule has a list of filename filters identifying the files it covers and a numerical identifier (from the dtsInputType enumeration) specifying the file format.
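To make the shape of a rule concrete, here is a minimal Python sketch of how a rule might be modeled and matched against a filename. The `Rule` class and `find_rule` function are hypothetical and not part of the dtSearch Engine API. The wildcard semantics (Python's `fnmatch`, compared case-insensitively) and the first-match-wins order are assumptions made for illustration; dtSearch's own filter matching may differ.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical in-memory model of one file type table rule."""
    name: str           # label for the rule
    type_id: int        # numeric dtsInputType value
    filters: List[str]  # filename filters, for example ["*.csv"]


def find_rule(filename: str, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule with a filter matching filename, or None."""
    lowered = filename.lower()
    for rule in rules:
        # Assumption: filters are shell-style wildcards, compared case-insensitively.
        if any(fnmatch(lowered, pattern.lower()) for pattern in rule.filters):
            return rule
    return None


# 261 is the dtsInputType value for it_CSV.
rules = [Rule(name="CSV exports", type_id=261, filters=["*.csv", "export_*.txt"])]

match = find_rule("export_2024.txt", rules)
print(match.type_id if match else "no rule")  # prints 261
```

A file that matches no filter, such as `notes.md` in this sketch, yields `None`, which corresponds to leaving format detection to dtSearch.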
Additionally, the file type table contains a string value identifying the default character encoding to use for single-byte files.
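A default encoding matters because a single-byte file does not say which code page it was written in, so the same bytes can decode to different text. The short Python sketch below illustrates this. The encoding names are Python codec names chosen for illustration, not the values that filetype.xml accepts, and `decode_candidates` is a hypothetical helper.

```python
from typing import Dict, Iterable


def decode_candidates(raw: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Decode the same bytes under each candidate single-byte encoding."""
    return {encoding: raw.decode(encoding) for encoding in encodings}


# The final byte 0xE9 is a different character in each code page.
raw = b"caf\xe9"
for encoding, text in decode_candidates(raw, ["latin-1", "cp1251"]).items():
    print(encoding, text)
```

Under Latin-1 the last byte is "é", while under Windows-1251 it is the Cyrillic letter "й". Nothing in the file itself says which reading is correct.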
The easiest way to create a filetype.xml file is to use the Options > Preferences > File Types dialog box in dtSearch Desktop. The rules will be stored in a filetype.xml file in your dtSearch UserData folder.
Once you have created a filetype.xml file, use Options.FileTypeTableFile to tell the dtSearch Engine to use it.
Example:
Each rule is enclosed in <Item>...</Item> tags.
The name is used in the dtSearch Desktop Options > Preferences > File Types dialog box to identify each rule. It has no effect outside the dtSearch Desktop user interface.
TypeId is a member of the dtsInputType enumeration, specifying the file parser to use for documents matching this rule.
The filename filters specify which documents match this rule.
The following values for Flags are supported:
1 - Override the file parser detection in the dtSearch Engine. Without this flag, dtSearch tries to match the file against known, identifiable formats such as Word and applies the rule only to documents that do not match a known format. This flag only affects choices between potentially ambiguous formats (such as text vs. CSV or HTML vs. XML). It cannot cause the Word file parser, for example, to open a PDF file, because these file formats have unambiguous binary signatures.
2 - Disable the file parser. For example, if you set TypeId to 261 (it_CSV) and Flags to 2, this will disable the CSV file parser, so CSV files will be interpreted as plain text. The filename filter is ignored when this flag is set.
4 - Enable the file parser. Use this to turn on file parsers that are disabled by default. For example, if you set TypeId to 335 (it_MicrosoftAccessAsDocument) and Flags to 4, this will enable the file parser that handles each Microsoft Access database as a single long document instead of many shorter records.
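The three Flags values can be summarized in a small Python sketch. The `RuleFlags` names and the `describe` function are hypothetical and exist only for illustration. Modeling the values as a bit field is also an assumption: they are powers of two, but whether they can be combined in one rule is not stated here.

```python
from enum import IntFlag


class RuleFlags(IntFlag):
    """Hypothetical names for the documented Flags values."""
    OVERRIDE_DETECTION = 1  # prefer the rule between ambiguous formats
    DISABLE_PARSER = 2      # turn the parser off; filename filter ignored
    ENABLE_PARSER = 4       # turn on a parser that is off by default


def describe(type_id: int, flags: int) -> str:
    """Summarize what a rule with the given TypeId and Flags asks for."""
    value = RuleFlags(flags)
    if RuleFlags.DISABLE_PARSER in value:
        return f"disable parser for type {type_id}; filename filters ignored"
    if RuleFlags.ENABLE_PARSER in value:
        return f"enable parser for type {type_id}"
    if RuleFlags.OVERRIDE_DETECTION in value:
        return f"use type {type_id} even where detection is ambiguous"
    return f"use type {type_id} only when no known format is detected"


print(describe(261, 2))  # it_CSV with Flags 2
print(describe(335, 4))  # it_MicrosoftAccessAsDocument with Flags 4
```

The two calls correspond to the CSV and Microsoft Access examples given for flags 2 and 4.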
Single-byte text files and HTML files that do not include a meta tag specifying the encoding are inherently ambiguous. By default, dtSearch will try to infer the encoding used in these files. To override this behavior, you can use the DefaultEncoding value in filetype.xml to specify a particular encoding to use. Supported values for this field are: