dtSearch Text Retrieval Engine Programmer's Reference

Options to control the formatting of FileConverter output.

Output format

Use FileConverter.OutputFormat to specify the type of output to generate. Supported output formats:

itAnsi: Ansi text output. Ansi text can express only a very limited set of characters, so itUTF8 is recommended for plain text output.
itHTML: HTML output. For HTML input, the conversion leaves the original tags in place, with some exceptions. See Highlighting hits in HTML files for more information.
itRTF: Rich Text Format (RTF) output.
itUnformattedHTML: Unformatted HTML output. Characters such as < and > are HTML-encoded; otherwise the output is the same as plain text output. It is intended for generating a synopsis to include in search results, so all formatting, including line breaks and fonts, is removed.
itUTF8: Plain text output, encoded as UTF-8. The output does not include a UTF-8 byte-order mark (BOM) unless you set the dtsConvertIncludeBOM flag in FileConverter.
itXML: XML output, which can only be generated from XML input. See Highlighting hits in XML files.
itContentAsXml: Organizes document content, metadata, and attachments into a standard XML format. This format is intended for content extraction rather than hit highlighting.
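
The character-set trade-offs among the text formats above can be reproduced outside dtSearch. The following Python sketch is not dtSearch code and calls no dtSearch API. It assumes that "Ansi" corresponds to the Windows-1252 code page (an assumption; the effective code page may differ by system), and the function names are hypothetical helpers made up for illustration.

```python
import codecs
import html


def fits_in_ansi(text: str, codepage: str = "cp1252") -> bool:
    """Return True if every character in text can be encoded in the code page.

    Assumption: "Ansi" is modelled here as Windows-1252 (cp1252).
    """
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 byte-order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def synopsis_html(text: str) -> str:
    """Collapse all whitespace (including line breaks) and HTML-encode the text.

    This only approximates the idea of unformatted HTML output: plain text
    with characters such as < and > encoded.
    """
    return html.escape(" ".join(text.split()), quote=False)


if __name__ == "__main__":
    print(fits_in_ansi("Résumé"))                    # Latin-1 range characters
    print(fits_in_ansi("東京"))                       # outside Windows-1252
    print(has_utf8_bom("text".encode("utf-8")))      # plain UTF-8, no BOM
    print(has_utf8_bom("text".encode("utf-8-sig")))  # UTF-8 with BOM
    print(synopsis_html("a < b\n  and c > d"))
```

A check like has_utf8_bom can be useful when a downstream consumer of plain text output needs to know whether a BOM was written.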

dtSearch automatically detects and extracts metadata from converted documents, such as the Subject, To, and From fields of an email, or the document properties of a Word document. For details on the types of metadata extracted, see Supported File Formats.

To control metadata extraction, set Options.FieldFlags before executing a conversion. If you are highlighting hits in a retrieved document, the value of FieldFlags should be identical to the value in effect when the documents were indexed. Otherwise, the change in extracted content could result in incorrect hit highlighting.

Attachments and images

FileConverter can extract embedded attachments, images, or other content such as OLE objects from the input file when performing a conversion. To enable extraction of embedded content, set FileConverter.ExtractionOptions to an ExtractionOptions object specifying the types of content to extract and the location for the extracted files.

HTML output options

To specify content to go inside the <HEAD>...</HEAD> tags in HTML output, use FileConverter.HtmlHead. 

To specify a tag such as <!DOCTYPE html> to go before the default <HTML> tag at the top of the file, use FileConverter.DocTypeTag. 

To specify a <BASE> href for HTML output, use FileConverter.BaseHRef. 
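
As a rough picture of where these three values end up, the Python sketch below assembles the top of an HTML file from a DOCTYPE tag, head content, and a base href. It is an illustration only: dtSearch generates the actual markup, the function html_prologue is hypothetical, and placing the <BASE> element inside <HEAD> follows general HTML practice rather than anything documented here.

```python
from html import escape


def html_prologue(doc_type_tag: str = "", html_head: str = "", base_href: str = "") -> str:
    """Build the opening lines of an HTML document (illustrative layout only)."""
    parts = []
    if doc_type_tag:
        parts.append(doc_type_tag)  # appears before the <HTML> tag
    parts.append("<HTML>")
    parts.append("<HEAD>")
    if base_href:
        # Assumption: the base href is emitted as a <BASE> element in the head.
        parts.append(f'<BASE href="{escape(base_href, quote=True)}">')
    if html_head:
        parts.append(html_head)  # caller-supplied content inside <HEAD>...</HEAD>
    parts.append("</HEAD>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(html_prologue(
        doc_type_tag="<!DOCTYPE html>",
        html_head="<title>Converted document</title>",
        base_href="https://example.com/docs/",
    ))
```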

To control the formatting used for metadata tables and attachment delimiters in HTML output, set the dtsConvertUseStyles flag in FileConverter, and include CSS styles in FileConverter.HtmlHead in <style>...</style> tags. Currently, the following standard styles are used in HTML output when dtsConvertUseStyles is set:

Elements formatted by the standard styles:
Table containing metadata names and values, such as document properties and email to/from/subject/date.
Table cell containing a field name, such as Subject, To, From, or Author.
Table cell containing a field value, such as the subject of an email.
The name of an attachment included in the conversion output.
The name of an embedded file included in the conversion output.
Break between logical divisions in a document, such as slides, worksheets, or pages.
The name of a worksheet in a spreadsheet.

For an example of a style sheet implementing these styles, see the DocStyles.css file included with dtSearch, which is installed in the dtSearch templates folder.
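
A value for FileConverter.HtmlHead that carries CSS rules is just a string containing a <style>...</style> block. The Python sketch below builds such a string from a mapping of selectors to declarations. The helper build_style_block is hypothetical, and the selectors .example-field-table and .example-field-name are placeholders, not dtSearch style names; the real names should be taken from DocStyles.css.

```python
from typing import Mapping


def build_style_block(rules: Mapping[str, Mapping[str, str]]) -> str:
    """Render CSS rules as a <style>...</style> block, one rule per line."""
    lines = ["<style>"]
    for selector, declarations in rules.items():
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        lines.append(f"{selector} {{ {body} }}")
    lines.append("</style>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder selectors: substitute the style names from DocStyles.css.
    rules = {
        ".example-field-table": {"border-collapse": "collapse"},
        ".example-field-name": {"font-weight": "bold", "padding": "2px 8px"},
    }
    html_head = build_style_block(rules)
    print(html_head)
```

The resulting string would then be assigned to FileConverter.HtmlHead, with the dtsConvertUseStyles flag set, through whichever dtSearch language binding is in use.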