dtSearch Text Retrieval Engine Programmer's Reference
Adding Fields to Documents

Associating metadata with documents, and indexing metadata already present in documents.

The dtSearch Engine supports several mechanisms for indexing fields in, or associated with, documents. 

1. For file formats that support embedded metadata, dtSearch can automatically detect the metadata and index it as fields. For example, document summary information fields in Word documents are automatically indexed as fields. 

2. Using the data source indexing API, fields can be associated with documents when they are indexed. 

3. Fields can be added to HTML files using a comment-based syntax. 

4. Fields can be detected based on markers in the text, using the "Text Fields" feature.

Automatically-detected fields

dtSearch automatically detects and indexes fields in file formats that support internal metadata, such as document summary information fields in Word files. For a list of formats and details on the fields detected in each format, see "Automatically Detected Fields".

To prevent these automatically detected fields from being indexed, set the dtsoFfSkipDocumentProperties flag in Options.FieldFlags. This setting does not affect CSV, XML, or DBF files.

The NTFS file system supports file properties for other formats. Set the dtsoFfShowNtfsProperties flag in Options.FieldFlags to have the dtSearch Engine check for and index these properties, where present.

Fields in Custom File Formats or Data Sources

If you are indexing a data source, you can associate searchable fields with documents as they are indexed. This makes it possible to associate metadata with documents without modifying the original files. In the C++ API, return fields in dtsInputStream.fields for each data source document. In the other APIs, return fields in the DocFields property. For more information, see "Indexing Databases".

Adding Fields to HTML Files

Fields can be added to HTML files in two ways. 

(1) META tags. A META tag looks like this:

<meta name="FieldName" content="Field value">

META tags are not displayed in the user's web browser, so they are a good choice for data that must be searchable but not visible.
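When META tags are generated by a script rather than written by hand, the field name and value need to be escaped so that quotes or angle brackets in the data do not break the tag. The following is a minimal Python sketch of that step. The helper name meta_tag is hypothetical and is not part of the dtSearch API; it only produces text in the form shown above.

```python
from html import escape


def meta_tag(field_name: str, value: str) -> str:
    """Build a META tag in the form shown above (hypothetical helper)."""
    # quote=True also escapes double and single quotes, which matters
    # because both values are placed inside quoted attributes.
    return '<meta name="{}" content="{}">'.format(
        escape(field_name, quote=True),
        escape(value, quote=True),
    )


print(meta_tag("FieldName", "Field value"))
# prints: <meta name="FieldName" content="Field value">
```

The resulting line would normally be placed in the HEAD section of the HTML file.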

(2) HTML comments allow you to designate visible text in HTML files as belonging to a field. Use a comment that contains "field: name" to mark the start of a field and a comment with just "field:" to end a field. A field automatically ends when the next field begins. Use / to separate components of nested field names. Example:

<!-- field: Name --> Joe Smith <!-- field: Address/Street --> 123 Oak Street <!-- field: --> This is not part of any field <!-- field: Address/City --> Middleton <!-- field: Address/State --> Maryland <!-- field: -->
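To check how a file's field comments divide its text before indexing, the rules above can be approximated in a few lines of Python. This is only an illustrative sketch of the syntax as described here, not dtSearch's parser: the function name split_fields and the regular expression are assumptions, and other HTML markup is not interpreted.

```python
import re

# Matches "<!-- field: Name -->" as well as the bare "<!-- field: -->"
# that ends a field. Group 1 is the field name, or "" for the terminator.
FIELD_COMMENT = re.compile(r"<!--\s*field:\s*(.*?)\s*-->", re.IGNORECASE)


def split_fields(html: str):
    """Return (field_name_or_None, text) pairs in document order."""
    result = []
    current = None  # None means the text is not inside any field
    pos = 0
    for match in FIELD_COMMENT.finditer(html):
        text = html[pos:match.start()].strip()
        if text:
            result.append((current, text))
        # A new field comment ends the previous field automatically;
        # an empty name just ends it without starting another.
        current = match.group(1) or None
        pos = match.end()
    tail = html[pos:].strip()
    if tail:
        result.append((current, tail))
    return result


sample = (
    "<!-- field: Name --> Joe Smith "
    "<!-- field: Address/Street --> 123 Oak Street "
    "<!-- field: --> This is not part of any field"
)
for name, text in split_fields(sample):
    print(name, "=>", text)
```

For the sample string, the sketch reports "Joe Smith" under Name, "123 Oak Street" under Address/Street, and the remaining text under no field.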

Comments can also be used to exclude portions of HTML from indexing and searching. To exclude a block of HTML, surround it with comments written exactly as in the following example:

<!--BeginNoIndex--> ... nothing here will be searchable... <!--EndNoIndex-->
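To preview which text remains searchable once the excluded blocks are removed, a small Python sketch such as the following can help. It is an approximation for checking files, not the engine's own logic. The function name strip_no_index is hypothetical, and the sketch deliberately matches only the exact spelling shown above, because this section does not say whether variations in spacing or case are accepted.

```python
import re

# Exact, case-sensitive markers as shown in the example above.
# DOTALL lets an excluded block span several lines.
NO_INDEX = re.compile(r"<!--BeginNoIndex-->.*?<!--EndNoIndex-->", re.DOTALL)


def strip_no_index(html: str) -> str:
    """Remove every BeginNoIndex/EndNoIndex block, markers included."""
    return NO_INDEX.sub("", html)


page = "Visible <!--BeginNoIndex--> menu links <!--EndNoIndex--> text"
print(strip_no_index(page))
```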


If the dtsoFfHtmlIndexHeadersAsFields flag is set in Options.FieldFlags, the title of an HTML file is automatically placed in an HtmlTitle field, and content inside heading tags (<H1>, <H2>, etc.) is automatically placed in the corresponding field (HtmlH1, HtmlH2, etc.). In versions prior to 7.88, this behavior was enabled by default and could be suppressed with the dtsoFfHtmlNoHeaderFields flag.

Fields in Text Files

To define searchable fields in existing text documents that use a standard format, use the Options > Preferences > Text Fields dialog box in dtSearch Desktop to create field definitions. The dtSearch Engine uses these definitions to extract fields from the documents, based on markers in the text. The text field definitions are stored in a file named fields.xml. To make the dtSearch Engine use these definitions during indexing, set Options.TextFieldsFile to the full path and filename of the fields.xml file.

Punctuation in Field Names

dtSearch removes punctuation from field names when storing fields in the index, with these exceptions: :&_+=. 

Spaces are also removed. The hyphen is mapped to an underscore. 

When searching, only searchable letters in field names are treated as significant.
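The storage rules above can be approximated in Python to predict how a field name is likely to appear in the index. This is a sketch under stated assumptions, not the engine's implementation: the function name normalize_field_name is hypothetical, "punctuation" is taken to mean ASCII punctuation, the period that ends the list of exceptions is assumed not to be one of them, and the function is meant for a single name component (the "/" separator of nested names is not handled specially).

```python
import string

# Punctuation listed above as preserved in stored field names.
# Assumption: the trailing period in that list only ends the sentence.
KEPT = set(":&_+=")


def normalize_field_name(name: str) -> str:
    """Approximate the stored form of one field name component."""
    out = []
    for ch in name:
        if ch == "-":
            out.append("_")        # the hyphen is mapped to an underscore
        elif ch.isspace():
            continue               # spaces are removed
        elif ch in string.punctuation and ch not in KEPT:
            continue               # other punctuation is removed
        else:
            out.append(ch)         # letters, digits, and kept punctuation
    return "".join(out)


print(normalize_field_name("Last-Modified By"))
# prints: Last_ModifiedBy
```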