Associating metadata with documents, and indexing metadata already present in documents.
The dtSearch Engine supports several mechanisms for indexing fields in, or associated with, documents.
1. For file formats that support embedded metadata, dtSearch can automatically detect the metadata and index it as fields. For example, document summary information fields in Word documents are automatically indexed as fields.
2. Using the data source indexing API, fields can be associated with documents when they are indexed.
3. Fields can be added to HTML files using META tags or a comment-based syntax.
4. Fields can be detected based on markers in the text, using the "Text Fields" feature.
dtSearch automatically detects and indexes fields in file formats that support internal metadata, such as document summary information fields in Word files. For a list of formats and details on the fields detected in each format, see Automatically Detected Fields.
To prevent fields from being indexed in documents, set the dtsoFfSkipDocumentProperties flag in Options.FieldFlags. This setting does not affect CSV, XML, or DBF files.
The NTFS file system supports file properties for formats that do not have embedded metadata of their own. Set the dtsoFfShowNtfsProperties flag in Options.FieldFlags to have the dtSearch Engine check for and index these properties, where present.
If you are indexing a data source, you can associate searchable fields with documents as they are indexed. This makes it possible to associate metadata with documents without modifying the original files. In the C++ API, return fields in dtsInputStream.fields for each data source document. In the other APIs, return fields in the DocFields property. For more information, see "Indexing Databases".
Fields can be added to HTML files in two ways.
(1) META tags. A META tag looks like this:
META tags are not displayed in the user's web browser, so they are a good choice for data that must be searchable but not visible.
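As a minimal sketch of the META tag approach, the following Python builds a small HTML page with one META tag per field. It assumes the standard HTML form, in which the name attribute carries the field name and the content attribute carries the value. The helper names (meta_tag, html_with_fields) and the sample fields (Department, Reviewer) are hypothetical and are not part of dtSearch.

```python
from html import escape


def meta_tag(name: str, value: str) -> str:
    """Return one META tag carrying a field name and its value (standard HTML form)."""
    return (
        f'<meta name="{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
    )


def html_with_fields(title: str, body_html: str, fields: dict) -> str:
    """Build a small HTML page whose <head> holds one META tag per field."""
    tags = "\n".join(meta_tag(n, v) for n, v in fields.items())
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        f"{tags}\n"
        "</head>\n"
        f"<body>{body_html}</body>\n</html>\n"
    )


# Hypothetical field names and values, for illustration only.
page = html_with_fields(
    "Quarterly report",
    "<p>Visible text.</p>",
    {"Department": "Finance", "Reviewer": 'A. "Sam" Lee'},
)
print(page)
```

Escaping the attribute values keeps quotation marks in the data from closing the attribute early.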
(2) HTML comments allow you to designate visible text in HTML files as belonging to a field. Use a comment that contains "field: name" to mark the start of a field and a comment with just "field:" to end a field. A field automatically ends when the next field begins. Use / to separate components of nested field names. Example:
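The comment syntax described above could be generated along the following lines. This is a sketch based only on the description: a start comment containing "field: name", an end comment containing just "field:", and / between the components of a nested name. The exact spacing inside the comments is an assumption, and the helper names and sample field names are hypothetical.

```python
FIELD_END = "<!-- field: -->"  # comment with just "field:" ends the current field


def field_start(name_parts):
    """Comment that opens a field; nested name components are joined with '/'."""
    return f"<!-- field: {'/'.join(name_parts)} -->"


def wrap_field(name_parts, visible_html):
    """Surround visible HTML with a start comment and an end comment."""
    return f"{field_start(name_parts)}{visible_html}{FIELD_END}"


# Hypothetical field names, for illustration only.
snippet = "\n".join(
    [
        wrap_field(["Author"], "<p>Jane Doe</p>"),
        # Starting Address/Zip ends Address/City automatically,
        # so only one end comment is needed for the pair.
        field_start(["Address", "City"]) + "<p>Springfield</p>",
        field_start(["Address", "Zip"]) + "<p>00000</p>" + FIELD_END,
    ]
)
print(snippet)
```

Because the text between the comments is ordinary HTML, it stays visible in the browser, unlike META tag content.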
Comments can also be used to exclude portions of HTML from indexing and searching. To exclude a block of HTML from indexing or searching, surround it with tags that look exactly as in the following example:
If the dtsoFfHtmlIndexHeadersAsFields flag is set in Options.FieldFlags, the Title of an HTML file is automatically placed in an HtmlTitle field, and content inside <H1>, <H2>, etc. tags is automatically placed in HtmlH1, HtmlH2, etc. fields. In versions prior to 7.88, this behavior was enabled by default and suppressed by the dtsoFfHtmlNoHeaderFields flag.
To define searchable fields in existing text documents that use a standard format, use the Options > Preferences > Text Fields dialog box in dtSearch Desktop to create field definitions. The dtSearch Engine uses these definitions to extract fields from the documents, based on markers in the text. The text field definitions are stored in a file named fields.xml. To make the dtSearch Engine use these definitions during indexing, set Options.TextFieldsFile to the full path and filename of the fields.xml file.
dtSearch removes punctuation from field names when storing fields in the index, with these exceptions: :&_+=.
Spaces are also removed. The hyphen is mapped to an underscore.
When searching, only searchable letters in field names are treated as significant.
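The storage rules above can be approximated with a short function, which may help when predicting how a field name will appear in an index. This is an illustrative sketch, not dtSearch's own implementation. It assumes that "punctuation" means the ASCII punctuation characters and that the period at the end of the exception list is itself one of the exceptions. It does not model the rule about searchable letters at search time. The function name stored_field_name is hypothetical.

```python
import string

# Assumption: the trailing "." in the exception list is a kept character.
KEPT_PUNCTUATION = set(":&_+=.")


def stored_field_name(name: str) -> str:
    """Approximate the field name as stored: hyphen becomes an underscore,
    spaces are dropped, and other ASCII punctuation is removed unless kept."""
    out = []
    for ch in name:
        if ch == "-":
            out.append("_")  # hyphen is mapped to an underscore
        elif ch.isspace():
            continue  # spaces are removed
        elif ch in string.punctuation and ch not in KEPT_PUNCTUATION:
            continue  # other punctuation is removed
        else:
            out.append(ch)
    return "".join(out)


for raw in ["Last-Modified By", "Doc (Title)!", "dc:creator"]:
    print(raw, "->", stored_field_name(raw))
```

Under these assumptions, "Last-Modified By" becomes "Last_ModifiedBy", "Doc (Title)!" becomes "DocTitle", and "dc:creator" is unchanged.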