How to add fields to documents during indexing

Article: dts0150

Applies to: dtSearch Engine

When documents are indexed, it is sometimes useful to add one or more custom fields to the index along with each document, without changing the original file. For example, a document management program may have a series of document attributes (Author, Subject, Department, etc.) that must be associated with each file checked into the document manager.

One way to add field data is to use the field capabilities of the document itself. For example, Word and Excel documents can have one or more "Summary Information" fields. Similarly, HTML files can have META tags, WordPerfect files can have Document Summary fields, and PDF files support several predefined fields. dtSearch can recognize these fields and will automatically index them along with the document text. For more information on automatically-detected fields, see "What file formats does dtSearch support?"

Alternatively, the field information can be supplied to the dtSearch Engine during indexing using one of the "data source" APIs.

C#

In the dtSearch Engine's .NET API you can use the DataSourceToIndex property of an IndexJob to specify an object that will supply documents to be indexed. For each document, your data source object supplies a DocName, DocText (plain text), DocFields (any number of tab-delimited field-value pairs), a DocModifiedDate and a DocCreatedDate. In DocBytes, you can also provide a byte array with the binary contents of the document, which dtSearch will parse just as it would parse a file on disk.

For sample code demonstrating how to index meta-data in a database along with document files, see this sample application, included with the dtSearch Engine:

C:\Program Files\dtSearch Developer\examples\cs4\ado_demo

C++

In C++, you can use the dataSourceToIndex member of the dtsIndexJob structure to specify an object that will supply documents to be indexed. For each document, your data source would supply seek() and read() callback functions providing access to the binary data in the document, along with any number of fields in a null-delimited string set.

For more information, see these topics in the dtSearch Engine help file: "dtsDataSource" and "dtsInputStream."

Java

The Java API to the dtSearch Engine supports a dataSourceToIndex property that has the same interface as the Visual Basic interface, except that the property names are prefaced with "get":

public interface DataSource {
    public boolean getNextDoc();
    public boolean rewind();
    public String getDocText();
    public String getDocFields();
    public String getDocName();
    public String getDocDisplayName();
    public java.util.Calendar getDocModifiedDate();
    public java.util.Calendar getDocCreatedDate();
    public boolean getDocIsFile();
}

An extended version of the API also supports returning BLOB data in a memory buffer (a byte[] array). For more information, see the DataSource2 interface documentation.
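
As an illustration of how a data source can attach custom fields to each document, the following is a minimal sketch of an in-memory implementation of the interface listed above. It is not taken from the dtSearch samples. The DataSource interface here is a local stand-in declared so the example compiles on its own; a real application would implement the interface shipped with the dtSearch Engine. DocRecord, InMemoryDataSource, joinFields and the demo class are hypothetical names, and the sketch assumes that the Java API accepts the same tab-delimited field-value format described for the .NET API.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-in mirroring the interface listed in this article (assumption:
// a real project would use the interface supplied with the dtSearch Engine).
interface DataSource {
    boolean getNextDoc();
    boolean rewind();
    String getDocText();
    String getDocFields();
    String getDocName();
    String getDocDisplayName();
    Calendar getDocModifiedDate();
    Calendar getDocCreatedDate();
    boolean getDocIsFile();
}

// Hypothetical holder for one document and the custom fields to index with it.
final class DocRecord {
    final String name;
    final String text;
    final Map<String, String> fields;
    final Calendar modified;

    DocRecord(String name, String text, Map<String, String> fields, Calendar modified) {
        this.name = name;
        this.text = text;
        this.fields = fields;
        this.modified = modified;
    }
}

final class InMemoryDataSource implements DataSource {
    private final List<DocRecord> docs;
    private int position = -1; // -1 means "before the first document"

    InMemoryDataSource(List<DocRecord> docs) {
        this.docs = new ArrayList<>(docs);
    }

    // Assumed format: name<TAB>value<TAB>name<TAB>value ...
    static String joinFields(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(clean(e.getKey())).append('\t').append(clean(e.getValue()));
        }
        return sb.toString();
    }

    // A tab inside a name or value would break the delimiting, so replace it.
    private static String clean(String s) {
        return s == null ? "" : s.replace('\t', ' ');
    }

    private DocRecord current() { return docs.get(position); }

    // Advance to the next document; false when there are no more.
    public boolean getNextDoc() {
        if (position + 1 >= docs.size()) return false;
        position++;
        return true;
    }

    // Return to the state before the first document.
    public boolean rewind() { position = -1; return true; }

    public String getDocText() { return current().text; }
    public String getDocFields() { return joinFields(current().fields); }
    public String getDocName() { return current().name; }
    public String getDocDisplayName() { return current().name; }
    public Calendar getDocModifiedDate() { return current().modified; }
    public Calendar getDocCreatedDate() { return current().modified; }
    // The text is supplied directly here rather than read from a file on disk.
    public boolean getDocIsFile() { return false; }
}

final class InMemoryDataSourceDemo {
    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Author", "J. Smith");
        fields.put("Department", "Legal");
        List<DocRecord> docs = new ArrayList<>();
        docs.add(new DocRecord("doc-1", "Contract text", fields, Calendar.getInstance()));

        DataSource ds = new InMemoryDataSource(docs);
        while (ds.getNextDoc()) {
            System.out.println(ds.getDocName() + ": " + ds.getDocFields().replace("\t", " | "));
        }
    }
}
```

In this sketch the field values come from a map, but they could equally be read from a database row for each document. The object would then be assigned to the dataSourceToIndex property of the index job, as described above.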

Additional Information

Some or all of the fields added through the data source APIs can be designated as "stored" fields. A stored field is a field that becomes part of the document properties that are returned in search results. For example, a document manager may add a stored row ID field to each file that is indexed so that when a file is returned in search results, the row ID field will be easily accessible.

For more information on stored fields, see "How to get field data in search results."

Related Topics

"How to index databases with the dtSearch Engine"

"Indexing Databases" in the dtSearch Engine help file.

"Database and Field Searching" in the dtSearch Engine help file.