dtSearch Text Retrieval Engine Programmer's Reference
dtsDataSource Structure

Used by dtsIndexJob to index non-file data that is passed to the indexer through callbacks.

File: dtsviewr.h

    struct dtsDataSource {
        void * pData;
        int (* rewind)(void *pData);
        int (* getNextDoc)(void *pData, dtsInputStream& dest);
        dtsIndexFileInfo * pFileInfo;
        int (* getNextDocInfo)(void *pData, dtsDataSourceFileInfo& fileInfo);
        int (* getCurrentDoc)(void *pData, dtsInputStream& dest);
        int (* getDocInfoByName)(void *pData, const char *docName, const char *userFields, dtsDataSourceFileInfo& fi);
    };

The dtsIndexJob structure provides two ways to specify the data to index: by files (the toAdd member) and by data source (the dataSourceToIndex member). Most commonly, the text exists in disk files, in which case you specify the directories and files to be indexed using the toAdd member. In some situations, however, the text to be indexed may not be readily available as disk files. For example, the text may exist as records in a remote SQL database. You could copy the text from the database to local disk files and index those files, but the dtsDataSource API provides a more direct and efficient solution. To supply this text to the dtSearch indexing engine, you create an object that accesses the text and then attach a dtsDataSource describing the object as the dataSourceToIndex member of a dtsIndexJob.

Basic Data Source Implementation

A dtsDataSource is a structure that provides access to any source of text data divided into logical documents. It consists of a set of function pointers for functions to retrieve documents and to iterate over the data to be indexed. The simplest possible data source would implement two function pointers: rewind, to initialize the data source, and getNextDoc, to get the next document to index. A dtsDataSource also contains a pData pointer that will be passed to the rewind() and getNextDoc() functions. To create a data source, you would do the following:

  1. Create an object that accesses the data source.
  2. Set the pData member of a dtsDataSource to point to this object.
  3. Make callback functions that convert the pData pointer to the object type and then call the appropriate member function, and pass those callback functions to the dtSearch engine in the dtsDataSource.

Your data source object (the CMyDataSource object in the example below) will return logical documents using another structure, a dtsInputStream. For each document, dtsInputStream provides its filename (any legal Windows filename), creation and modification dates, and size as it will be recorded in the index. Like dtsDataSource, dtsInputStream relies on callback functions -- seek() and read() -- to provide access to the data to be indexed. 

dtSearch indexes the dtsInputStream objects created by the dtsDataSource by calling their seek() and read() members to get text data. After a dtsInputStream has been indexed, dtSearch will destroy it through a call to its release() member. dtSearch does not delete the dtsDataSource object so the caller is responsible for disposing of it when release() is called. 
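The seek(), read(), and release() callback pattern described above can be sketched in isolation. The following is a minimal, self-contained illustration, not code from the dtSearch headers: SketchInputStream is a hypothetical stand-in for dtsInputStream, and its member list and callback signatures are assumptions. MemoryDoc and readAndRelease are likewise hypothetical names. The point is the ownership rule: the object behind pData is allocated when the stream is handed out and freed in the release callback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the stream structure. This is NOT the real
// dtsInputStream declaration; the members and signatures are assumptions.
struct SketchInputStream {
    void *pData;
    void (*seek)(void *pData, long where);
    long (*read)(void *pData, void *dest, long bytes);
    void (*release)(void *pData);
    const char *filename;
    long size;
};

// Hypothetical in-memory document that owns its name and text.
class MemoryDoc {
public:
    static int liveCount;  // number of MemoryDoc objects currently alive

    // Allocate a document on the heap and describe it in `stream`.
    static void open(const std::string& name, const std::string& text,
                     SketchInputStream& stream) {
        MemoryDoc *doc = new MemoryDoc(name, text);
        stream.pData = doc;
        stream.seek = seekCallback;
        stream.read = readCallback;
        stream.release = releaseCallback;
        stream.filename = doc->m_name.c_str();
        stream.size = static_cast<long>(doc->m_text.size());
    }

private:
    MemoryDoc(const std::string& name, const std::string& text)
        : m_name(name), m_text(text), m_pos(0) { ++liveCount; }
    ~MemoryDoc() { --liveCount; }

    // Each callback converts pData back to the object type.
    static void seekCallback(void *pData, long where) {
        assert(pData != nullptr);
        MemoryDoc *doc = static_cast<MemoryDoc *>(pData);
        long len = static_cast<long>(doc->m_text.size());
        doc->m_pos = std::max(0L, std::min(where, len));
    }
    static long readCallback(void *pData, void *dest, long bytes) {
        assert(pData != nullptr);
        MemoryDoc *doc = static_cast<MemoryDoc *>(pData);
        long remaining = static_cast<long>(doc->m_text.size()) - doc->m_pos;
        long n = std::max(0L, std::min(bytes, remaining));
        std::memcpy(dest, doc->m_text.data() + doc->m_pos,
                    static_cast<std::size_t>(n));
        doc->m_pos += n;
        return n;  // number of bytes actually copied
    }
    static void releaseCallback(void *pData) {
        delete static_cast<MemoryDoc *>(pData);  // frees the per-document object
    }

    std::string m_name;
    std::string m_text;
    long m_pos;
};
int MemoryDoc::liveCount = 0;

// Models a consumer of the stream: seek to the start, read in small
// chunks until read() returns 0, then release the stream.
std::string readAndRelease(SketchInputStream& stream) {
    std::string out;
    char buf[4];
    stream.seek(stream.pData, 0);
    long n;
    while ((n = stream.read(stream.pData, buf, static_cast<long>(sizeof buf))) > 0)
        out.append(buf, static_cast<std::size_t>(n));
    stream.release(stream.pData);
    stream.pData = nullptr;     // the object is gone after release()
    stream.filename = nullptr;  // it pointed into the released object
    return out;
}
```

In this sketch the filename pointer refers to storage owned by the document object, so it stays valid exactly as long as the stream does. A real implementation would need to follow whatever lifetime rules the dtsInputStream documentation specifies.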

To avoid requiring an unnecessary initial pass through the input data before indexing, dtsDataSource knows nothing about the total size of the data to be indexed or the number of documents to be indexed. As a result, the dtSearch Engine will not be able to report the percentage completion of an indexing job involving a dtsDataSource. 

For an example demonstrating use of the dtsDataSource API, see dsource.cpp.

Efficient Incremental Updates

Once a data source has been indexed, subsequent updates will be faster if the data source can quickly identify which documents have been modified since the last update, and which documents no longer exist. To make incremental updates more efficient, a data source can implement three additional functions: getNextDocInfo, getCurrentDoc, and getDocInfoByName.

getNextDocInfo and getCurrentDoc allow dtSearch to separate getNextDoc into two calls, one to get the document properties, and a second to get the document itself. The advantage of doing this is that dtSearch can skip the second step for documents that have already been indexed. getNextDocInfo returns information about the next document, most importantly the name and modification date. dtSearch uses this to determine whether the document needs to be indexed, by comparing the name and modification date with what is already in the index. If the document is not in the index, or an older version is in the index, dtSearch will request the document contents in a call to getCurrentDoc. If the document does not have to be indexed, dtSearch will skip the document and call getNextDocInfo again for the next document's name and modification date. When a getNextDocInfo pointer is provided, getNextDoc will never be called, and dtSearch will rely only on getNextDocInfo and getCurrentDoc.
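The two-step protocol can be modeled with a short, self-contained sketch. Everything here is hypothetical and simplified for illustration: StepDocInfo stands in for dtsDataSourceFileInfo, StepSource for a data source object, the modification date is a plain integer, and the return convention (0 for success, -1 when no more documents are available) is an assumption. updateIndex models the decision described above; it is not the engine's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the document properties structure.
struct StepDocInfo {
    std::string name;
    long modified;  // simplified modification date (assumption)
};

// Hypothetical data source exposing the two-step interface.
class StepSource {
public:
    explicit StepSource(const std::vector<StepDocInfo>& docs)
        : fetchCount(0), m_docs(docs), m_next(0), m_current(0) {}

    int rewind() { m_next = 0; return 0; }

    // Cheap step: report properties only. Returns -1 when exhausted.
    int getNextDocInfo(StepDocInfo& info) {
        if (m_next >= m_docs.size()) return -1;
        m_current = m_next++;
        info = m_docs[m_current];
        return 0;
    }

    // Expensive step: fetch the contents of the document most recently
    // reported by getNextDocInfo.
    int getCurrentDoc(std::string& contents) {
        assert(m_current < m_docs.size());
        ++fetchCount;
        contents = "contents of " + m_docs[m_current].name;
        return 0;
    }

    int fetchCount;  // how many times contents were actually fetched

private:
    std::vector<StepDocInfo> m_docs;
    std::size_t m_next;
    std::size_t m_current;
};

// Models the update pass: `index` maps document name to the modification
// date recorded at the last update. Contents are requested only for
// documents that are missing from the index or older in the index.
std::vector<std::string> updateIndex(StepSource& src,
                                     std::map<std::string, long>& index) {
    std::vector<std::string> reindexed;
    src.rewind();
    StepDocInfo info = StepDocInfo();
    while (src.getNextDocInfo(info) == 0) {
        std::map<std::string, long>::iterator it = index.find(info.name);
        bool needsIndexing = (it == index.end()) || (it->second < info.modified);
        if (!needsIndexing)
            continue;  // skip the expensive fetch entirely
        std::string contents;
        if (src.getCurrentDoc(contents) == 0) {
            index[info.name] = info.modified;
            reindexed.push_back(info.name);
        }
    }
    return reindexed;
}
```

With three documents in the source and two already in the index, one of them stale, only the stale and the new document trigger a getCurrentDoc call. That is the saving the split interface is meant to provide when document contents are expensive to retrieve.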

getDocInfoByName lets dtSearch check whether a particular document still exists, for purposes of implementing the "Remove Deleted" step in an index update.
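As a rough model of how a by-name lookup supports removing deleted documents, consider the self-contained sketch below. It is an illustration under stated assumptions, not the engine's implementation: LookupDocInfo and LookupSource are hypothetical stand-ins, the userFields parameter of the real callback is omitted, and the return convention (0 if the document exists, -1 if it does not) is assumed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the document properties structure.
struct LookupDocInfo {
    std::string name;
    long modified;  // simplified modification date (assumption)
};

// Hypothetical data source that can answer "does this document exist?"
// without iterating over every document.
class LookupSource {
public:
    explicit LookupSource(const std::map<std::string, long>& docs)
        : m_docs(docs) {}

    // Assumed convention: 0 and a filled-in `info` if the document exists,
    // -1 if it no longer exists.
    int getDocInfoByName(const char *docName, LookupDocInfo& info) const {
        assert(docName != nullptr);
        std::map<std::string, long>::const_iterator it = m_docs.find(docName);
        if (it == m_docs.end())
            return -1;
        info.name = it->first;
        info.modified = it->second;
        return 0;
    }

private:
    std::map<std::string, long> m_docs;
};

// Models a "Remove Deleted" pass: every name recorded in the index is
// looked up in the source, and entries that no longer exist are dropped.
std::vector<std::string> removeDeleted(const LookupSource& src,
                                       std::map<std::string, long>& index) {
    std::vector<std::string> removed;
    std::map<std::string, long>::iterator it = index.begin();
    while (it != index.end()) {
        LookupDocInfo info = LookupDocInfo();
        if (src.getDocInfoByName(it->first.c_str(), info) != 0) {
            removed.push_back(it->first);
            it = index.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
    return removed;
}
```

A keyed lookup of this kind is what makes the check cheap. Without it, deciding whether a document still exists would require enumerating the whole source.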

    // Assume that CMyDataSource is an object with rewind()
    // and getNextDoc() members that access text to be indexed.

    // The dtSearch Engine will call the rewind function to initialize
    // the data source. Just pass the call through to the object.
    static int rewindCallback(void *pData)
    {
        CMyDataSource *s = (CMyDataSource *) pData;
        return s->rewind();
    }

    // The dtSearch Engine will call getNextDoc to get documents
    // to be indexed. Again, just pass the call through to the
    // CMyDataSource object.
    static int getNextDocCallback(void *pData, dtsInputStream& doc)
    {
        CMyDataSource *s = (CMyDataSource *) pData;
        return s->getNextDoc(doc);
    }

    void BuildIndex()
    {
        dtsIndexJob indexJob;
        // ... set up the index path, name, etc. ...

        // Attach data source to be indexed
        CMyDataSource myData;
        dtsDataSource ds;
        ds.pData = &myData;
        ds.rewind = rewindCallback;
        ds.getNextDoc = getNextDocCallback;
        indexJob.dataSourceToIndex = &ds;

        // Build the index; result receives the outcome of the job
        short result = 0;
        dtssDoIndexJob(indexJob, result);
        // ...
    }