Indexing Databases

Indexing SQL databases and other non-file data sources.

Remarks

Internally parsed database formats

Databases in the following formats can be indexed directly: XBase (*.dbf), Microsoft Access (*.accdb, *.mdb), Comma-separated values (*.csv), XML. dtSearch automatically detects these formats and and will index files in these formats as databases without the need to use the data source indexing API described in this article.

Sample code and documentation

This dtSearch Engine has an API that can index any data you can access from .NET, Visual Basic, Java, or C++. The API's used to implement this feature are:

Language/Environment	API to index databases
C/C++	dtsDataSource
C++ Support Classes	DDataSourceBase
.NET (C#, VB.NET)	IndexJob.DataSourceToIndex (dtSearch.Engine.DataSource class)
Java	com.dtsearch.engine.DataSource and com.dtsearch.engine.DataSource2 classes
COM (Visual Basic, ASP)	IndexJob.DataSourceToIndex

The basic mechanism used to implement data source indexing is the same in all of the APIs:

(1) A data source object is created that iterates over the content to be indexed. The object implements a "GetNextDoc" method that returns the next document to be indexed.

(2) This object is attached to the IndexJob, and when the IndexJob is executed, the dtSearch Engine calls the GetNextDoc method repeatedly to get data to be added to the index.

(3) For each item returned by GetNextDoc, the data source can return any combination of plain text, fielded data, or binary document data such as a PDF file. The data source can choose any name for the row, as long as the name has the form of a valid Win32 filename. The name will be returned as the document filename in search results. To avoid confusion with relative paths, the name should either have the form of a complete path (example: x:\record#123) or a URL (example: db://record#123)

Usually the name includes a row ID for the row, to make it easier to retrieve a row from the database using the filename returned in search results.

ado_demo sample

A sample database indexing application included with the dtSearch Engine uses Active Data Objects (ADO.NET) to index any database that can be accessed through ADO. The same approach will work with other types of non-file data. Source code for the database indexing sample application is installed to these locations:

C:\Program Files\dtSearch Developer\examples\vb.net4\ado_demo (VB.NET)

C:\Program Files\dtSearch Developer\examples\cs4\ado_demo (C#)

C:\Program Files\dtSearch Developer\examples\vbasic\ado_demo (Visual Basic 6)

The ado_demo sample application uses the database schema to determine the tables and fields that are present in the database, and then iterates over every row of every table, indexing the entire database. The source code can be modified to exclude certain tables or fields from the index.

The ado_demo samples all implement a SampleDataSource class that traverses the entire database, using schema information to get the list of tables and columns in the database. To modify the sample code so it indexes only certain tables or only certain fields, modify the SampleDataSource class, changing the code that gets the list of tables and columns from the schema to instead use a specific list of tables and columns.

Incremental Updates

An index of a database that was created using the data source API can be updated incrementally. To do this,

(1) Set up the IndexJob exactly as it was set up when the index was initially created, but set all of the Action flags (ActionCreate, ActionRemoveDeleted, etc.) to false, except the ActionAdd flag, which must be set to true. If ActionCreate is true, the index will be cleared at the start of the index update.

(2) Set up your data source so it will only iterate over database rows that are either (a) new, or (b) have been modified since they were last indexed. If the data source iterates over the whole database, dtSearch will still only index new or modified rows (it checks the modification date and size of each item to see if it has changed before reindexing). However, incremental updates will be much faster if you just pass dtSearch the rows that need to be indexed.

To support incremental updates, it is often helpful to have a boolean "NeedsIndexing" field in the database indicating that a row requires reindexing. The field can be set to true when a row is added or modified, and false when a row is indexed. To perform an incremental update, the data source would select only those rows where NeedsIndexing=true.

Binary documents in fields (BLOB data)

If data in a column consists of binary files such as Word documents or Excel spreadsheets, you can index the contents of these columns along with other fields. dtSearch will index the document just as it would if found on disk, including any fields in the document itself (such as the document summary information fields in Word documents). This approach can also be used to associate meta-data in a database with documents in disk folders.

Language/Environment	API to supply BLOB data during indexing
C/C++	Use the dtsInputStream returned from the data source to return the binary data, and return the fields as a null-delimited string set in dtsInputStream.fields
C++ Support Classes	Use the dtsInputStream returned from the data source to return the binary data, and return the fields as a null-delimited string set in dtsInputStream.fields
.NET (C#, VB.NET)	Set DataSource.HaveDocBytes to true and return the binary data in DataSource.DocBytes
Java	Using the DataSource2 interface, set haveDocBytes to true and call setDocBytes to store a byte array with the binary data.
COM (Visual Basic, ASP)	In the COM interface, set the DocIsFile property for the data source to true, and, for each document, set the DocName property to the name of a disk file containing the document. (There is no way to return binary documents in memory using the COM interface).

Highlighting hits

When data has been indexed using a data source, there is often no disk file to use as input, so the original document must be reconstructed. There are two ways this can be done:

(1) Set up the FileConverter with the same data that was supplied to the IndexJob in the DataSource object. For example, in the .NET API, set FileConverter.InputText, InputFields, and InputBytes to the values returned through the data source as DocText, DocFields, and DocBytes.

(2) Set up the index so it will cache each document as it is indexed. Then, instead of reconstructing the document to highlight hits, set FileConverter.Flags = dtsConvertGetFromCache, so the cached copy of the document will be used to highlight hits. See Caching documents for more information on caching documents in indexes.

Integrating database searches with full-text searches

For information on integrating database searches with full-text searches, see Limiting searches with SearchFilters.

Group

Databases and Fields