How to index databases with the dtSearch Engine

Article: dts0111

Applies to: dtSearch Engine

1. Indexing databases (SQL, NoSQL, etc.) and other data sources

a. Sample code and documentation

b. Incremental updates

c. Binary documents in fields (BLOB data)

d. Selecting which columns and fields to index

e. Highlighting hits

2. Indexing desktop databases

1. Indexing databases and other data sources

Sample code and documentation

This dtSearch Engine has an API that can index any data you can access from .NET, Visual Basic, Java, or C++. A sample database indexing application included with the dtSearch Engine uses Active Data Objects (ADO.NET) to index any database that can be accessed through ADO, such as SQL databases. The same approach will work with other types of non-file data, such as disk images, NoSQL data, SharePoint and network data streams.

Source code for the database indexing sample application is installed to these locations:

C:\Program Files\dtSearch Developer\examples\vb.net4\ado_demo (VB.NET)
C:\Program Files\dtSearch Developer\examples\cs4\ado.net (C#)

The API's used to implement this feature are documented here:

.NET DataSource class

.NET Standard/.NET Core DataSource class

C/C++ dtsDataSource

Java DataSource2 class

COM Data Sources

The ado_demo sample application uses the database schema to determine the tables and fields that are present in the database, and then iterates over every row of every table, indexing the entire database. The source code can be modified to exclude certain tables or fields from the index.

The basic mechanism used to implement data source indexing is the same in all of the APIs:

1. The calling application creates a data source object that can iterate over the content to be indexed. The object implements a "GetNextDoc" method that returns the next document to be indexed.

2. The calling application attaches the data source object to an IndexJob, and when the IndexJob is executed, the dtSearch Engine calls the GetNextDoc method repeatedly to get data to be added to the index.

3. For each item returned by GetNextDoc, the data source can return plain text, fielded data, and/or a BLOB containing binary document data (such as a PDF file). The data source can choose whatever name it wants for the row, as long as the name has the form of a valid Win32 filename or URL. The name will be returned as the document filename in search results.

Usually the name includes a row ID for the row, making it easier to retrieve a row from the database using the filename returned in search results.

The DataSource API leaves it up to the calling application to define the field-value pairs that are included in each logical document. Typically a single logical document would correspond to one table row, but the field-value pairs can come from anywhere, so an application could start with the field-value pairs in a particular row and then use some values to retrieve other content from other tables to include in the logical document.

Incremental Updates

An index of a database that was created using the data source API can be updated incrementally. To do this,

1. Set up the IndexJob exactly as it was set up when the index was initially created, but set all of the Action flags (ActionCreate, ActionRemoveDeleted, etc.) to false), except the ActionAdd flag, which must be set to True. If ActionCreate is true, the index will be cleared at the start of the index update.

2. Set up your data source so it will only iterate over database rows that are either (a) new, or (b) have been modified since they were last indexed. If the data source iterates over the whole database, dtSearch will still only index new or modified rows (it checks the modification date and size of each item to see if it has changed before reindexing). However, incremental updates will be much faster if you just pass dtSearch the rows that need to be indexed.

To support incremental updates, it is often helpful to have a boolean "NeedsIndexing" field in the database indicating that a row requires reindexing. The field can be set to true when a row is added or modified, and false when a row is indexed. To perform an incremental update, the data source would select only those rows where NeedsIndexing=true.

To remove deleted rows from an index, set IndexJob.ActionRemoveListed = true and set IndexJob.ToRemoveListName to the name of a text file with a list of the rows to delete. For more information on removing items from an index, see "Removing documents from an index."

Binary documents in fields (BLOB data)

If data in a column consists of binary files such as Word documents or Excel spreadsheets, you can index the contents of these columns along with other fields. The data source API provides a way to combine (field, value) pairs from other columns of the same row with a binary document such as a Word file into a single logical document for indexing and searching purposes. dtSearch will index the document just as it would if found on disk, including any fields in the document itself (such as Document Summary Information fields). For more information on how this is done, see the article, "How to add fields to documents during indexing."

Highlighting hits

To highlight hits in a retrieved document, use the FileConverter object to convert the document to HTML, RTF, or text, with hit highlight markings around the hits. For documentation on FileConverter, see:

Highlighting Hits - Overview [dtSearchApiRef.chm]

When data has been indexed using a data source, there is often no disk file to use as input so the original document must be reconstructed. There are two ways this can be done:
(1) Set up the FileConverter with the same data that was supplied to the IndexJob in the DataSource object. For example, in the .NET API, set FileConverter.InputText, InputFields, and InputBytes to the values returned through the data source as DocText, DocFields, and DocBytes.

(2) You can set up the index so it will cache each document as it is indexed. Then instead of reconstructing the document to highlight hits, you can set FileConverter.Flags = dtsConvertGetFromCache, so the cached copy of the document will be used to highlight hits.

Selecting which tables and columns to index

The ado_demo samples all implement a SampleDataSource class that traverses the entire database, using schema information to get the list of tables and columns in the database. To modify the sample code so it indexes only certain tables or only certain fields, modify the SampleDataSource class, changing the code that gets the list of tables and columns from the schema to instead use a specific list of tables and columns.

When indexing very large tables, your indexing application may run out of memory if you attempt to select the entire table at once. If this occurs, you can modify your data source implementation to select batches of rows (i.e., 1-100,000, 100,001-200,000, etc.) to index.

2. Indexing desktop databases with the dtSearch Engine

dtSearch indexes desktop database formats directly, without the need for OBDC or other drivers. These include: Microsoft Access (*.mdb, *.accdb), XML, CSV (comma-separated values), and XBase (FoxPro, dBASE, and other .DBF-compatible formats).

Additional Information

For more information about indexing databases with dtSearch, please see the following topics:

Field Searching [dtSearchApiRef.chm]

Indexing Databases [dtSearchApiRef.chm]

How to get field data in search results

How to add fields to documents

How to use dtSearch Web with dynamically-generated content