Using search filters to combine a database search with a full-text search

Describes how to use search filters to implement an application that executes a full-text search against a subset of a database, selected by first executing a query against the database.

Remarks

Suppose you have a collection of documents, and each document is associated with a row in a database. Your application requires that you first select a subset of the database, and then execute a full-text search against the documents in the subset.

A first attempt at this often has the program collect a list of document names based on the database query and then execute a full-text search that is limited to the named documents. For example, the application might build an xfilter() expression listing the names of all of the documents matched in the database. The xfilter expression is then passed to dtSearch along with the search request.

This approach works acceptably when the database selection generates a small number of documents. When the selection generates a list of thousands of filenames, however, the xfilter expression becomes huge, and the search either becomes slow or fails because the search request is excessively long.

To implement a more scalable solution that can handle any size database subset, the application should instead build a SearchFilter from the results of the database query, and then limit the search to the contents of the SearchFilter. The most efficient way to select documents in a SearchFilter is to use the documents' "document id" or "DocId" (a unique integer that identifies each document in an index), so this means the application needs a quick way to convert a database selection into a list of DocIds. Usually this is done by adding a column to the database to store the DocId for each document.

Implementation - Indexing

Indexing uses the data source API to index each document in the database. As each document is indexed, the dtSearch Engine returns the DocId assigned to that document in the index through the DataSource.DocId property. The application should store that value in a DocId column in the database, so that each row includes the DocId that dtSearch assigned.

Note: Some file formats, such as ZIP files and databases (CSV, DBF, MDB), are indexed as multiple documents. For example, if a data source passes a ZIP file to the dtSearch Engine for indexing, each item in the ZIP file will be indexed as a separate document. This breaks the one-to-one relationship between database rows and document ids. There are a few options to handle this case:

(1) Store ranges. Document ids for the items in a container file are always sequential, so the database can be modified to store a range of document ids instead of a single id. To implement this, the data source would keep track of the most recently assigned document id and, when a new document id is assigned, would store the range (LastDocId+1, NewDocId) instead of just NewDocId. For single documents, these two values will be the same. For ZIP files, they will span all items in the ZIP.
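The range bookkeeping in option (1) can be sketched as follows. This is a minimal illustration in Python, not dtSearch code: the class name DocIdRangeTracker and the sample DocId values are hypothetical, and the sketch assumes that the value reported for each item is the last DocId assigned to it and that an empty index starts numbering at 1.

```python
class DocIdRangeTracker:
    """Remembers the last DocId seen so each indexed item maps to a (first, last) range."""

    def __init__(self, last_doc_id=0):
        # Assumption: last_doc_id is the highest DocId already in the index
        # (0 for an empty index).
        self.last_doc_id = last_doc_id

    def record(self, new_doc_id):
        """Return the (first, last) DocId range for the item that was just indexed."""
        if new_doc_id <= self.last_doc_id:
            raise ValueError("DocIds are expected to increase")
        doc_range = (self.last_doc_id + 1, new_doc_id)
        self.last_doc_id = new_doc_id
        return doc_range


# Hypothetical DocIds reported after indexing four items;
# the third item stands for a ZIP file containing five documents.
tracker = DocIdRangeTracker()
ranges = [tracker.record(doc_id) for doc_id in (1, 2, 7, 8)]
```

Each returned pair would be written to two columns of the corresponding database row, in place of the single DocId column.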

(2) Add rows. Instead of using DataSource.DocId, the indexing program can use an indexing callback notification such as the IIndexStatusHandler in the .NET API to receive a callback for each document indexed. The indexing program could then add rows to the database as needed for items inside a container format.

(3) Suppress containers. For some file formats, such as .CSV, it may be preferable to index the document as a single file. To do this, the indexer would check the filename extension and modify the data returned to the indexer if the extension indicates a container format. The type of change needed would depend on the format. For CSV files, the filename extension can simply be changed to .TXT. For more complex formats such as ZIP, DBF, and MDB, the application may either skip the file if appropriate, or convert it to a text format before passing it to the indexer.
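The extension check in option (3) might look like the sketch below. The policy table and the function name plan_for are hypothetical. Only the extensions and the two outcomes (rename CSV to .TXT, skip or convert the others) come from the text above, and how the chosen action is applied depends on the data source implementation.

```python
import os

# Hypothetical policy table: the extensions are the container formats named above,
# the action labels are illustrative.
CONTAINER_POLICY = {
    ".csv": "rename_to_txt",
    ".zip": "skip_or_convert",
    ".dbf": "skip_or_convert",
    ".mdb": "skip_or_convert",
}


def plan_for(filename):
    """Return (action, name_to_report) for a file about to be handed to the indexer."""
    root, ext = os.path.splitext(filename)
    action = CONTAINER_POLICY.get(ext.lower(), "index_as_is")
    if action == "rename_to_txt":
        # Reporting a .txt name lets the file be treated as one plain-text document.
        return action, root + ".txt"
    return action, filename
```

For example, plan_for("sales/Q1.CSV") yields the action "rename_to_txt" with the name "sales/Q1.txt", while a file such as "memo.doc" is passed through unchanged.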

Implementation - Searching

1. Select the rows in the database that the search will cover.

2. Create a SearchFilter and call SearchFilter.AddIndex() to associate the SearchFilter with the dtSearch index.

3. For each row in the database that was selected, get the DocId and call SearchFilter.SelectItems() to select it. (Selecting an item in a SearchFilter is a very fast operation so calling this function thousands of times will still be quick.)

4. Create a SearchJob, call SearchJob.AddIndexToSearch() with the path to the index, and call SearchJob.SetFilter() to limit the search to the items in the filter.

5. Set up the other properties of the SearchJob with the search request, any search features required, etc.

6. Execute the search.
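The database side of steps 1 and 3 can be sketched as shown below. This is an illustration under stated assumptions, not the dtSearch API: the table layout (first_doc_id and last_doc_id columns, as in the "store ranges" option), the sample rows, and the FakeFilter class are all hypothetical. In a real application, the loop at the end would call SearchFilter.SelectItems() on a filter prepared with SearchFilter.AddIndex(); check the exact signatures in the API reference. Merging adjacent ranges is an optional refinement added here, since selecting items is already fast.

```python
import sqlite3


def merge_ranges(ranges):
    """Merge overlapping or adjacent (first, last) DocId ranges into sorted runs."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], last)
        else:
            merged.append([first, last])
    return [tuple(r) for r in merged]


class FakeFilter:
    """Stand-in for a search filter; it only records the ranges it is asked to select."""

    def __init__(self):
        self.selected = []

    def select_items(self, first, last):
        self.selected.append((first, last))


# Hypothetical database with the DocId range stored for each row at indexing time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (name TEXT, author TEXT, first_doc_id INTEGER, last_doc_id INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [
        ("a.doc", "smith", 1, 1),
        ("b.doc", "jones", 2, 2),
        ("c.zip", "smith", 3, 7),
        ("d.doc", "smith", 8, 8),
        ("e.doc", "smith", 12, 12),
    ],
)

# Step 1: select the rows that the search will cover.
rows = conn.execute(
    "SELECT first_doc_id, last_doc_id FROM docs WHERE author = ?", ("smith",)
).fetchall()

# Step 3: select each DocId range in the filter.
doc_filter = FakeFilter()
for first, last in merge_ranges(rows):
    doc_filter.select_items(first, last)
```

Because the query returns only integer ranges, the amount of data handed to the filter grows with the number of selected rows but never has to be encoded into the text of the search request.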

Group

Limiting searches with SearchFilters