Optimizing search performance with the dtSearch Engine

Article: dts0153

 

Applies to: dtSearch Engine 

This article describes programming techniques to maximize searching performance with the dtSearch Engine.

How dtSearch executes a search

Each search of an index involves two steps: (1) word lookup, and (2) enumeration of the documents that satisfy the search request. The first step generally takes a small fraction of a second, even for very large indexes, and is usually not a significant part of the total time required for the search. The time required for the second step is proportional to the number of documents retrieved. For example, the second step will take about ten times longer for a search that finds 1000 files than for a search that finds 100 files.

The key to fast searching is, therefore, minimizing the time required for the second step, enumerating the files that match the search request.

Hardware and memory use

The most important factor affecting search time is usually disk read performance, so storing the index on an SSD is recommended.

Memory requirements depend on the number of documents stored in SearchResults, the number of hits, and the amount of metadata associated with each document. Each item in search results requires about 1 KB of memory for the filename, other metadata, and hits. More memory may be required if the number of hits is very large (each hit offset is stored as a 32-bit integer) or if metadata has been added to the index as stored fields.
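The estimate above can be turned into a rough back-of-the-envelope calculation. The following Python sketch is not part of the dtSearch API; the function name and parameters are hypothetical, and it assumes that hit storage and stored-field data can simply be added to the baseline of about 1 KB per item.

```python
BYTES_PER_ITEM = 1024  # "about 1k" per search results item, from the text above
BYTES_PER_HIT = 4      # each hit offset is stored as a 32-bit integer

def estimate_results_memory(item_count, avg_hits_per_item=0,
                            stored_field_bytes_per_item=0):
    """Return a rough estimate, in bytes, of memory used by search results.

    Assumption: hits and stored fields add linearly to the per-item baseline.
    """
    per_item = (BYTES_PER_ITEM
                + avg_hits_per_item * BYTES_PER_HIT
                + stored_field_bytes_per_item)
    return item_count * per_item
```

Under these assumptions, 100 items with few hits need roughly 100 KB, while 10,000 items averaging 500 hits each need roughly 30 MB.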

Threading

The dtSearch Engine is designed to operate across the web in a completely "stateless" manner, meaning that no information about a user is retained between requests. This makes adding capacity to a site easier because you can simply add more servers, without the need to tie each user session to a particular server. Therefore, the recommended way to add capacity to a dtSearch Engine search site is to clone the site on multiple, separate servers, and to use web server load-balancing software to allocate user requests among the servers.

The dtSearch Engine uses an efficient threading model so that it can handle multiple concurrent queries simultaneously, and can make efficient use of multiple processors where present. There is no built-in limit on the number of concurrent users the dtSearch Engine can handle. Searching is done without any need for file or record locks, so aside from the need to share CPU and other hardware resources, one search user has no effect on another concurrent search user.

Minimize the Number of Documents Retrieved

The MaxFilesToRetrieve value limits the number of items returned in search results to the best-matching documents. The search is fully executed, so the search job will return the correct value for the total number of files retrieved. For example, if you do a search with a MaxFilesToRetrieve value of 100 and the search finds 1400 documents, the search results object will contain 100 items (the best-matching 100 files found in the search) and the search job will indicate that a total of 1400 documents were found.

The "best matching" documents are selected on the basis of relevance. 
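To illustrate the idea of keeping only the best-matching items while still counting every match, here is a minimal Python sketch. It is a conceptual model only, not how dtSearch is implemented: the function name, the (name, score) pairs, and the tie-breaking rule are all assumptions made for the example.

```python
import heapq

def search_top_k(matches, max_files_to_retrieve):
    """Keep only the highest-scoring matches while counting all of them.

    matches: iterable of (doc_name, relevance_score) pairs.
    Returns (best_matching_names, total_count).
    """
    heap = []   # min-heap; the weakest retained match sits at heap[0]
    total = 0
    for order, (name, score) in enumerate(matches):
        total += 1
        # -order makes earlier documents win ties on score.
        entry = (score, -order, name)
        if len(heap) < max_files_to_retrieve:
            heapq.heappush(heap, entry)
        elif heap and entry > heap[0]:
            heapq.heappushpop(heap, entry)
    best = [name for _, _, name in sorted(heap, reverse=True)]
    return best, total
```

Memory use stays bounded by max_files_to_retrieve regardless of how many documents match, which mirrors the 100-of-1400 example above.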

MaxFilesToRetrieve can be used in combination with the dtsSearchDelayDocInfo flag to greatly improve search performance for long searches.

Reconsider Application Requirements that Mandate Huge Results Sets

End-users will not generally read more than a few dozen items from a search results list. In nearly all cases where an application seems to require very large search results sets of 10,000 or more documents, the huge results lists are a means to some other end that can be accomplished more efficiently. Two common examples of this are: integration of database searches with full-text searches, and applications where users require complete search results.

Integration with Database Searches

If your application requires very large results sets because post-search filtering has to be done to identify relevant documents (for example, to limit the documents returned to those that satisfy criteria from a separate database search), consider using a SearchFilter object to apply the limitation before the dtSearch search, so dtSearch will return only the documents that satisfy all relevancy criteria.  For more information on using SearchFilters to combine a database search with a full-text search, see:
Using search filters to combine a database search with a full-text search
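The difference between filtering after the search and filtering before it can be sketched in Python as follows. This is a conceptual model, not the SearchFilter API: documents are represented as hypothetical (doc_id, score) pairs, and the database result is a plain set of allowed document ids.

```python
def top_k(scored_docs, k):
    """Return the k highest-scoring (doc_id, score) pairs."""
    return sorted(scored_docs, key=lambda d: d[1], reverse=True)[:k]

def post_filter(scored_docs, allowed_ids, k):
    """Retrieve the best k first, then discard documents the database rejected."""
    return [d for d in top_k(scored_docs, k) if d[0] in allowed_ids]

def pre_filter(scored_docs, allowed_ids, k):
    """Restrict the search to allowed documents, then take the best k."""
    return top_k([d for d in scored_docs if d[0] in allowed_ids], k)

# Hypothetical data: ten documents, four of which satisfy the database query.
scored = [(doc_id, 100 - doc_id) for doc_id in range(10)]
allowed = {1, 5, 8, 9}

after = post_filter(scored, allowed, 3)   # only one of the top 3 survives
before = pre_filter(scored, allowed, 3)   # three relevant documents
```

With post-search filtering, the only way to be sure of getting three relevant documents is to retrieve a much larger results set first, which is exactly the cost the filter approach avoids.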

Paging through Search Results

If your application requires very large results sets because users must have access to complete results, consider using a paging approach in your user interface.  For example, in a web search application, limit the search to the best-matching 100 documents and include a "Next Page" link that re-invokes the search, this time requesting the best-matching 200 documents (and reporting only the last 100 to populate the second page of items).  The interface can also provide a link or button to "Show all files" that users can click if they really want to see everything in the search results list.  Making this optional instead of the default will make searching faster for the vast majority of users who just want to see the best-matching files.
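The paging arithmetic described above can be sketched as follows. The helper and the run_search callable are hypothetical stand-ins: run_search(max_files) is assumed to return the best-matching documents in rank order, at most max_files of them.

```python
def page_of_results(run_search, page_number, page_size=100):
    """Return the items for one page by re-running the search.

    Page 1 requests the best 100, page 2 requests the best 200 and
    reports only the last 100, and so on.
    """
    max_files = page_number * page_size
    results = run_search(max_files)
    first = (page_number - 1) * page_size
    return results[first:max_files]
```

Each page costs one search whose retrieval limit grows with the page number, so the early pages, which most users never go beyond, stay fast.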

API to use MaxFilesToRetrieve:
.NET: Set SearchJob.MaxFilesToRetrieve
C/C++: Call DSearchJob.SetMaxFiles
Java: Call SearchJob.setMaxFilesToRetrieve

The DelayDocInfo flag

The dtsSearchDelayDocInfo search flag optimizes a search by waiting until document records are requested through a SearchResults object before reading them from the index. 

For example, suppose you execute a search with a MaxFilesToRetrieve value of 100 (so only the best-matching 100 documents will be returned), and the search retrieves 2000 documents. Using the dtsSearchDelayDocInfo flag tells dtSearch to ignore the document information for the 1900 documents that did not make it into the top 100 and only to read the information for the best-matching 100 files. 

Using the dtsSearchDelayDocInfo flag can improve the performance of long searches by a factor of 5 to 10 or more. 
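The saving comes from skipping document record reads for items that never reach the results. The following Python sketch models this with a stand-in index that counts reads. All names are hypothetical, and the model reads the delayed records when the search returns rather than when they are requested through SearchResults, which is a simplification.

```python
import heapq

class FakeIndex:
    """Stand-in for an index; counts how many document records are read."""
    def __init__(self):
        self.record_reads = 0

    def read_record(self, doc_id):
        self.record_reads += 1
        return {"id": doc_id, "name": "file%d.txt" % doc_id}

def search(index, matches, max_files, delay_doc_info):
    """matches: list of (doc_id, score). Returns records for the best max_files."""
    records = {}
    if not delay_doc_info:
        # Eager: read every matching document's record as it is found.
        for doc_id, _ in matches:
            records[doc_id] = index.read_record(doc_id)
    best = heapq.nlargest(max_files, matches, key=lambda m: m[1])
    if delay_doc_info:
        # Delayed: read records only for documents that made the cut.
        for doc_id, _ in best:
            records[doc_id] = index.read_record(doc_id)
    return [records[doc_id] for doc_id, _ in best]
```

For 2000 matches and a limit of 100, the eager path performs 2000 record reads and the delayed path performs 100, while both return the same items.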

Search callback notifications 

During a search, your program can receive callback notifications as each file is retrieved. If the dtsSearchDelayDocInfo flag is set, you can still receive these notifications, but the filenames will be blank.

API to use DelayDocInfo:
.NET: Set the SearchJob.DelayDocInfo flag to True
C/C++: Set the dtsSearchDelayDocInfo bit in dtsSearchJob.searchFlags
C++ Support Classes: Call DSearchJob's SetDelayDocInfo method
Java: Call SearchJob.setSearchFlags(Constants.dtsSearchDelayDocInfo)

TimeoutSeconds and AutoStopLimit

The MaxFilesToRetrieve setting and the DelayDocInfo flag can considerably reduce the time required for each item retrieved in a search, but search time will still be proportional to the number of documents retrieved. A search that finds 1000 documents will still be very fast, but a search that retrieves a sufficiently large number of documents can take a long time.

To prevent searches that find hundreds of thousands or millions of documents from consuming an excessive amount of resources on a server, you can set two limits in a search job that will halt the search unconditionally.

First, TimeoutSeconds will halt a search that takes more than a certain number of seconds. For example, if you set TimeoutSeconds = 5, any search that is still running after five seconds will be halted.

Second, AutoStopLimit will halt a search after a specified number of documents have been retrieved. For example, if you set AutoStopLimit to 1000, the search will automatically halt after 1000 documents have been found.

AutoStopLimit vs. MaxFilesToRetrieve

AutoStopLimit tells dtSearch to stop searching after a certain number of files have been found, while MaxFilesToRetrieve tells dtSearch how many files to return in search results. For example, if AutoStopLimit is 500 and MaxFilesToRetrieve is 100, then the search will return the best-matching 100 files out of the first 500 retrieved.
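The interaction between the two limits can be sketched in Python as follows. This is a conceptual model, not dtSearch code: the function name and the (doc_id, score) pairs are hypothetical, and the limit is assumed to be a positive number.

```python
import heapq

def search_with_limits(match_stream, auto_stop_limit, max_files_to_retrieve):
    """Stop after auto_stop_limit matches, then return the best of those.

    match_stream: iterable of (doc_id, score) pairs in the order found.
    Returns (best_matches, found_count, stopped_early).
    """
    found = []
    stopped_early = False
    for match in match_stream:
        found.append(match)
        if len(found) >= auto_stop_limit:
            # Halt the search; later matches are never examined.
            stopped_early = True
            break
    best = heapq.nlargest(max_files_to_retrieve, found, key=lambda m: m[1])
    return best, len(found), stopped_early
```

Note that the items returned are the best of the first matches found, not the best in the whole index, so a search halted this way may miss higher-ranking documents that would have been found later.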

How to tell if a limit was reached

The error handler for a search job will tell you if the search job was terminated because of one of these limits. If the TimeoutSeconds value was reached, it will contain dtsErTimeout (13). If the AutoStopLimit was reached, it will contain dtsErSearchLimitReached (120).

API to use TimeoutSeconds:
.NET: Set SearchJob.TimeoutSeconds
C/C++: Set dtsSearchJob.timeoutSeconds
C++ Support Classes: Call DSearchJob.SetTimeoutSeconds
Java: Call SearchJob.setTimeoutSeconds

API to use AutoStopLimit:
.NET: Set SearchJob.AutoStopLimit
C/C++: Set dtsSearchJob.autoStopLimit
C++ Support Classes: Call DSearchJob.SetAutoStopLimit
Java: Call SearchJob.setAutoStopLimit

Efficient Queries

dtSearch is designed to perform fast searches for keywords in documents. In building applications with dtSearch, developers often want to use the search indexes to perform other, more database-like functions, such as searching for ranges of values. While this is possible with dtSearch, it is important to understand the trade-offs of various query types to make efficient use of dtSearch search functions.

(1) Minimize the number of "words" matched by your query.

Consider a document manager that uses a document "ID" consisting of a two-letter state code and a number, such as ID_NY01234 or ID_CA54312. To enable users to search by state, the document manager would add "and ID_NY*" or "and ID_CA*" to the end of the query, so that only documents matching the requested state prefix would be returned.

The problem with this approach is that it generates a separate word match for each document in the state requested. If there are 20,000 New York documents, the search involves 20,000 words in addition to the user's search request.

A better approach is to design identifiers to support the types of searches you expect to perform. In this example, adding a space before the number in the document ID would dramatically improve the efficiency of the queries needed to search by state. The improved document IDs would look like "ID_NY 01234" and "ID_CA 54312". Making the state and the number separate words makes it possible to search for "ID_NY" (one word) instead of "ID_NY*" (potentially thousands of words) when looking for New York documents.
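The effect of the two identifier designs on the number of words matched can be demonstrated with a small Python model. It is only an illustration: the index is represented as a plain set of unique words, and the helper is hypothetical rather than a dtSearch function.

```python
def words_matched(index_words, term):
    """Return the indexed words that a query term expands to.

    A trailing '*' matches any word that starts with the prefix;
    otherwise the term must match a word exactly.
    """
    if term.endswith("*"):
        prefix = term[:-1]
        return [w for w in index_words if w.startswith(prefix)]
    return [w for w in index_words if w == term]

# Combined IDs: every document contributes its own unique word.
combined = {"ID_NY%05d" % n for n in range(20000)}

# Split IDs: the state code is one shared word; numbers are separate words.
split = {"ID_NY"} | {"%05d" % n for n in range(20000)}

wildcard_count = len(words_matched(combined, "ID_NY*"))
exact_count = len(words_matched(split, "ID_NY"))
```

In this model the wildcard query expands to 20,000 words while the redesigned query matches a single word, which is the difference the paragraph above describes.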

(2) Match the precision of values to what you are searching for.

Suppose a date/time field is usually used to search for date ranges.  If the field contains a value that includes the time with one-second precision, then a date range covering 10 days might match tens of thousands of unique words.  If the value has only one-day precision, then a date range search covering 10 days would only match 10 unique words, making the search much more efficient.  For situations where varying levels of precision may be needed, you can include multiple values.  For example, instead of a single date/time field, you might have two fields with different levels of precision:

<date>20081231</date>

<datetime>20081231093045</datetime>
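To see how precision affects the number of unique words in a range, the following Python sketch counts the distinct values produced by each format over a 10-day period. The data set (one document every 30 seconds) and the helper name are assumptions made for the example.

```python
from datetime import datetime, timedelta

def unique_words_in_range(timestamps, start, end, fmt):
    """Count distinct formatted values (words) for timestamps in [start, end)."""
    return len({t.strftime(fmt) for t in timestamps if start <= t < end})

# Hypothetical data: one document every 30 seconds for 10 days.
start = datetime(2008, 12, 22)
end = start + timedelta(days=10)
timestamps = [start + timedelta(seconds=30 * i) for i in range(10 * 24 * 120)]

# One-day precision, as in <date>20081231</date>
day_words = unique_words_in_range(timestamps, start, end, "%Y%m%d")

# One-second precision, as in <datetime>20081231093045</datetime>
second_words = unique_words_in_range(timestamps, start, end, "%Y%m%d%H%M%S")
```

For this data the day-precision field yields 10 unique words and the second-precision field yields 28,800, so a date range search is far cheaper against the coarser field.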

(3) Use File Conditions when possible.

A File Condition is a condition that documents must satisfy in a search but that does not generate hits. File Condition searches can be much faster than adding words to a search request. Most commonly, File Conditions are used for things like filename patterns or date range searches (for example, a search for documents modified after December 12, 2000, or for documents named "x*.doc").  You can also require that documents satisfying a condition must contain one or more words.  Examples:

Document must contain a word beginning with abc:

xfilter(word "abc*")  

Document must contain "date" followed by a range from 20020101 to 20020131:

xfilter(word "date20020101~~date20020131")  

Document must contain a value between 20020101 and 20020131 in the field named datefield:

xfilter(word "datefield::20020101~~20020131")  

Please see the "File Conditions" topic in the dtSearch Engine help file for more information on the syntax for adding File Conditions to a search.  

(4) Replace frequently used and costly search expressions with SearchFilters.

If a search expression is frequently used in your application and is also relatively costly in terms of performance, consider replacing it with a SearchFilter. Two examples of search expressions that are good candidates for this approach are: expressions that will match most of the documents in a large document collection (for example, "IsDocument contains true"), and expressions that contain long lists of identifiers (for example, "(UserCode contains 10001) or (UserCode contains 10002) or ... or (UserCode contains 15679)"). SearchFilters can be constructed each time the index is updated, saved to disk, and applied to searches at close to zero performance cost. For information on SearchFilter objects, see "Limiting Searches with SearchFilters" in the dtSearch Engine Programmer's Reference.

For more information on efficient ways to implement document classification in searches, see Implementing document classification.

IndexCache

An index cache can make searching substantially faster when a series of searches must be executed against a small number of indexes. The index cache maintains a pool of open indexes that will be available for searching on any thread. This eliminates the need to open and close the indexes being searched for every search request.
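The benefit of reusing open indexes can be illustrated with a minimal Python sketch. This is not the dtSearch IndexCache: the class and the open_index callable are hypothetical, and the model keeps a single shared handle per index path, which may differ from how the real pool works.

```python
import threading

class SimpleIndexCache:
    """Keeps opened indexes so repeated searches do not reopen them."""

    def __init__(self, open_index):
        self._open_index = open_index  # callable: path -> opened index object
        self._open = {}
        self._lock = threading.Lock()  # the cache may be used from any thread
        self.open_calls = 0

    def get(self, path):
        """Return the cached index for path, opening it on first use."""
        with self._lock:
            if path not in self._open:
                self._open[path] = self._open_index(path)
                self.open_calls += 1
            return self._open[path]
```

A series of searches against the same few indexes then pays the open cost once per index rather than once per search request.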

API to use IndexCache:
.NET: Create an IndexCache object and call SearchJob.SetIndexCache
C/C++: Create a dtsIndexCache object and use dtsSearchJob.indexCacheHandle to attach it to a search

Execute vs. ExecuteInThread

Applications written in .NET or using the COM interface have two ways to execute a search: Execute, or ExecuteInThread. The choice between these two depends on how the calling application will monitor the progress of the search.

The most efficient option is to use Execute with the AutoStopLimit, TimeoutSeconds, and MaxFilesToRetrieve properties set up before the search to limit the amount of resources a search will consume. This is usually the best option for use on web servers.

Desktop-based applications may have to monitor and control a search job more closely. Usually there will be a "Cancel" button that the user can click to halt the search, and the application may also display a running count of the number of files retrieved. The need to implement a "Cancel" button and to track the progress of the search can be handled in either of two ways: (1) callback functions implemented through the StatusHandler property of a SearchJob, or (2) searching in a separate thread.

When an application uses Execute to start a SearchJob, it can use the StatusHandler callback functions to monitor the progress of a search and cancel the search if the user hits a "Cancel" button. ExecuteInThread starts the search job in a separate thread and returns immediately, and provides IsThreadDone and AbortThread methods to enable the calling application to cancel the search.

In an application that implements a "Cancel" button feature, ExecuteInThread is much faster than Execute because it eliminates the need for the callback functions. Callbacks from Visual Basic or other COM-based languages are very time-consuming because each one involves a separate IDispatch invocation. In .NET, callbacks are faster but still potentially time-consuming. Therefore, for applications that need to implement something like a "Cancel" button, ExecuteInThread should be used instead of Execute for SearchJobs.

Note: An application that uses ExecuteInThread should not call IsThreadDone continuously in a loop to check whether the thread is done. Instead, it should call Sleep between IsThreadDone calls to give the searching thread time to work.
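The polling pattern in the note can be sketched with Python's standard threading module. This models the approach only and does not call dtSearch: thread.is_alive() plays the role of an IsThreadDone check, the cancel event plays the role of an abort request, and search_fn and should_cancel are hypothetical callables.

```python
import threading
import time

def run_search_in_thread(search_fn, poll_interval=0.05,
                         should_cancel=lambda: False):
    """Run search_fn on a worker thread and poll it without busy-waiting.

    search_fn receives a threading.Event that is set when the caller
    asks to cancel. Returns (search_result, number_of_polls).
    """
    cancel_event = threading.Event()
    result = {}

    def worker():
        result["value"] = search_fn(cancel_event)

    thread = threading.Thread(target=worker)
    thread.start()
    polls = 0
    while thread.is_alive():
        if should_cancel():          # e.g. the user clicked "Cancel"
            cancel_event.set()
        time.sleep(poll_interval)    # yield so the search thread can work
        polls += 1
    thread.join()
    return result.get("value"), polls
```

Sleeping between checks keeps the monitoring thread from competing with the search for CPU time, at the cost of reacting to completion or cancellation up to one poll interval late.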