Techniques to use for faster searching.
For each index that dtSearch searches, there are two steps to the search: (1) word lookup, and (2) enumeration of the documents that satisfy the search request. The first step generally takes a small fraction of a second, even for very large indexes, and is usually not a significant part of the total time required for the search. The time required for the second step is proportional to the number of documents retrieved. Therefore, the second step will take ten times longer for a search that finds 1000 files than for a search that finds 100 files.
An important factor in fast searching is, therefore, minimizing the time required for the second step, enumerating the files that match the search request. To do this:
(1) Use the dtsSearchDelayDocInfo flag in SearchJob to delay reading of document information records until they are needed.
(2) Use AutoStopLimit and MaxFilesToRetrieve to limit the number of items that can be retrieved in a search.
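The effect of capping the enumeration step can be illustrated with a small model. This is only a sketch and does not call the dtSearch Engine: the in-memory index, the matching_docs and search functions, and the stop_after parameter are hypothetical names that stand in for a limit such as AutoStopLimit.

```python
from itertools import islice


def matching_docs(index, word):
    """Lazily yield ids of documents containing word (the enumeration step)."""
    for doc_id, words in index.items():
        if word in words:
            yield doc_id


def search(index, word, stop_after=None):
    """Enumerate matches, stopping as soon as stop_after documents are found.

    stop_after is a hypothetical cap; None means no limit.
    """
    return list(islice(matching_docs(index, word), stop_after))


# Toy index: every even-numbered document contains the word "tax".
index = {i: ({"report", "tax"} if i % 2 == 0 else {"memo"}) for i in range(10_000)}

unlimited = search(index, "tax")
capped = search(index, "tax", stop_after=100)
print(len(unlimited), len(capped))  # 5000 100
```

Because the capped search stops enumerating after 100 matches, its cost no longer grows with the total number of matching documents.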
Applications written in Visual Basic, Delphi, or other languages that use the COM interface have two ways to execute a search: Execute or ExecuteInThread. The choice between these two depends on how the calling application will monitor the progress of the search.
The most efficient option is to use Execute with the AutoStopLimit, TimeoutSeconds, and MaxFilesToRetrieve properties set up before the search to limit the amount of resources a search will consume. This is usually the best option for use on web servers.
Desktop-based applications usually have to monitor and control a search job more closely. Usually there will be a "Cancel" button that the user can click to halt the search, and the application may also display a running count of the number of files retrieved. The need to implement a "Cancel" button and to track the progress of the search can be handled in either of two ways: (1) callback functions implemented through the StatusHandler property of a SearchJob, or (2) searching in a separate thread.
When an application uses Execute to start a SearchJob, it can implement ReceiveFound and CheckForAbort callback functions to monitor the progress of a search and cancel the search if the user hits a "Cancel" button. ExecuteInThread starts the search job in a separate thread and returns immediately, and provides IsThreadDone and AbortThread methods to enable the calling application to cancel the search.
In an application that implements a "Cancel" button feature, ExecuteInThread is much faster than Execute because it eliminates the need for the callback functions ReceiveFound and CheckForAbort. Callbacks from Visual Basic or other COM-based languages are very time-consuming because each one involves a separate IDispatch invocation. Using Execute and the callbacks generates an IDispatch invocation for each implemented callback function, each time a document is found. This can have a very significant effect on search performance. Therefore, for applications that need to implement something like a "Cancel" button, ExecuteInThread should be used instead of Execute for SearchJobs. (In C++, .NET, and Java, callbacks are much less expensive.)
An application that uses ExecuteInThread should not call IsThreadDone() continuously in a loop to check whether the thread is done. Instead, it should call Sleep() between IsThreadDone() calls to give the searching thread time to work.
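The polling pattern described above might look like the following sketch. It does not use the dtSearch COM interface: ThreadedSearch is a hypothetical stand-in whose execute_in_thread, is_thread_done, and abort_thread methods only mimic the roles of ExecuteInThread, IsThreadDone, and AbortThread, and wait_for, poll_interval, and cancel_requested are illustrative names.

```python
import threading
import time


class ThreadedSearch:
    """Hypothetical stand-in for a search job that runs in its own thread."""

    def __init__(self, total_docs):
        self._total = total_docs
        self._abort = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self.found = 0

    def _run(self):
        for _ in range(self._total):
            if self._abort.is_set():
                return
            self.found += 1
            time.sleep(0.0001)  # simulated work per document

    def execute_in_thread(self):
        self._thread.start()

    def is_thread_done(self):
        return not self._thread.is_alive()

    def abort_thread(self):
        self._abort.set()


def wait_for(job, poll_interval=0.05, cancel_requested=lambda: False):
    """Poll the job, sleeping between checks so the search thread can work."""
    polls = 0
    while not job.is_thread_done():
        if cancel_requested():  # e.g. the user clicked "Cancel"
            job.abort_thread()
        time.sleep(poll_interval)
        polls += 1
    return polls


job = ThreadedSearch(total_docs=50)
job.execute_in_thread()
wait_for(job)
print(job.found)  # 50
```

The sleep between checks is the important part: a loop that calls the "done" check continuously competes with the searching thread for processor time.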
dtSearch is designed to perform fast searches for keywords in documents. In building applications with dtSearch, developers often want to use the search indexes to perform other, more database-like functions. While this is possible with dtSearch, it is important to understand the trade-offs of various query types to make efficient use of dtSearch search functions.
(1) Minimize the number of "words" matched by your query
Consider a document manager that uses a document "ID" consisting of a two-letter state code and a number, such as ID_NY01234 or ID_CA54312. To enable users to search by state, the document manager would add "and ID_NY*" or "and ID_CA*" to the end of the query, so that only documents matching the requested state prefix would be returned.
The problem with this approach is that it generates a separate word match for each document in the state requested. If there are 20,000 New York documents, the search involves 20,000 words in addition to the user's search request.
A better approach is to design identifiers to support the types of searches you expect to perform. In this example, adding a space before the number in the document ID would dramatically improve the efficiency of the queries needed to search by state. The improved document IDs would look like "ID_NY 01234" and "ID_CA 54312". Making the state and the number separate words makes it possible to search for "ID_NY" (one word) instead of "ID_NY*" (potentially thousands of words) when looking for New York documents.
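The difference between the two identifier schemes can be counted with a short sketch. It models an index as a plain set of distinct words and treats a trailing "*" as a prefix wildcard; the expand function and the sample data are assumptions made for illustration, not part of the dtSearch Engine.

```python
def expand(pattern, index_words):
    """Return the distinct index words that a query term would match.

    A trailing '*' is treated as a prefix wildcard; any other term
    must match a word exactly.
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [w for w in index_words if w.startswith(prefix)]
    return [w for w in index_words if w == pattern]


# Original scheme: state and number fused into one word per document.
fused = {f"ID_NY{n:05d}" for n in range(20_000)} | {f"ID_CA{n:05d}" for n in range(20_000)}

# Improved scheme: the state prefix and the number are separate words.
split = {"ID_NY", "ID_CA"} | {f"{n:05d}" for n in range(20_000)}

print(len(expand("ID_NY*", fused)))  # 20000
print(len(expand("ID_NY", split)))   # 1
```

With the fused identifiers, the state restriction alone adds 20,000 words to the search; with the split identifiers it adds one.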
(2) Use File Conditions when possible
If for some reason it is impossible to avoid a search condition that generates thousands of word matches, consider using a "word" File Condition in your search request instead of a text search for the word. A File Condition is a condition that documents must satisfy in a search but that does not generate hits. File Condition searches can be much faster than adding words to a search request. Most commonly, File Conditions are used for things like filename patterns or date range searches (for example, a search for documents modified after December 12, 2000, or for documents named "x*.doc"). You can also require that documents satisfying a condition must contain one or more words. Examples:
xfilter(word "abc*")
Document must contain a word beginning with abc
xfilter(word "date20020101~~date20020131")
Document must contain "date" followed by a range from 20020101 to 20020131
xfilter(word "datefield::20020101~~20020131")
Document must contain a value between 20020101 and 20020131 in the field named datefield
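An application that builds these conditions in code could use small helpers such as the ones below. This is a sketch only: word_condition and field_date_range are hypothetical helper names, the strings they produce simply reproduce the examples above, and the rejection of embedded double quotes is a precaution added for illustration rather than a documented rule.

```python
from datetime import date


def word_condition(expression):
    """Wrap a word expression in the xfilter(word "...") form shown above."""
    if '"' in expression:
        raise ValueError("expression must not contain a double quote")
    return f'xfilter(word "{expression}")'


def field_date_range(field, start, end):
    """Build a range condition on a field holding dates written as YYYYMMDD."""
    return word_condition(f"{field}::{start:%Y%m%d}~~{end:%Y%m%d}")


print(word_condition("abc*"))
# xfilter(word "abc*")
print(field_date_range("datefield", date(2002, 1, 1), date(2002, 1, 31)))
# xfilter(word "datefield::20020101~~20020131")
```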
Please see the "File Conditions" topic in the dtSearch Engine help file for more information on the syntax for adding File Conditions to a search.
(3) Avoid using text searches that will match all or most of your documents
Another feature often added to document management products is a set of document properties that users can search on. For example, a collection of agency letter rulings might have an "IsSuperseded" flag indicating that the letter ruling is no longer regarded as valid. To search for only current documents, a document manager might add "...and (IsSuperseded contains false)" to the end of every query. A more efficient way to implement this type of searching is to apply a filter to search results after the search: implement the IsSuperseded flag as a stored field, search without the IsSuperseded part of the query, and then filter out superseded documents by checking the IsSuperseded flag in search results. This can also be done using the SearchFilter object, which provides an efficient way to limit a search to a subset of the document collection. For information on SearchFilter objects, see the "SearchFilter" topic in dtengine.chm (Visual Basic, ASP) or the dtsSearchFilter topic (C++).
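Filtering on a stored field after the search can be sketched as follows. The Hit class, its stored_fields dictionary, and the current_only function are hypothetical stand-ins for whatever the application reads from its search results; treating a missing flag as "not superseded" is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical search result item carrying its stored fields."""
    filename: str
    stored_fields: dict = field(default_factory=dict)


def current_only(hits):
    """Keep hits whose IsSuperseded stored field is not 'true'.

    A hit with no IsSuperseded field is treated as current.
    """
    return [
        h for h in hits
        if h.stored_fields.get("IsSuperseded", "false").strip().lower() != "true"
    ]


hits = [
    Hit("ruling_101.doc", {"IsSuperseded": "false"}),
    Hit("ruling_087.doc", {"IsSuperseded": "true"}),
    Hit("ruling_142.doc"),
]
print([h.filename for h in current_only(hits)])
# ['ruling_101.doc', 'ruling_142.doc']
```

The search itself then matches only the user's words, and the flag is checked once per retrieved document instead of being matched against nearly every document in the index.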
(4) Use SearchFilters instead of long lists of identifiers
In some applications, it is necessary to limit each search according to a complex set of criteria. For example, each user may have access to a defined subset of documents, with each set defined as a list of thousands of document identifiers. Where the criteria are time-consuming to generate but re-used often, they can be generated once as a SearchFilter each time the index is modified, saved to disk as compact files, and then applied to searches as needed.
A SearchFilter is an efficient, in-memory object that identifies a set of documents from one or more indexes. When the SearchFilter is attached to a SearchJob, the SearchJob will only return documents that are part of the SearchFilter. Because of the way they are implemented, SearchFilters are faster than any other mechanism for selecting items to be included in search results. Once a SearchFilter has been constructed, it can be saved to disk and read from disk as needed. A web-based application can keep its search filters in memory for use in searches. Once constructed, a search filter can be accessed from multiple threads simultaneously.
SearchFilters do not use names to identify documents because a filter may specify thousands, or hundreds of thousands, of documents, and a table of filenames would take too much memory and would take too long to check. Instead, each document is identified by (a) the index it belongs to, and (b) the document's DocId, a unique integer that is assigned to each document in an index. The docId for a document can be obtained by searching for the document; the document properties returned in Search Results will include the docId.
A SearchFilter is implemented in the dtSearch Engine using a table of bit vectors, one for each index in the filter. Each bit vector has one bit for each document in its index (the bit for each document corresponds to its docId). For example, a SearchFilter for a single index with 1,000,000 documents would have 1,000,000 bits, or 125 kilobytes of data. When a SearchFilter is written to disk, it is stored in a compressed format that generally takes substantially less space than the in-memory representation (which is optimized for speed).
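The bit-vector idea and the size arithmetic can be reproduced with a short model. This is not the dtSearch implementation: BitVectorFilter is a hypothetical class, document ids are assumed to start at zero for simplicity, the index name is made up, and zlib is used only as a stand-in to show that a sparse bit vector compresses well.

```python
import zlib


class BitVectorFilter:
    """One bit per document in a single index; bit n is set when id n is selected."""

    def __init__(self, doc_count):
        self.doc_count = doc_count
        self.bits = bytearray((doc_count + 7) // 8)  # 8 documents per byte

    def select(self, doc_id):
        if not 0 <= doc_id < self.doc_count:
            raise IndexError(doc_id)
        self.bits[doc_id >> 3] |= 1 << (doc_id & 7)

    def __contains__(self, doc_id):
        if not 0 <= doc_id < self.doc_count:
            return False
        return bool(self.bits[doc_id >> 3] & (1 << (doc_id & 7)))


# Table of bit vectors, one per index (the index name is made up).
search_filter = {"index_a": BitVectorFilter(1_000_000)}
for doc_id in (3, 42, 999_999):
    search_filter["index_a"].select(doc_id)

vector = search_filter["index_a"]
in_memory = len(vector.bits)
on_disk = len(zlib.compress(bytes(vector.bits)))

print(in_memory)                 # 125000
print(42 in vector, 43 in vector)  # True False
print(on_disk < in_memory)       # True
```

Checking whether a document belongs to the filter is a single byte lookup and a bit test, which is why this representation is fast regardless of how many documents the filter selects.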