Implementing Document Classification

Article: dts0179

Applies to: dtSearch Engine

Often it is useful to supplement a keyword search with a criterion that relates to a document's classification. For example, in a collection of HTML documents about cars, an option could be provided to limit a search by brand (Ford, Honda, etc.) or vehicle type (sedan, SUV, minivan).

Building classification information into the index

The first step to implementing document classification is to associate one or more fields with each document that will be used to define the categories. With cars, there could be a "Brand" field and a "Type" field. The field can either be embedded in the document (for example, as a META tag in an HTML file, or a document property in a Word document), or it could be associated with a document when the file is indexed, using the data source API. Each HTML file in the document collection on cars could contain two meta tags, like this:

<meta name="Brand" content="Ford">

<meta name="Type" content="sedan">

For more information on ways to associate field information with documents in an index, see:

How to add fields to documents during indexing

"Adding fields to documents" in the dtSearch Engine help file, dtengine.chm

Searching using classifications (option 1 - field searches)

Once the classification information is built into the index, it can easily be added to a search request. If a user indicates that he only wants to search for documents on Ford cars, the search request would be modified by adding a field search of the Brand field on Ford, like this:

[1] (user request) and (brand contains ford)

If the user wants to search only documents on Ford cars that are sedans, the search request would be modified by adding two field searches, one specifying that Brand must contain Ford, and the other specifying that the Type must be sedan:

[2] (user request) and ((brand contains ford) and (type contains sedan))

Searching using classifications (option 2 - xfilter searches)

While field searches are effective at classifying documents, in very large document collections they are relatively inefficient. Suppose, for example, that the collection of documents about cars contained 20,000,000 HTML files. The "brand contains ford" expression would generate millions of document matches, and the "type contains sedan" would also generate millions of matches. Adding these two criteria to a search will, therefore, greatly increase the amount of work that must be done to resolve the search request. In large document collections, optimizing these types of queries is critical for good performance.

One way to do this is with a type of query expression, an "xfilter" search, that is designed to implement a simple filter for documents, without considering hit locations or checking word proximity information. Documents either match the filter and can be retrieved, or they do not and are excluded. An xfilter can be combined with a boolean query using the standard and/or/not connectors, and can be Based on the document name, date, or size, or the presence in the document of a word. Examples:

[3] (user request) and xfilter(name "abc*.html")

This would match any document that contains (user request) and that is named abc*.html.

[4] (user request) and xfilter(word "sedan")

This would match any document that contains (user request) and that also contains the word "sedan". (Because xfilters are resolved without considering word locations, there is no way to specify a phrase in an xfilter. Only single words can be included in xfilter expressions.)

To add a field restriction to a word in an xfilter expression, add the field name before the word, separated by ::, like this:

[5] (user request) and (xfilter(word "Type::sedan") and xfilter(word "Brand::Ford"))

While this search request is logically equivalent to [2] above, it is much faster. Because dtSearch knows that it can ignore word locations, it can optimize the query, greatly reducing the time needed to process the two classification criteria.

Searching using classifications (option 3 - SearchFilter objects)

Another approach to classifications can take advantage of the fact that, in most cases, a relatively limited number of categories will be used repeatedly in searches. Rather than processing the classification criteria for every search, the classification criteria can be generated once each time the index is updated, and then reused for searches. The SearchFilter object provides a way to do this.

A SearchFilter is an efficient, in-memory object that identifies a set of documents from one or more indexes. When the SearchFilter is attached to a SearchJob, the SearchJob will only return documents that are part of the SearchFilter. Because of the way they are implemented, SearchFilters are faster than any other mechanism for selecting items to be included in search results. Once a SearchFilter has been constructed, it can be saved to disk and read from disk as needed. A web-Based application can keep its search filters in memory for use in searches. Once constructed, a search filter can be accessed from multiple threads simultaneously.

SearchFilters do not use names to identify documents because a filter may specify thousands, or hundreds of thousands, of documents, and a table of filenames would take too much memory and would take too long to check. Instead, each document is identified by (a) the index it belongs to, and (b) the document's DocId, a unique integer that is assigned to each document in an index. The docId for a document can be obtained by searching for the document; the document properties returned in Search Results will include the docId.

A SearchFilter is implemented in the dtSearch Engine using a table of bit vectors, one for each index in the filter. Each bit vector has one bit for each document in its index (the bit for each document corresponds to its docId). For example, a SearchFilter for a single index with 1,000,000 documents would have 1,000,000 bits, or 125 kilobytes of data. When a SearchFilters is written to disk, it is stored in a compressed format that generally takes substantially less space than the in-memory representation (which is optimized for speed).

For more information on SearchFilters, see:

Limiting searches with SearchFilters

dtsSearchFilter (C++)

dtSearch.Engine.SearchFilter (.NET)

com.dtSearch.engine.SearchFilter (Java)

SearchFilter (COM)