dtSearch Text Retrieval Engine Programmer's Reference
Highlighting hits - overview

How to display retrieved documents with hits highlighted.

To display a retrieved document with hits highlighted, the dtSearch Engine provides APIs that can convert documents to easily-displayed formats (HTML, RTF, or text) with caller-supplied highlight markings around the hits. 

To highlight hits in a document, dtSearch needs the following information:

  1. The name of the file to use as input
  2. The word offsets of the hits, usually obtained from search results
  3. The location of the alphabet file to use for word breaking. For indexed searches, this should be the index folder. (If the alphabet file used when a file is indexed is different from the one used when generating hit highlighting, word counts may be incorrect because different characters are treated as causing a word break.)
  4. The location of the index the document was retrieved from, and the document id of the document in the index.
  5. The output format (it_HTML, it_RTF, it_Utf8, or it_Ansi)
  6. The markings to be used around each hit.

When highlighting hits in a document retrieved from a search, use FileConverter.SetInputItem() to transfer all document properties from search results to the FileConverter in one step, eliminating the need to set items (1) through (4) individually. 
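
As a rough illustration of why items (2), (3), and (6) matter, the following sketch highlights hits by word offset in plain text. It does not use the dtSearch API. The function name highlight_hits, the 1-based offset convention, and the regular expression standing in for the alphabet file are all assumptions made for this example.

```python
import html
import re


def highlight_hits(text, hit_offsets, before="<b>", after="</b>",
                   word_chars=r"[A-Za-z0-9]"):
    """Wrap the words at the given word offsets in caller-supplied markers.

    Hypothetical helper, not part of dtSearch. Offsets are assumed to be
    1-based, and word_chars stands in for the alphabet: any character that
    does not match it is treated as a word break.
    """
    hits = set(hit_offsets)
    out = []
    pos = 0
    # Each regex match is one word; its position in the sequence is its offset.
    for number, match in enumerate(re.finditer(f"{word_chars}+", text), start=1):
        out.append(html.escape(text[pos:match.start()]))  # text between words
        word = html.escape(match.group())
        out.append(f"{before}{word}{after}" if number in hits else word)
        pos = match.end()
    out.append(html.escape(text[pos:]))  # trailing text after the last word
    return "".join(out)


if __name__ == "__main__":
    sample = "state-of-the-art search engine"
    # Hyphen breaks words: "search" is word 5.
    print(highlight_hits(sample, [5]))
    # Hyphen is a word character: the same offset no longer points at "search".
    print(highlight_hits(sample, [5], word_chars=r"[A-Za-z0-9-]"))
```

With the default word characters, offset 5 lands on "search". When the hyphen is treated as part of a word, the text contains only three words, so offset 5 matches nothing. This is the kind of mismatch that results when different alphabet settings are used for indexing and for highlighting.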

For information on highlighting hits in specific formats, see: 

Highlighting hits in HTML files 

Highlighting hits in PDF files 

Highlighting hits in XML files 

For information on implementing hit navigation, see: 

Hit navigation 

Hit highlighting errors

If you see incorrect hit highlighting in a document after a search:

(1) Check that you are using FileConverter.SetInputItem to ensure that all necessary document properties are transferred from SearchResults. 

(2) Check that the document was not changed since it was last indexed. 

(3) Check that the dtSearch version used to index the document is the same as the version being used to highlight hits. 

Using indexes created with older dtSearch versions can result in hit highlighting errors if the newer version includes file parser changes that affect text extraction or word breaking. 

dtSearch can automatically correct for these types of errors by re-scanning the document for the search request when highlighting hits. To enable this option, set the flag dtsConvertAutoUpdateSearch in FileConverter.
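
The idea of re-scanning a document for the search request can be sketched in a few lines. This is a conceptual illustration only, not how dtSearch implements dtsConvertAutoUpdateSearch. The function names are hypothetical, the offsets are assumed to be 1-based, and the search request is reduced to a list of single words (real requests can contain phrases, wildcards, and other operators).

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")  # stand-in for the word-breaking rules


def words_of(text):
    """Return the lowercased words of text in document order."""
    return [m.group().lower() for m in WORD.finditer(text)]


def offsets_are_stale(text, stored_offsets, terms):
    """True if any stored offset no longer points at one of the search terms."""
    words = words_of(text)
    wanted = {t.lower() for t in terms}
    return any(o < 1 or o > len(words) or words[o - 1] not in wanted
               for o in stored_offsets)


def rescan_offsets(text, terms):
    """Recompute hit offsets by searching the current text for the terms."""
    wanted = {t.lower() for t in terms}
    return [i for i, w in enumerate(words_of(text), start=1) if w in wanted]


def offsets_for_highlighting(text, stored_offsets, terms):
    """Use the stored offsets if they still match, otherwise re-scan."""
    if offsets_are_stale(text, stored_offsets, terms):
        return rescan_offsets(text, terms)
    return list(stored_offsets)


if __name__ == "__main__":
    # Offsets recorded when the document read "fast search engine".
    stored = [2]
    current = "a very fast search engine"
    print(offsets_for_highlighting(current, stored, ["search"]))
```

In this example the document gained two words after it was indexed, so the stored offset points at "very" and the sketch falls back to offsets recomputed from the current text.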

HTML output

HTML output from FileConverter may not be well-formed. For example, it may contain more than one <HTML>...</HTML> pair of tags. The reason is that dtSearch extracts pieces of HTML from different places depending on the file format and has to splice them all together. For example, an email will often include one or more message bodies in HTML, attachments that may be in HTML, and attachments in other formats that have to be converted to HTML. While it would be possible to scan the HTML output for errors in HTML syntax, this would require a potentially time-consuming and memory-consuming additional full pass through the converted data before anything is returned. If you need well-formed output in your application, you can add a post-conversion step that cleans up the HTML using the library provided by the open-source HTML Tidy Project.
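
One way to decide whether a clean-up step is worth running is to count the <HTML> start tags in the converted output. The sketch below uses only the Python standard library. The names TagCounter and count_start_tags are hypothetical, and the check is itself an extra pass over the data, so it only makes sense where that cost is acceptable.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name (HTMLParser lowercases tag names)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_start_tags(markup, tag="html"):
    """Return how many <tag> start tags appear in markup."""
    counter = TagCounter(tag)
    counter.feed(markup)
    counter.close()
    return counter.count


if __name__ == "__main__":
    # Stand-in for converter output: two HTML documents spliced together.
    converted = "<HTML><BODY>message</BODY></HTML><html><body>attachment</body></html>"
    if count_start_tags(converted) > 1:
        print("output contains more than one <HTML> element")
```

A count greater than one indicates that several HTML fragments were spliced together, which is the case in which a tidy pass is most likely to be useful.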