How to display retrieved documents with hits highlighted
To display a retrieved document with hits highlighted, the dtSearch Engine provides APIs that can convert documents to easily-displayed formats (HTML, RTF, or text) with caller-supplied highlight markings around the hits.
To highlight hits in a document, dtSearch needs the following information:
When highlighting hits in a document retrieved from a search, use FileConverter.SetInputItem() to transfer all document properties from search results to the FileConverter in one step, eliminating the need to set items (1) through (4) individually.
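To make the idea of caller-supplied highlight markings concrete, here is a minimal Python sketch that wraps words at given offsets in "before" and "after" markers. It does not call the dtSearch Engine and is not how FileConverter is implemented. The function name highlight_hits, the 1-based offsets, and the whitespace-only word splitting are all assumptions made for illustration.

```python
import html


def highlight_hits(text, hit_offsets, before="<b>", after="</b>"):
    """Wrap the words at the given 1-based word offsets in highlight markers.

    Assumption: words are whitespace-delimited. A real engine applies its own
    word-breaking rules, so offsets taken from an index would not necessarily
    line up with this simple split.
    """
    hits = set(hit_offsets)
    out = []
    for offset, word in enumerate(text.split(), start=1):
        escaped = html.escape(word)  # keep the document text safe for HTML output
        out.append(f"{before}{escaped}{after}" if offset in hits else escaped)
    return " ".join(out)
```

For example, highlight_hits("the quick brown fox", [2, 4]) marks "quick" and "fox" and leaves the other words unchanged.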
For information on highlighting hits in specific formats, see:
Highlighting hits in HTML files
Highlighting hits in PDF files
Highlighting hits in XML files
For information on implementing hit navigation, see:
If you see incorrect hit highlighting in a document after a search:
(1) Check that you are using FileConverter.SetInputItem to ensure that all necessary document properties are transferred from SearchResults.
(2) Check that the document was not changed since it was last indexed.
(3) Check that the dtSearch version used to index the document is the same as the version being used to highlight hits.
Using indexes created with older dtSearch versions can result in hit highlighting errors if the newer version includes file parser changes that affect text extraction or word breaking. A minor upgrade between two recent dtSearch versions is unlikely to affect more than a very small number of documents. However, upgrading from a much older version such as 6.5 to the current version will affect many documents, and in these cases rebuilding indexes created with the older version is recommended to avoid hit highlighting errors.
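The following Python sketch illustrates why a change in word breaking can misplace highlights. The two tokenizers are hypothetical stand-ins for an older and a newer parser; they are not dtSearch's actual rules. The only point is that a word offset recorded under one rule can land on a different word under another.

```python
import re


def words_v1(text):
    # Hypothetical older rule: a hyphen joins its two sides into one word.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def words_v2(text):
    # Hypothetical newer rule: a hyphen separates words.
    return re.findall(r"[A-Za-z0-9]+", text)


text = "full-text search engine"

# Offset of the hit "search" as it would be stored at index time (1-based).
hit_offset = words_v1(text).index("search") + 1

# Word found at that same offset when the text is re-parsed with the newer rule.
highlighted = words_v2(text)[hit_offset - 1]
```

Under the older rule "search" is word 2, but under the newer rule word 2 is "text", so the highlight would fall on the wrong word until the index is rebuilt.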
HTML output from FileConverter may not be well-formed. For example, it may contain more than one <HTML>...</HTML> pair of tags. The reason is that it extracts pieces of HTML from different places depending on the file format and has to splice them all together. For example, an email will often include one or more email bodies in HTML, attachments that may be in HTML, and attachments in other formats that have to be converted to HTML. While it would be possible to scan the HTML output for errors in HTML syntax, this would require a potentially time-consuming and memory-consuming additional full pass through the converted data before anything is returned. If you need this in your application, you can use the library provided by this open-source project to add a post-conversion step to clean up the HTML: http://tidy.sourceforge.net
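If an application only needs to detect spliced output before deciding whether to run a cleanup step, a lightweight check may be enough. The sketch below uses Python's standard html.parser module to count <HTML> open and close tags in already-converted output. The names HtmlRootCounter and count_html_roots are hypothetical, and this is not part of the dtSearch API. It detects repeated root tags only and does not validate or repair the markup.

```python
from html.parser import HTMLParser


class HtmlRootCounter(HTMLParser):
    """Counts <html> start and end tags (tag names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self.opens = 0
        self.closes = 0

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.opens += 1

    def handle_endtag(self, tag):
        if tag == "html":
            self.closes += 1


def count_html_roots(markup):
    """Return (open_count, close_count) for <html> tags in the markup."""
    parser = HtmlRootCounter()
    parser.feed(markup)
    parser.close()
    return parser.opens, parser.closes
```

A result other than (1, 1) suggests the output was assembled from several HTML fragments and may benefit from a post-conversion cleanup pass.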
Copyright (c) 1995-2012 dtSearch Corp. All rights reserved.