How to display retrieved documents with hits highlighted.
To display a retrieved document with hits highlighted, the dtSearch Engine provides APIs that convert documents to easily displayed formats (HTML, RTF, or text) with caller-supplied highlight markings around the hits.
To highlight hits in a document, dtSearch needs the following information:
1. Input file name – The name of the file to use as input.
2. Hit word offsets – Usually obtained from search results.
3. Alphabet file location – Used for word breaking. For indexed searches, this should be the index folder. Note: If the alphabet file used during indexing differs from the one used for highlighting, word counts may be incorrect due to different word break rules.
4. Index location and document ID – The location of the index the document was retrieved from, and the document ID within the index.
5. Output format – One of: itHTML, itRTF, itUtf8, or itAnsi.
6. Highlight markings – The markings to use around each hit.
When highlighting hits in a document retrieved from a search, use FileConverter.SetInputItem() to transfer all document properties from search results to the FileConverter in one step. This eliminates the need to set items (1) through (4) individually.
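The note in item 3 about mismatched alphabet files can be illustrated with a small, self-contained Python sketch. It is not dtSearch code and does not use the dtSearch API. The function highlight_hits, the two regular expressions standing in for word-break rules, and the choice to count word offsets from 1 are all assumptions made for illustration. The sketch only shows how hit word offsets, word-break rules, and highlight markings combine, and how a different set of word-break rules moves the highlight to the wrong word.

```python
import re

def highlight_hits(text, hit_word_offsets, word_pattern, before="<b>", after="</b>"):
    """Wrap the words at the given word offsets in highlight markings.

    Hypothetical helper: words are counted from 1 using word_pattern,
    which stands in for the word-break rules of an alphabet file.
    """
    hits = set(hit_word_offsets)
    pieces = []
    last = 0
    for count, match in enumerate(re.finditer(word_pattern, text), start=1):
        if count in hits:
            pieces.append(text[last:match.start()])         # text before the hit
            pieces.append(before + match.group(0) + after)  # the marked hit
            last = match.end()
    pieces.append(text[last:])                              # text after the last hit
    return "".join(pieces)

text = "The state-of-the-art engine highlights hits"

# Two illustrative sets of word-break rules (assumptions, not dtSearch defaults):
HYPHEN_JOINS = r"[A-Za-z0-9-]+"   # a hyphen is part of a word
HYPHEN_BREAKS = r"[A-Za-z0-9]+"   # a hyphen separates words

# Suppose the index was built with HYPHEN_JOINS, so "engine" is word 3.
same_rules = highlight_hits(text, [3], HYPHEN_JOINS)
other_rules = highlight_hits(text, [3], HYPHEN_BREAKS)

print(same_rules)    # The state-of-the-art <b>engine</b> highlights hits
print(other_rules)   # The state-<b>of</b>-the-art engine highlights hits
```

With matching rules the hit lands on "engine". With the other rules, word 3 is "of", so the highlight is misplaced even though the offset itself is unchanged.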
Highlighting hits in HTML files
Highlighting hits in PDF files using annotations
If you see incorrect hit highlighting in a document after a search, check the following:
1. Ensure you are using FileConverter.SetInputItem() to transfer all necessary document properties from SearchResults.
2. Verify that the document has not changed since it was last indexed.
3. Confirm that the dtSearch version used to index the document matches the version used for highlighting hits.
Using indexes created with older dtSearch versions can result in hit highlighting errors if newer versions include file parser changes that affect text extraction or word breaking.
dtSearch can automatically correct these types of errors by re-scanning the document for the search request when highlighting hits. To enable this option, set the flag dtsConvertAutoUpdateSearch in FileConverter.
HTML output from FileConverter may not be well-formed. For example, it may contain more than one <HTML>...</HTML> pair of tags. This is because dtSearch extracts pieces of HTML from different places depending on the file format and splices them together. For example, an email may include multiple message bodies in HTML, attachments in HTML, and other attachments converted to HTML.
While it is possible to scan the HTML output for syntax errors, this would require a potentially time-consuming and memory-intensive additional pass through the converted data before anything is returned. If your application requires well-formed HTML, you can use the library provided by this open-source project to add a post-conversion cleanup step:
Html Tidy Project
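Before adding a cleanup step, it can help to check whether a given output actually contains more than one <HTML>...</HTML> pair. The following is a minimal sketch using only the Python standard library. It does not call dtSearch or HTML Tidy, and the names HtmlTagCounter and needs_cleanup are hypothetical. It is itself an extra pass over the converted data, so it is better suited to testing and diagnostics than to every request.

```python
from html.parser import HTMLParser

class HtmlTagCounter(HTMLParser):
    """Counts <html> start and end tags (HTMLParser lowercases tag names)."""

    def __init__(self):
        super().__init__()
        self.open_count = 0
        self.close_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.open_count += 1

    def handle_endtag(self, tag):
        if tag == "html":
            self.close_count += 1

def needs_cleanup(html_text):
    """Return True if the text holds more than one <html> start or end tag."""
    counter = HtmlTagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.open_count > 1 or counter.close_count > 1

# Example input resembling spliced output: a message body followed by an attachment.
spliced = (
    "<HTML><BODY>Message body</BODY></HTML>"
    "<HTML><BODY>Attachment</BODY></HTML>"
)
single = "<html><body>One document</body></html>"

print(needs_cleanup(spliced))  # True
print(needs_cleanup(single))   # False
```

Output flagged by this check could then be passed to the cleanup library, while output with a single pair is returned unchanged.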