How to display retrieved PDF documents with hits highlighted
PDF highlighting only works if a compatible PDF viewer is installed on the client computer. If the client machine's PDF viewer does not support the hit highlighting API described below, then PDF files will appear without hit highlighting.
For current information on compatible PDF viewers, please see http://support.dtsearch.com/dts0166.htm
Adobe Reader X and Adobe Acrobat X do not support the highlighting mechanism described in this article, so a plug-in is needed to enable hit highlighting to work. For information on this plug-in and a link to download the current version of the plug-in, please see http://download.dtsearch.com/pdfhl
While it is possible to convert PDF files to HTML, it is better to highlight hits directly in Adobe Reader because then all aspects of the PDF file's appearance are preserved.
When a user clicks on a link to a PDF file on a web page, the browser loads Adobe Reader as a plug-in and uses it to display the file. Adobe Reader knows how to interpret a type of URL that provides hit highlight information. The URL format looks like this:
http://www.dtsearch.com/sample.pdf#xml=http://www.dtsearch.com/hits.xml
The #xml= portion of the link points to a URL that returns an XML stream describing the location of the hits in the PDF file. The format of the XML file is described in this document, which is also included in the Acrobat SDK.
Adobe Technical Note 5172 -- Highlight File Format
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
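To make the shape of the XML stream concrete, here is a minimal sketch that writes a highlight file for a list of hits. The element and attribute names (XML, Body, Highlight, loc, pg, pos, len) and the unquoted attribute style are assumed from the Highlight File Format technical note and should be verified against it. The Hit class and make_highlight_xml function are hypothetical names used only for illustration; in practice the dtSearch Engine generates this stream for you.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    page: int    # zero-based page index (assumed convention)
    offset: int  # position of the hit on the page, in the chosen units
    length: int  # length of the hit, in the chosen units


def make_highlight_xml(hits, units="characters", color="#ffff00"):
    """Build a highlight file body; the layout is an assumption, see above."""
    lines = [
        "<XML>",
        f"<Body units={units} color={color} mode=active version=2>",
        "<Highlight>",
    ]
    for hit in hits:
        # One loc entry per hit
        lines.append(f"<loc pg={hit.page} pos={hit.offset} len={hit.length}>")
    lines += ["</Highlight>", "</Body>", "</XML>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_highlight_xml([Hit(page=0, offset=12, length=5)]))
```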
When the hit highlighting API works, hits will be highlighted in Adobe Reader inside the user's browser and hit navigation buttons on the Adobe Reader toolbar will let the user navigate from hit to hit.
Usually the #xml portion of the URL does not point to a text file but instead requests the XML from a script or program, like this:
http://www.dtsearch.com/sample.pdf#xml=http://www.dtsearch.com/dtsearch.asp?cmd=getPdfHits&idoc=5
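Both URL forms can be assembled the same way. The sketch below is one possible helper, not part of the dtSearch Engine; highlight_url is a hypothetical name. The address after #xml= is left unescaped to match the examples above. Whether a particular viewer also accepts a percent-encoded address is not covered here.

```python
from urllib.parse import urlencode


def highlight_url(pdf_url, hits_url, params=None):
    """Return pdf_url with an #xml= fragment pointing at the hit data."""
    if params:
        # Append the query string that the script or program expects
        hits_url = hits_url + "?" + urlencode(params)
    return pdf_url + "#xml=" + hits_url


if __name__ == "__main__":
    print(highlight_url(
        "http://www.dtsearch.com/sample.pdf",
        "http://www.dtsearch.com/dtsearch.asp",
        {"cmd": "getPdfHits", "idoc": 5},
    ))
```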
The dtSearch Engine provides a MakePdfWebHighlightFile method in the SearchResults object to generate this XML stream. For sample code demonstrating this, please see the dtsearch.asp sample included with the dtSearch Engine.
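The request handling on the server side might look roughly like the sketch below. It only shows how the cmd and idoc parameters from the example URL could be validated and routed. fake_make_pdf_hits is a hypothetical stand-in for the engine call that produces the XML, and the text/xml content type is an assumption. The dtsearch.asp sample remains the authoritative reference.

```python
from urllib.parse import urlsplit, parse_qs


def fake_make_pdf_hits(doc_index):
    """Hypothetical stand-in for the engine call that produces hit XML."""
    return f"<XML><!-- hits for document {doc_index} --></XML>\n"


def handle_request(request_target, make_hits=fake_make_pdf_hits):
    """Return (status, content_type, body) for one request target."""
    query = parse_qs(urlsplit(request_target).query)
    if query.get("cmd") != ["getPdfHits"]:
        return 400, "text/plain", "unknown cmd\n"
    try:
        # idoc identifies which search result the hits belong to
        doc_index = int(query["idoc"][0])
    except (KeyError, ValueError):
        return 400, "text/plain", "missing or invalid idoc\n"
    return 200, "text/xml", make_hits(doc_index)
```

Returning an explicit error for a malformed request makes scripting mistakes in the #xml= handler easier to spot when the URL is tested directly in a browser.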
This mechanism for highlighting hits is difficult to troubleshoot because it involves interaction between the web server, the browser, Adobe Reader, and your application. The most common problem is a scripting error in the implementation of the #xml= portion of the URL. For troubleshooting suggestions to resolve problems with PDF hit highlighting, see this article on the dtSearch web site:
http://support.dtsearch.com/faq/dts0152.htm
The same browser-based interface can be used to view PDF files in a client application. To display a PDF file, the application would embed a WebBrowser control and use the Navigate() function to direct the control to a URL like the ones used on web sites (above).
The Adobe interface used to highlight hits only works consistently when a PDF file is accessed via HTTP. Therefore, to highlight hits in a local PDF file, it is still necessary to send the PDF file and highlight information to Adobe Reader via HTTP.
The dtSearch Engine includes tools to support two different mechanisms to do this: (1) an in-process COM object that implements an Asynchronous Pluggable Protocol, lbvProt.dll; and (2) an out-of-process HTTP server, dts_svr.exe, that implements a local-only HTTP server.
Currently lbvProt.dll is the preferred mechanism. Because it does not use any ports and integrates directly with the browser, it does not trigger any firewall warnings.
Highlighting hits in a client application involves interaction between your program, the embedded web browser control, the Adobe Reader instance embedded in the web browser, and any security software or firewalls that may be installed on the end-user system. Changes or unexpected behavior in any one of these components can prevent the highlighting mechanism from working. Therefore, for widely distributed applications it may be a good precaution to provide both mechanisms, with an option setting that lets the user select between them.
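The idea behind a local-only HTTP server can be illustrated with a short sketch. This is not how dts_svr.exe is implemented. It only shows the general technique of binding to the loopback address so that the PDF file and its highlight data are reachable via HTTP from the same machine and from nowhere else. start_local_server and the documents mapping are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def start_local_server(documents):
    """Serve {path: (content_type, bytes)} on the loopback interface only."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            path = self.path.split("?", 1)[0]
            entry = documents.get(path)
            if entry is None:
                self.send_error(404)
                return
            content_type, body = entry
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    # Port 0 lets the OS pick a free port; 127.0.0.1 is not reachable remotely.
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The chosen port is available as server.server_address[1], so the application can build the PDF and #xml= URLs against http://127.0.0.1:PORT/ before navigating the WebBrowser control. Because a listening port is involved, security software may still raise a warning, which is the drawback noted above.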
PDF files can contain attachments, which can be in any file format. If a PDF file has attachments, Adobe Reader cannot be used to display the file with hits highlighted, because Adobe Reader can only highlight hits in PDF content. Therefore, when a PDF file has attachments, hit highlighting can only be done by file conversion.
PDF files with attachments will have the TypeId it_PdfWithAttachments instead of it_PDF.
To make it possible to treat PDF files with attachments like other PDF files, you can suppress indexing of attachments. In this case, only the pages and properties of the PDF file itself will be indexed. To suppress indexing of attachments, set the flag dtsoFfPdfSkipAttachments in Options.FieldFlags.
Language Analyzer API
PDF hit highlighting inside Adobe Reader does not currently work if documents were indexed using a word breaker integrated using the language analyzer API. The only kind of hit highlighting that is supported in combination with the language analyzer API is conversion of files using FileConverter.
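Taken together with the attachment rule earlier in this article, the choice of display method can be summarized in a small decision function. This is a sketch only: the type identifiers are represented here as plain strings named after the TypeId constants, and choose_highlight_method and its return values are hypothetical names rather than part of the dtSearch Engine API.

```python
# String stand-ins for the engine's TypeId constants (hypothetical values)
IT_PDF = "it_PDF"
IT_PDF_WITH_ATTACHMENTS = "it_PdfWithAttachments"


def choose_highlight_method(type_id, uses_language_analyzer=False):
    """Pick how to display a retrieved document with hits highlighted."""
    if uses_language_analyzer:
        # Only file conversion is supported with the language analyzer API
        return "file_conversion"
    if type_id == IT_PDF:
        # Plain PDF: highlight directly in the PDF viewer
        return "adobe_reader"
    # PDF with attachments (and, in this sketch, anything else): convert
    return "file_conversion"
```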
Topics
dts_svr.exe provides a way to highlight hits in PDF files using a local-only HTTP server.
lbvProt.dll provides a way to highlight hits in PDF files using an Asynchronous Pluggable Protocol. This mechanism only works with an in-process WebBrowser control.
Copyright (c) 1995-2012 dtSearch Corp. All rights reserved.