dtSearch Support
Last Reviewed: January 24, 2007
Article: DTS0152
Applies to: dtSearch Engine 6, 7
This article covers problems that developers may encounter when displaying PDF files with hit highlighting in Adobe Reader through a browser interface.
Related articles
Troubleshooting PDF Indexing (dtSearch Desktop)
Troubleshooting PDF Indexing (dtSearch Engine)
PDF file appears as blank page in dtSearch Web
PDF Files appear without hit highlighting in dtSearch Web
How PDF hit highlighting works
When a user clicks on a link to a PDF file on a web page, the browser loads Adobe Reader as a plug-in and uses it to display the page. Adobe Reader knows how to interpret a type of URL that provides hit highlight information. The URL format looks like this:
http://www.dtsearch.com/sample.pdf#xml=http://www.dtsearch.com/hits.xml
The #xml= portion of the link points to a URL that returns an XML stream describing the locations of the hits in the PDF file. The format of this XML is described in the following Adobe document, which is also included in the Acrobat SDK:
Adobe Technical Note 5172 -- Highlight File Format
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
When the hit highlighting API works, hits will be highlighted in Adobe Reader inside the user's browser and hit navigation buttons on the Adobe Reader toolbar will let the user navigate from hit to hit. If it does not work, often the only symptom you will see is the absence of hit highlighting.
Usually the #xml= portion of the URL does not point to a static text file but instead requests the XML from a script or program, like this:
http://www.dtsearch.com/sample.pdf#xml=http://www.dtsearch.com/dtsearch.asp?cmd=getPdfHits&idoc=5
The dtSearch Engine provides a MakePdfWebHighlightFile method in the SearchResults object to generate this XML stream. For sample code demonstrating this, please see the dtsearch.asp sample included with the dtSearch Engine.
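The link format described above can be assembled mechanically. The following Python sketch is illustrative only and is not part of the dtSearch Engine; the function name build_highlight_url is hypothetical, and it assumes the XML URL is appended verbatim after #xml=, as in the examples above.

```python
def build_highlight_url(pdf_url: str, xml_url: str) -> str:
    """Return a link that opens pdf_url and points Adobe Reader at xml_url
    for hit locations (hypothetical helper, not a dtSearch API)."""
    # Drop any fragment already on the PDF URL so "#xml=" is the only one.
    base = pdf_url.split("#", 1)[0]
    return base + "#xml=" + xml_url


link = build_highlight_url(
    "http://www.dtsearch.com/sample.pdf",
    "http://www.dtsearch.com/dtsearch.asp?cmd=getPdfHits&idoc=5",
)
print(link)
```

Building the link in one place makes it easier to confirm that the scheme, host, and query string of the XML URL are intact before troubleshooting further.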
Troubleshooting steps
1. Check that the XML stream is being generated.
If you are generating XML using an ASP script, a script error may be preventing the XML from being generated correctly. Because of the way PDF links are created, this script error would not normally be visible.
To check the XML stream returned by your application, first generate a link that should display a PDF file with hit highlighting. Then copy the URL that follows #xml= in that link and paste it into your Internet Explorer address bar with view-source: in front, like this:
view-source:http://www.example.com/SampleScript.asp?getPdfHits&...
The XML will appear in Notepad. (If you just enter the URL of the XML stream directly, the browser will refuse to display it because Adobe's XML hit highlight stream is not well-formed XML.) If the script failed, you may see a script error message instead, indicating why the XML stream was not generated.
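The same check can be scripted outside the browser. The Python sketch below is not part of dtSearch, and split_highlight_link and fetch_raw are hypothetical names. It separates the XML URL from a highlighting link and returns the raw response text without parsing it as XML, so any script error message in the body stays visible.

```python
import urllib.error
import urllib.request


def split_highlight_link(link: str) -> tuple[str, str]:
    """Split a highlighting link into (pdf_url, xml_url)."""
    pdf_url, sep, xml_url = link.partition("#xml=")
    if not sep or not xml_url:
        raise ValueError("link has no #xml= portion")
    return pdf_url, xml_url


def fetch_raw(url: str, timeout: float = 10.0) -> str:
    """Return the response body as text, without parsing it as XML."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A failing script may answer with an error status; the body
        # of that response usually contains the error message.
        return err.read().decode("utf-8", errors="replace")
```

A typical use would be to call split_highlight_link on the generated link and print the result of fetch_raw on the second element.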
2. Check that the XML stream is correct.
To do this, save the results of the view-source URL above from Notepad to a file named test.xml in the root folder of your web server. Save the PDF file to test.pdf in the same folder. Open your browser and enter the following URL, replacing "localhost" with the address of your web site, if appropriate:
http://localhost/test.pdf#xml=http://localhost/test.xml
If the XML stream is correct, test.pdf should appear in a browser window with hits highlighted. If hits are not highlighted, check the information in test.xml against the Adobe documentation of the Highlight File Format (see link above).
If incorrect XML is being generated, use the debug logging feature in the dtSearch Engine to generate a log of your request for the XML stream to see what is going wrong. For information on debug logging with the dtSearch Engine, see "Diagnostic Tools for Visual Basic and ASP Developers" and "Diagnostic tools for .NET developers".
3. Check the way you are returning the XML stream
The hit highlighting process is very sensitive to variations in the way you return the XML stream. See the GetPdfHits function in the dtsearch.asp sample included with the dtSearch Engine for sample code. Your script should: (1) indicate that the content type of the data is "text/xml", (2) set CacheControl to "no-cache" (this prevents old XML streams from being re-used), and (3) avoid using pragma no-cache, which we have found will prevent hit highlighting from working. Example:
' Return the highlight data as text/xml
Response.ContentType = "text/xml"
' DO NOT USE THIS: Response.AddHeader "Pragma", "no-cache"
' INSTEAD, USE THIS (prevents old XML streams from being re-used):
Response.CacheControl = "no-cache"
' Reconstruct search results for the item the user clicked on
Dim res
Set res = Engine.NewSearchResults
res.UrlDecodeItem(Request.ServerVariables("QUERY_STRING"))
res.GetNthDoc(0)
' Generate the XML stream for that document and send it to the client
Dim hits
hits = res.MakePdfWebHighlightFile()
Response.Write(hits)
Set res = Nothing
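The three requirements listed in this step can be checked against the headers your script actually returns. This is a minimal Python sketch, separate from dtSearch; check_highlight_headers is a hypothetical helper that takes a plain mapping of header names to values, however you obtain them. Tolerating a charset suffix on the content type is an assumption of this sketch, not something the article confirms.

```python
from typing import List, Mapping


def check_highlight_headers(headers: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the XML stream's response headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}
    problems = []
    # (1) Content type should be text/xml (charset suffix tolerated: assumption).
    if h.get("content-type", "").split(";")[0].strip() != "text/xml":
        problems.append("Content-Type is not text/xml")
    # (2) Cache-Control should include the no-cache directive.
    directives = [d.strip() for d in h.get("cache-control", "").split(",")]
    if "no-cache" not in directives:
        problems.append("Cache-Control does not include no-cache")
    # (3) Pragma: no-cache should be absent.
    if "no-cache" in h.get("pragma", ""):
        problems.append("Pragma: no-cache is present")
    return problems
```

An empty list means the headers match the recommendations in this step. It does not by itself prove that highlighting will work in every browser.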
4. Avoid returning framesets
A frameset is a convenient way to display search results, because the list of items can appear in one frame with the currently-selected document in the other. When setting up framesets, we recommend that you create the frameset in your search form, as dtSearch Web does, rather than trying to generate the frameset dynamically in response to a search.
It is possible to generate framesets with dynamically-generated content in the src= links, so that when a user clicks on a link the PDF file will appear inside a frame with other frames displaying other content relating to the search. The src= link for the PDF file would contain the same hit-highlighting URL that you would use if linking to the document directly.
There are two problems with this approach:
(1) When a link asks the browser to open a PDF file in a new window or frame, as opposed to one that already exists when the link is clicked, sometimes the result is a blank page.
(2) The src= links in framesets do not seem to work as consistently as href= links when highlighting hits.
5. Check your site with different browser versions
Some problems only affect specific browser versions. For example, pragma no-cache (mentioned in step 3 above) prevents hit highlighting from working in Netscape 4.72 but not in Internet Explorer 5.5.