Troubleshooting PDF hit highlighting (dtSearch Engine)

Article: dts0152

Applies to: dtSearch Engine

This article covers problems that developers may encounter when displaying PDF files with hit highlighting in Adobe Reader through a browser interface.

Troubleshooting PDF Indexing

PDF file appears as blank page in dtSearch Web

PDF Files appear without hit highlighting in dtSearch Web

How PDF hit highlighting works

When a user clicks on a link to a PDF file on a web page, the browser loads Adobe Reader as a plug-in and uses it to display the page. The URL with the PDF filename can include a second URL that specifies words to highlight. The URL format looks like this:

https://www.dtsearch.com/sample.pdf#xml=https://www.dtsearch.com/hits.xml

The URL must use https. If the highlight URL uses http, the highlighter will not be able to access the highlight data.

The #xml= portion of the link points to a URL that returns an XML stream describing the location of the hits in the PDF file.
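As an illustration of the link format described above, the following Python sketch joins a PDF URL and a highlight-data URL with the #xml= syntax. The function name build_highlight_link is hypothetical and is not part of the dtSearch Engine API. The sketch applies the https requirement to both URLs, which is a conservative reading of the rule above.

```python
from urllib.parse import urlsplit


def build_highlight_link(pdf_url: str, hits_url: str) -> str:
    """Join a PDF URL and a highlight-data URL using the #xml= syntax.

    Hypothetical helper for illustration; not a dtSearch API.
    """
    for label, url in (("PDF", pdf_url), ("highlight", hits_url)):
        # Assumption: require https for both URLs to be on the safe side.
        if urlsplit(url).scheme.lower() != "https":
            raise ValueError(f"{label} URL must use https: {url}")
    if "#" in pdf_url:
        raise ValueError(f"PDF URL already contains a fragment: {pdf_url}")
    return f"{pdf_url}#xml={hits_url}"


print(build_highlight_link(
    "https://www.dtsearch.com/sample.pdf",
    "https://www.dtsearch.com/hits.xml",
))
```

Rejecting http URLs when the link is built makes the mistake visible on the server, rather than as missing highlighting on the client.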

Current Adobe Reader versions require a plug-in to enable hit highlighting. For information on the plug-in and a link to download it, please see https://www.dtsearch.com/pdfhl/

When the hit highlighting API works, hits will be highlighted in Adobe Reader inside the user's browser and hit navigation buttons on the Adobe Reader toolbar will let the user navigate from hit to hit. If it does not work, often the only symptom you will see is the absence of hit highlighting.

Usually the #xml portion of the URL does not point to a text file but instead requests the XML from a script or program, like this:

https://www.dtsearch.com/sample.pdf#xml=https://www.dtsearch.com/dtsearch.asp?cmd=getPdfHits&idoc=5

The dtSearch Engine provides a MakePdfWebHighlightFile method in the SearchResults object to generate this XML stream.

Troubleshooting steps

1. Verify that PDF highlighting works on the client machine

To test PDF highlighting on the client machine, open this link in Microsoft Edge and enable "IE Mode": https://support.dtsearch.com/pdftest

To set up Edge in IE Mode to use Adobe Acrobat Reader, see How to use Edge to view PDF files with hit highlighting.

If PDF highlighting does not work on the client machine, please see these articles for troubleshooting steps:

Troubleshooting PDF viewing problems in dtSearch Desktop/Network

Troubleshooting PDF hit highlighting problems in dtSearch Web

2. Check that the PDF files can be indexed, searched and highlighted

Index the PDF files with dtSearch Desktop and try searching. If searching does not work, please see:

Troubleshooting PDF indexing

3. Check the URLs that your application generates

Check that the URLs your application generates have the right format. The format for the URLs that provide hit highlighting information is:

https://www.example.com/sample.pdf#xml=https://www.example.com/hits.xml

The #xml= portion of the link points to a URL that returns an XML stream describing the location of the hits in the PDF file.
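One way to check generated links in bulk is a small script along these lines. This is a minimal sketch: check_highlight_link is a hypothetical helper, and it tests only the points this article mentions (the #xml= portion is present and the URLs are absolute https URLs). It does not contact the server.

```python
from urllib.parse import urlsplit


def check_highlight_link(link: str) -> list[str]:
    """Return a list of format problems found in a highlighting link.

    Hypothetical helper for illustration; an empty list means no
    problem was detected by these basic checks.
    """
    pdf_url, marker, hits_url = link.partition("#xml=")
    if not marker:
        return ["link has no #xml= portion"]

    problems = []
    # Assumption: the PDF URL is also expected to use https.
    if urlsplit(pdf_url).scheme.lower() != "https":
        problems.append(f"PDF URL does not use https: {pdf_url}")
    hits = urlsplit(hits_url)
    if hits.scheme.lower() != "https" or not hits.netloc:
        problems.append(f"highlight URL is not an absolute https URL: {hits_url}")
    return problems


for problem in check_highlight_link(
    "https://www.example.com/sample.pdf#xml=http://www.example.com/hits.xml"
):
    print(problem)
```

Running the check over the links in a results page is a quick way to catch a missing scheme or an http highlight URL before testing in Adobe Reader.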

4. Check that your application has implemented the IsPdfHighlighter protocol

To prevent use of the plug-in to send forged requests to web sites, the plug-in will send a standard validation request to make sure the target URL is really a PDF search highlighter. The validation request replaces the query in the original URL with "IsPdfHighlighter", and expects a response that contains "YesPdfHighlighter".

For example, suppose a user clicks this link:

https://www.example.com/harmless.pdf#xml=https://www.example.com/gethits.aspx?search=123456

Adobe Reader will open the PDF file, and the dtSearch plug-in will see the #xml= in the URL. To verify the target URL, before requesting the highlighting data, the dtSearch plug-in will first send this request:

https://www.example.com/gethits.aspx?IsPdfHighlighter
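The rewrite from the original link to the validation request can be sketched in Python as follows. The function validation_url is a hypothetical name; it illustrates the rule described above (replace the query of the #xml= target with IsPdfHighlighter) and is not the plug-in's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def validation_url(link: str) -> str:
    """Derive the IsPdfHighlighter request for a highlighting link.

    Hypothetical helper for illustration of the rule in this article.
    """
    _, marker, hits_url = link.partition("#xml=")
    if not marker:
        raise ValueError("link has no #xml= portion")
    parts = urlsplit(hits_url)
    # Keep scheme, host and path; replace the query; drop any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "IsPdfHighlighter", ""))


print(validation_url(
    "https://www.example.com/harmless.pdf"
    "#xml=https://www.example.com/gethits.aspx?search=123456"
))
```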

To test your script, you can enter the URL for your highlighting script in a browser window with the IsPdfHighlighter request added and check that the response includes YesPdfHighlighter.
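The same test can be scripted. The Python sketch below shows both sides of the exchange under stated assumptions: handle_query stands in for a highlighting script (a real application would typically implement this in its own server language), and answers_validation sends the IsPdfHighlighter request and looks for YesPdfHighlighter in the reply. All function names are hypothetical, and the "<XML></XML>" string is only a placeholder for real highlight data.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def handle_query(query: str, make_hits_xml) -> str:
    """Server side (hypothetical): answer the validation request,
    otherwise return the highlight data for the query."""
    if query == "IsPdfHighlighter":
        return "YesPdfHighlighter"
    return make_hits_xml(query)


def fetch_text(url: str) -> str:
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def answers_validation(script_url: str, fetch=fetch_text) -> bool:
    """Client side: send the IsPdfHighlighter request and check the reply.

    Assumes script_url has no query string of its own.
    """
    return "YesPdfHighlighter" in fetch(script_url + "?IsPdfHighlighter")


# Offline demonstration: route the request to handle_query instead of
# the network. The XML string here is a placeholder, not real hit data.
def fake_fetch(url: str) -> str:
    return handle_query(urlsplit(url).query, lambda query: "<XML></XML>")


print(answers_validation("https://www.example.com/gethits.aspx", fetch=fake_fetch))
```

To test a live script, call answers_validation with the script's URL and omit the fetch argument so that the request goes over the network.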

5. Generate a diagnostic log on a client machine where highlighting does not work

See How to generate a diagnostic log from the dtSearch PDF Search Highlighter, below, for instructions to turn on diagnostic logging. In the diagnostic log:

- Check that the URL detected in the log includes the #xml= syntax. If you do not see the #xml= syntax in the log, either your application is not generating the #xml= links or, if you are using Internet Explorer, the dtSearch PDF Search Highlighter BHO may be disabled in Internet Explorer.

- Check whether any error messages recorded in the log identify security issues that prevented highlighting from working.

- Check that the IsPdfHighlighter query was processed correctly with a response that includes "YesPdfHighlighter".

- Check that the XML your application returns is correctly formatted and does not include any extra content such as HTML headers or messages.
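For the last check, a rough screening of the response body can be automated. The sketch below is a heuristic only: find_extra_content is a hypothetical helper that looks for the kinds of extra content mentioned above (leading or trailing text, HTML page markup). It does not validate the response against the Adobe highlight file format.

```python
def find_extra_content(body: str) -> list[str]:
    """Flag content that should not surround the highlight XML.

    Hypothetical, heuristic helper; an empty list does not prove the
    XML itself is correct.
    """
    problems = []
    # Ignore a byte-order mark and surrounding whitespace.
    text = body.lstrip("\ufeff \t\r\n").rstrip()
    if not text.startswith("<"):
        problems.append("response has text before the first tag")
    if not text.endswith(">"):
        problems.append("response has text after the last tag")
    lowered = text.lower()
    for marker in ("<!doctype html", "<html"):
        if marker in lowered:
            problems.append(f"response contains HTML page markup: {marker}")
    return problems


for problem in find_extra_content("Server error 500\n<html><body>Oops</body></html>"):
    print(problem)
```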

6. Check that the XML stream returned from your application is correct

Save the XML that your application returns (for example, by opening the highlight URL with view-source in the browser and saving the result from Notepad) to a file named test.xml in the root folder of your web server. Save the PDF file to test.pdf in the same folder. Open your browser and enter the following URL, replacing "localhost" with the address of your web site, if appropriate:

https://localhost/test.pdf#xml=https://localhost/test.xml

If the XML stream is correct, test.pdf should appear in a browser window with hits highlighted. If hits are not highlighted, check the format of the information in test.xml against the Adobe documentation of the Highlight File Format.

How to generate a diagnostic log from the dtSearch PDF Search Highlighter

To generate a diagnostic log:

(1) Run the dtspdfcfg.exe utility (click Start > Programs > dtSearch Pdf Search Highlighter > dtSearch PDF Search Highlighter Options).

(2) Click Diagnostics... and check the box to Enable diagnostic logging.

(3) Try to open a PDF file that should have highlighting.

(4) Close all browser windows and all Adobe Reader windows.

(5) In the dtSearch PDF Search Highlighter Options program, click Zip logs for email to find the diagnostic logs.

Monitoring the log using dbgview.exe

You can also monitor the log in real time using the dbgview utility from the Microsoft Sysinternals web site. To use dbgview.exe to monitor highlighting, first open dbgview.exe and then open a browser window and execute a search. You should see diagnostic messages from the highlighter appear in dbgview.exe as soon as a PDF file opens.

Because of browser or Adobe Reader sandboxing, you may need to run dbgview as a limited user to see the log. Currently this is necessary when using Internet Explorer, but not with Chrome. To run dbgview as a limited user, use the psexec utility (also available from the Microsoft Sysinternals web site) to launch dbgview.exe with the -l command-line switch, like this:

psexec -l dbgview.exe