How to use dtSearch Web with PDF files

Article: dts0116

 

Applies to: dtSearch Web

See also:
How to use dtSearch or dtSearch Web with OCR


Adobe Reader DC, XI, and X information
PDF viewers that support highlighting hits
Security passwords on PDF files

dtSearch Web can index and search PDF files and can highlight hits in retrieved files.  For information on highlighting hits in retrieved PDF files using the dtSearch Engine API, see "Highlighting hits in PDF files" in the dtSearch Engine API reference.

Server requirements

1.   Install dtSearch Web.  

2.   Copy the PDF files to a folder that is defined as a virtual directory in IIS. A subdirectory of the root folder for your web site (e.g., c:\inetpub\wwwroot\docs) will work.

3.   Index the PDF files with dtSearch Desktop. See the dtSearch Quick Start for more information on how to index documents with dtSearch.

4.   Use dtSearch Web's "Build Search Form" to build a search form. See the dtSearch Web Quick Start for more information.
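The relationship in step 2 between a file's location on disk and the address a browser would use can be sketched in Python. This is only an illustration, not part of dtSearch Web: the function name file_to_url, the virtual path "/docs", and the host "http://localhost" are assumptions, and the real URL depends on how the virtual directory is configured in IIS.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def file_to_url(file_path, physical_root, virtual_path, host="http://localhost"):
    """Map a file under a virtual directory's physical folder to a URL.

    Hypothetical helper: physical_root is the folder on disk, virtual_path
    is the alias assumed to be assigned to that folder in IIS.
    """
    # relative_to raises ValueError if the file is not inside physical_root
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(physical_root))
    # Percent-encode each path segment (spaces become %20, and so on)
    segments = [quote(part) for part in relative.parts]
    prefix = virtual_path.strip("/")
    parts = ([prefix] if prefix else []) + segments
    return host.rstrip("/") + "/" + "/".join(parts)


if __name__ == "__main__":
    print(file_to_url(
        r"c:\inetpub\wwwroot\docs\manuals\user guide.pdf",
        r"c:\inetpub\wwwroot\docs",
        "/docs",
    ))
    # prints http://localhost/docs/manuals/user%20guide.pdf
```

A file that lies outside the physical folder raises ValueError, which mirrors the point of step 2: only files under a folder that IIS serves can be retrieved by the browser.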

Setting up the client machines

To highlight hits in PDF files displayed in Adobe Reader, client machines need a plug-in, and Microsoft Edge must be configured to allow Adobe Reader to run.  For more information, please see "How to use Edge to view PDF files with hit highlighting".

The color used to highlight hits in Adobe Reader is controlled by the client machine's Display Options in Windows. The specific option is the "Selected Items" color, which defaults to white-on-blue. If you change it to black-on-yellow, Adobe Reader will show highlights in yellow. (This will also make your Windows menus and list boxes use yellow highlights.) There is no way to control this from the server.

Optimizing PDF files

Adobe Acrobat can restructure PDF files so they can open more quickly in a web browser.  This Adobe article explains how it works:
https://helpx.adobe.com/acrobat/using/optimizing-pdfs-acrobat-pro.html

Using the Action Wizard in Adobe Acrobat, you can optimize an entire folder of PDF files in one step:
https://helpx.adobe.com/acrobat/using/action-wizard-acrobat-pro.html
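Before optimizing a whole folder, it can be useful to see which files still need it. The Python sketch below assumes that the restructuring described above is PDF linearization (called "Fast Web View" in Acrobat). A linearized PDF carries a /Linearized entry within the first 1024 bytes of the file, so the check is a simple heuristic on the file header. The function names are hypothetical, and the check does not validate the rest of the file.

```python
from pathlib import Path


def is_linearized_header(head):
    """Heuristic: True if the leading bytes look like a linearized PDF."""
    # Only the first 1024 bytes are relevant for the /Linearized entry
    head = head[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head


def find_unoptimized(folder):
    """Return the PDF files under folder that do not appear linearized."""
    results = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            with open(path, "rb") as f:
                if not is_linearized_header(f.read(1024)):
                    results.append(path)
    return results


if __name__ == "__main__":
    # The folder below is an example path; substitute your own document folder
    for pdf in find_unoptimized(r"c:\inetpub\wwwroot\docs"):
        print(pdf)
```

Files reported by this sketch are candidates for the Acrobat optimization step. A file that passes the check may still be damaged or incorrectly linearized, so treat the result as a first filter only.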

Identifying PDF files that are not searchable

The dtSearch Desktop indexer generates reports in the index folder that identify files that were encrypted, corrupt, partially encrypted, or partially corrupt.  The reports also identify image-only PDF files, which are PDF files that do not contain any page text. An image-only PDF file is often an indication that OCR is needed.

To see the list of files in HTML format, click "View Log" in the Update Index dialog box, or open the file Index_LastUpdateErrors.html in the index folder.

To see the lists of files in each category in plain-text format, open the IndexLog_*.txt files in the index folder.
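To review all of the plain-text reports in one pass, a short Python script can gather every IndexLog_*.txt file in the index folder. This is a minimal sketch: the function name is hypothetical, the internal layout of the log files is not assumed (each non-empty line is returned as-is), and UTF-8 decoding with replacement of undecodable bytes is an assumption about the file encoding.

```python
from pathlib import Path


def read_index_logs(index_folder):
    """Return {log file name: non-empty lines} for each IndexLog_*.txt file."""
    logs = {}
    for log_path in sorted(Path(index_folder).glob("IndexLog_*.txt")):
        # Encoding is an assumption; undecodable bytes are replaced, not fatal
        text = log_path.read_text(encoding="utf-8", errors="replace")
        logs[log_path.name] = [line.strip() for line in text.splitlines() if line.strip()]
    return logs


if __name__ == "__main__":
    # The index folder below is an example path; substitute your own
    for name, lines in read_index_logs(r"c:\indexes\pdfdocs").items():
        print(f"{name}: {len(lines)} line(s)")
        for line in lines:
            print("   ", line)
```

The per-file line counts give a quick indication of how many entries each category contains before the individual reports are opened.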

Server-side option  

As an alternative to using the dtSearch plug-in, a server-side-only PDF hit-highlighting solution is available from Contegra Systems, Inc. This solution removes the need for each end user to install software separately to add hit highlights to retrieved PDFs.  For more information, please see https://contegrasystems.com