Troubleshooting PDF indexing

Article: dts0108

 

dtSearch reports file is encrypted, but it can be opened in Adobe Reader

If a PDF file has a security password, dtSearch may not be able to open it to extract the text for indexing.

A PDF file may have a security password even if no password is needed to open it in Adobe Reader. For example, the password may prevent printing the document, changing the document, adding annotations, etc.

To see if a PDF file has a security password, open the file in Adobe Reader and click File > Properties > Security. The dialog box that appears shows whether the file has a password. For information on indexing PDF files that have security, please see Security passwords on PDF files.
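
When many files need checking, the manual inspection above can be approximated with a script. The following is a minimal Python sketch and is not part of dtSearch. It assumes that a PDF with security settings references its encryption dictionary through an /Encrypt entry in the file trailer, which is stored uncompressed. The function names has_encrypt_entry and find_protected_pdfs are hypothetical, and a plain byte search can occasionally misreport a file, so treat the result as a hint and confirm it in Adobe Reader.

```python
from pathlib import Path


def has_encrypt_entry(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes contain an /Encrypt key.

    Heuristic only: it does not parse the PDF, it just looks for the key
    that a protected PDF uses to reference its encryption dictionary.
    """
    return b"/Encrypt" in pdf_bytes


def find_protected_pdfs(folder: str) -> list[str]:
    """List .pdf files under `folder` that appear to have security settings."""
    return sorted(
        str(path)
        for path in Path(folder).rglob("*.pdf")  # case-sensitive on some systems
        if has_encrypt_entry(path.read_bytes())
    )
```

Calling find_protected_pdfs on a document folder returns the paths that are worth opening in Adobe Reader to check File > Properties > Security.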

dtSearch reports file is corrupt, but it can be opened in Adobe Reader

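Adobe Reader and Adobe Acrobat will automatically fix some file corruption problems in PDF files when a PDF file is opened. As a result, a damaged PDF file may open normally in Adobe Reader even though dtSearch reports it as corrupt.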

To fix a single PDF file, open it in Adobe Acrobat and save it using File > Save As.  This will usually fix any problems in the file and will also optimize the file for faster viewing.  After saving the PDF file in Adobe Acrobat, try to index it again in dtSearch.

To fix a large number of PDF files at once, you can use the "Action Wizard" in Adobe Acrobat Professional.
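
Before repairing a large batch, it can help to find out which files show obvious structural damage. The following Python sketch is an illustration only: it checks three markers that a well-formed PDF file is expected to have (a %PDF- header at the start, and a startxref keyword and %%EOF marker near the end). The function name basic_structure_problems is hypothetical, the checks dtSearch itself performs are not described here, and a file that passes these checks can still be corrupt in other ways.

```python
def basic_structure_problems(pdf_bytes: bytes) -> list[str]:
    """Return a list of obvious structural problems found in the raw bytes.

    An empty list means only that these three markers are present,
    not that the file is free of corruption.
    """
    problems = []

    # A PDF file is expected to begin with a "%PDF-" header line.
    if not pdf_bytes.startswith(b"%PDF-"):
        problems.append("missing %PDF- header at start of file")

    # The pointer to the cross-reference table and the end-of-file
    # marker are expected in the last part of the file.
    tail = pdf_bytes[-1024:]
    if b"startxref" not in tail:
        problems.append("missing startxref near end of file")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker near end of file")

    return problems
```

Files for which the function returns a non-empty list (for example, files truncated during a download or copy) are reasonable first candidates for the Save As or Action Wizard repair described above.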

PDF file is indexed without errors but no text is searchable

Some PDF files contain only image data, or contain text but no encoding information. In either case, there is no text in the PDF file that can be indexed, and OCR is needed to add text to the PDF file. For information on OCR tools that can add text to a PDF file, see How to use dtSearch or dtSearch Web with OCR.

Check whether the PDF file contains text

Some PDF files are nothing more than a PDF wrapper around a TIFF image, with no text in the file. To see if a PDF file contains text:

1.   Open the file in Adobe Reader

2.   Click on some text and try to select it with the mouse

3.   If Adobe Reader draws a rectangle instead of highlighting the selected text in blue, then the file is an image with no text.
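
For screening many files, a rough programmatic counterpart to the manual check above is to look for font dictionaries, because text in a PDF file is drawn with fonts and an image-only file normally declares none. The following Python sketch is a heuristic under stated assumptions, not a PDF parser: it searches the raw bytes and any zlib-compressed (Flate) streams for a /Type /Font entry. The names inflate_streams and declares_fonts are hypothetical. Encrypted files, or files using other compression filters for their objects, may be misclassified, and a file can declare a font without containing useful text.

```python
import re
import zlib

# Raw stream bodies sit between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Font dictionaries are identified by a /Type /Font entry.
FONT_RE = re.compile(rb"/Type\s*/Font\b")


def inflate_streams(pdf_bytes: bytes):
    """Yield the decompressed contents of every stream zlib can decode."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # JPEG data or another filter; skip it


def declares_fonts(pdf_bytes: bytes) -> bool:
    """Return True if the file appears to declare at least one font."""
    if FONT_RE.search(pdf_bytes):
        return True
    # Newer PDF files may keep font dictionaries inside compressed streams.
    return any(FONT_RE.search(data) for data in inflate_streams(pdf_bytes))
```

A file for which declares_fonts returns False is a likely candidate for OCR. A file for which it returns True should still be checked with the copy-and-paste test, since the text may lack valid encoding information.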

Check whether the PDF file contains valid encoding information

Some PDF files contain text but use an encoding that is meaningless outside of the PDF file. For each character, the PDF file contains embedded font information that describes how to draw the character, but the character codes do not correspond to an encoding that can be used to extract text from the file. As a result, the PDF file looks like a normal document but there is no meaningful text in the file.

To see whether a PDF file has valid encoding information:

1.   Open the file in Adobe Reader

2.   Click on some text and select it with the mouse

3.   Click Edit > Copy

4.   Open Notepad, Microsoft Word, or another program that can accept pasted text.

5.   Click Edit > Paste, or press Ctrl+V

If you see what looks like random letters instead of the text you copied from the PDF file, the PDF file lacks valid encoding information.
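
The comparison in the last step can also be done mechanically when testing several documents. The following Python sketch is only an illustration of that comparison: it takes a short phrase that you can read on the page and the text obtained by pasting, and reports whether the phrase appears in the pasted text after ignoring case and spacing differences. The function names normalize and paste_matches_page are hypothetical, and how the pasted text is obtained is left to the reader.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def paste_matches_page(visible_phrase: str, pasted_text: str) -> bool:
    """Return True if the phrase read from the page occurs in the pasted text."""
    phrase = normalize(visible_phrase)
    if not phrase:
        raise ValueError("visible_phrase must contain some text")
    return phrase in normalize(pasted_text)
```

If the function returns False for a phrase that is clearly visible on the page, the pasted text does not match what is displayed, which suggests that the file lacks valid encoding information and needs OCR before it can be indexed.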