dtSearch Support
Last Reviewed: June 10, 2005
Article: DTS0151
Applies to: dtSearch Engine 6, 7
This article covers problems that developers may encounter indexing PDF files with the dtSearch Engine. For information on indexing PDF files with the dtSearch Desktop program, see Troubleshooting PDF Indexing.
Symptoms of PDF indexing problems
1. An error code is returned when your application attempts to index a PDF file
2. Garbage data appears in the word list in an index
3. Words that appear to be in the PDF files are not searchable
4. Hit highlighting appears random
5. Only English text appears to be indexed correctly
Potential Causes
1. The PDF files are locked with a security password
2. The PDF files do not contain any text
3. The PDF files contain text but have no encoding information
Troubleshooting Steps
Check whether the PDF file has a security password
If a PDF file has a security password, dtSearch may not be able to open it to extract the text for indexing. To see if a PDF file has a security password, open it in Adobe Reader and click File|Document Info|Security. A dialog box will appear that will tell you if the file has a password. For information on indexing PDF files that have security, please see "Security passwords on PDF files."
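When many files need checking, the manual test above can be roughly approximated in a script. The following is a minimal sketch, not part of the dtSearch Engine API. It assumes that an encrypted PDF file references its encryption dictionary through an /Encrypt key stored in plain bytes in the file trailer, as the PDF format specifies. The function names are hypothetical. A match means only that the file uses PDF security of some kind; it may still open without a password if only permission restrictions were set.

```python
from pathlib import Path


def has_encrypt_entry(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes mention an /Encrypt entry."""
    # Encrypted PDFs point to their encryption dictionary with an /Encrypt
    # key in the trailer (or cross-reference stream dictionary), which is
    # itself stored unencrypted. This is a byte-level heuristic, not a parser.
    return b"/Encrypt" in pdf_bytes


def file_has_encrypt_entry(path: str) -> bool:
    """Hypothetical convenience wrapper that reads the file from disk."""
    return has_encrypt_entry(Path(path).read_bytes())
```

A file flagged by this check is a candidate for the steps described in "Security passwords on PDF files."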
Check whether the PDF file contains text
Some PDF files are nothing more than a PDF wrapper around a TIFF image, with no text in the file. To see if a PDF file contains text,
1. Open the file in Adobe Reader
2. Select the Text Select Tool (press "V" or click its icon in the toolbar)
3. Try to select some text.
If no text can be selected, the file is probably an image-only PDF with no text to index.
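For a batch of files, a rough programmatic stand-in for this check is to look for font resources, since an image-only PDF usually has none. The sketch below is a heuristic under stated assumptions, not a dtSearch feature: it assumes that a PDF with extractable text declares at least one /Font name, either in plain bytes or inside a Flate-compressed stream. The function name is hypothetical, and the result should be confirmed in Adobe Reader when it matters.

```python
import re
import zlib

# Matches the raw body of each PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def mentions_fonts(pdf_bytes: bytes) -> bool:
    """Return True if the PDF bytes appear to declare any font resources."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may pack dictionaries into compressed object streams,
    # so try to inflate each stream and look again.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example, an image stream)
        if b"/Font" in inflated:
            return True
    return False
```

A result of False suggests a scanned, image-only file. A result of True does not guarantee that the text can be extracted correctly.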
Check whether the PDF file contains valid encoding information
Some PDF files contain text but use an encoding that is meaningless outside of the PDF file. For each character, the PDF file contains embedded font information that describes how to draw the character, but the character codes do not correspond to an encoding that can be used to extract text from the file. As a result, the PDF file looks like a normal document but there is no meaningful text in the file. For more information, see the "PDF" section in Unicode support in dtSearch 6.
To see whether a PDF has valid encoding information,
1. Open the file in Adobe Reader
2. Select the Text Select Tool (press "V" or click its icon in the toolbar)
3. Select a block of text
4. Click Edit|Copy
5. Open Notepad, Microsoft Word, or another program that can accept pasted text.
6. Click Edit|Paste
If you see what looks like random letters instead of the text you copied from the PDF file, the PDF file lacks encoding information.
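The copy-and-paste test relies on a person judging the pasted text. As a partial aid, the sketch below scores text that has already been extracted from a PDF by some other tool. It is a heuristic with a hypothetical function name, and it assumes that some unmapped character codes surface as control characters, private-use code points, unassigned code points, or U+FFFD. It will not catch garbage made of ordinary-looking random letters, which still needs a visual check.

```python
import unicodedata

# Unicode general categories that rarely occur in real document text:
# Cc = control, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def suspicious_char_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that look unmapped."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(visible)
```

A ratio well above zero is a hint that the file lacks usable encoding information. A ratio of zero does not prove that the text is meaningful.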