How to use OCR output files with dtSearch or dtSearch Web

Last Reviewed: May 17, 2015

Article: DTS0167

 

Applies to: dtSearch 6, 7

See also: Adobe Reader X and XI information, PDF viewers that support highlighting hits

Scanned documents are usually stored as TIFF images, which are then converted using OCR (optical character recognition) into a text format such as HTML or Microsoft Word. Using dtSearch, dtSearch Desktop, or dtSearch Web, these text documents can then be indexed and made searchable.

In many cases, it is necessary to provide access to both the searchable text and the original image file, so that users can see exactly what the original document looked like. Because web browsers cannot display TIFF images without an image-viewing plug-in, the image files must be converted into another format if web access is needed.
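When the OCR text and the original images are kept as separate files, something has to link each text file back to its image. The following is a minimal sketch of one way to do that, assuming the OCR output and the TIFF share a base name (for example, invoice01.html and invoice01.tif). The function name, the extension lists, and the naming convention are illustrative assumptions, not part of dtSearch.

```python
from pathlib import Path

# Assumed extensions for OCR text output and scanned images (hypothetical choices).
TEXT_EXTENSIONS = {".html", ".htm", ".doc", ".docx", ".txt"}
IMAGE_EXTENSIONS = {".tif", ".tiff"}


def pair_text_with_images(filenames):
    """Map each OCR text file to the image sharing its base name, or None.

    `filenames` must be a list (it is scanned twice).
    """
    # First pass: index the images by lower-cased base name.
    images = {}
    for name in filenames:
        path = Path(name)
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            images[path.stem.lower()] = name

    # Second pass: look up the matching image for every text file.
    pairs = {}
    for name in filenames:
        path = Path(name)
        if path.suffix.lower() in TEXT_EXTENSIONS:
            pairs[name] = images.get(path.stem.lower())
    return pairs
```

A text file that maps to None has no matching image, which usually points to a missing scan or a naming mismatch.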

Using PDF to combine images and text

The PDF file format provides two ways to combine images and text in a single file. First, the "searchable image" or "image with hidden text" format stores the complete original page images, along with the text obtained through OCR. The text is "hidden" because, when a user opens the PDF, the user sees only the scanned image, not the text. Because the text is also in the file, dtSearch can index and search it. After a search, dtSearch or dtSearch Web can highlight hits directly on the scanned image when the image is displayed in Adobe Reader.
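Before indexing, it can be useful to check whether a scanned PDF actually contains a text layer or only page images. The sketch below is a rough heuristic, not a dtSearch feature: it looks for image objects and font resources in the raw bytes of a file. It assumes those dictionaries are stored uncompressed, so it can misreport PDFs that keep them in compressed object streams; a full PDF parser is the reliable approach. The function name and the result labels are hypothetical.

```python
import re

# PDF name tokens; whitespace between a key and its value is optional.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")
# \b keeps this from matching longer names such as /FontDescriptor.
FONT_RESOURCE = re.compile(rb"/Font\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether raw PDF bytes contain page images, text fonts, or both."""
    has_image = IMAGE_XOBJECT.search(data) is not None
    has_font = FONT_RESOURCE.search(data) is not None
    if has_image and has_font:
        return "image with text"   # consistent with "image with hidden text"
    if has_image:
        return "image only"        # scanned pages with no OCR text layer
    if has_font:
        return "text only"
    return "unknown"
```

To try it on a file, pass the file contents, for example classify_pdf_bytes(open("scan.pdf", "rb").read()), where scan.pdf is a placeholder name. A result of "image only" suggests the file has not been through OCR and will have no text to index.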

Another option for combining scanned images and text in a single PDF file uses small images for the parts of each scanned page that do not appear to be text, and uses fonts created from the scanned letters in the text. For example, a picture or a signature would be stored as a small image embedded in the page, while the rest of the page would be converted to text. This format produces much smaller files than the first alternative, because only a few small images are stored for each page, instead of a complete image of the whole page. Additionally, the text detected through OCR often becomes more readable, because it is stored as text with font information rather than as an image.

For more information on the benefits of each type of OCR output, these Adobe articles may be helpful:

Better PDF OCR. ClearScan is smaller, looks better
http://blogs.adobe.com/acrolaw/2009/05/better_pdf_ocr_clearscan_is_smal/

Comparing Scanned Documents Tips and Workarounds
http://blogs.adobe.com/acrolaw/2012/09/comparing-scanned-documents-tips-and-workarounds/

The PDF format is ideal for use on the web because multiple pages of images and text can be combined into a single, compressed file; anyone with a web browser and the free Adobe Reader viewer can view the files; and the text can be searched using dtSearch Web.

Many OCR products can generate PDF files from scanned images, including Adobe Acrobat (www.adobe.com).

Searching PDF files with dtSearch Web, dtSearch Desktop, or dtSearch Network

Once you have created PDF files with an OCR product such as Adobe Acrobat, you can index and search them with dtSearch Web just like any other documents. After a search, dtSearch Web will display a list of retrieved documents. When a user clicks on one of the documents, dtSearch Web will display the PDF file, with hits highlighted. If the PDF file is in the "image with hidden text" format, dtSearch Web will highlight hits directly on the image.

For a demo, see http://support.dtsearch.com/ocr/