File Parser API

Use the File Parser API to add support for custom file formats to the dtSearch Engine.

Remarks

The File Parser API makes it possible to add support for custom file formats to the dtSearch Engine. For information on integrating an external file parser DLL into dtSearch Desktop, see the dtsMakeViewerInfo help topic. The File Parser API is currently available from C/C++ only.

Documents and Containers

There are two types of parsers: Document parsers and Container parsers. A Document parser extracts text from a document. A Container parser enumerates and extracts Documents stored in a container file. A WordPerfect parser would be an example of a Document parser. A PKZIP parser would be an example of a Container parser.

Overview of the File Parser API

A parser is added to the dtSearch indexing engine by calling dtsRegisterViewer with a dtsViewerInfo structure. The dtsViewerInfo contains information and function pointers that tell the dtSearch Engine how to determine whether the parser should be used for a particular document, how to extract text from the document, and, for containers, how to enumerate and extract the documents in the container.

For an example of a complete file parser, see the ExternalFileParser sample included with the dtSearch Engine. This parser works with a file format in which the letters a-m are translated to n-z and the letters n-z are translated to a-m (ROT13).
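
The translation described above can be sketched in a few lines of C++. This is an illustrative sketch only, not code taken from the ExternalFileParser sample: the function name rot13 is hypothetical, and treating uppercase letters the same way as lowercase ones is an assumption, since the description above mentions only a-m and n-z.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the dtSearch API.
// Rotates ASCII letters by 13 positions; every other byte passes through
// unchanged. Uppercase handling is an assumption of this sketch.
std::string rot13(const std::string& in) {
    std::string out = in;
    for (char& c : out) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>('a' + (c - 'a' + 13) % 26);
        } else if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>('A' + (c - 'A' + 13) % 26);
        }
    }
    return out;
}
```

Because the alphabet has 26 letters, applying the function twice returns the original text, so the same routine can both encode and decode a file in this format.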

Everything dtSearch needs to know about a parser (a "viewer" in the API's naming) is contained in the dtsViewerInfo structure. At startup, an external parser must create a dtsViewerInfo and register it by calling dtsRegisterViewer. The calling application is responsible for making sure this happens.
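
The startup registration pattern might look roughly like the following. To keep the sketch self-contained it does not use the dtSearch headers: SketchViewerInfo, sketchRegisterViewer, sketchRegistry, and registerRot13Parser are hypothetical stand-ins for illustration, and the real dtsViewerInfo and dtsRegisterViewer declarations should be taken from the dtSearch Engine headers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for dtsViewerInfo (NOT the real declaration).
// The real structure also carries recognition data and function pointers.
struct SketchViewerInfo {
    std::string name;
};

// Hypothetical stand-in for the engine's list of registered parsers.
std::vector<SketchViewerInfo>& sketchRegistry() {
    static std::vector<SketchViewerInfo> registry;
    return registry;
}

// Hypothetical stand-in for dtsRegisterViewer: stores a copy of the info.
void sketchRegisterViewer(const SketchViewerInfo& info) {
    sketchRegistry().push_back(info);
}

// Called once by the application at startup, before indexing or searching.
void registerRot13Parser() {
    SketchViewerInfo info;
    info.name = "ROT13 sample parser";
    sketchRegisterViewer(info);
}
```

The point of the sketch is the ordering: the application calls its registration routine before any indexing or searching starts, so the parser is already known when the first file is processed.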

How the dtSearch Engine Works with File Parsers

For each file that the dtSearch Engine indexes or searches, the following process is used:

1. Identify the file parser that should process the file, based on information in the recognitionSignature element of the file parser's dtsViewerInfo and the requested output format (see Output Formats below).
2. Call the makeViewer function supplied in the dtsViewerInfo to build a file parser object.
3. Using the handle returned by makeViewer, call the getFileInfo function to get basic information about the file (size, date, name). If the size is 0, the file will be ignored.
4. Repeatedly call the getTextBlock function to get text to index until getTextBlock returns a block with a length of 0. The Engine may call gotoBookMark with a bookmark pointer of 0 to rewind the file. getTextBlock calls after a rewind should start returning text from the beginning of the file again.
5. Call destroyViewer to destroy the object identified by the handle returned by makeViewer.
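
The process above can be sketched as a small driver loop. This is a simplified model under stated assumptions, not the dtSearch implementation: only the callback names (makeViewer, getFileInfo, getTextBlock, gotoBookMark, destroyViewer) come from the text above, while every type, signature, and helper prefixed with Sketch or toy is hypothetical. Parser selection is omitted, and the toy parser reads from an in-memory string rather than a file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins; the real declarations are in the dtSearch headers.
using SketchHandle = void*;
struct SketchFileInfo { std::string name; long size = 0; };
struct SketchTextBlock { const char* text = nullptr; std::size_t length = 0; };

struct SketchViewerInfo {
    SketchHandle (*makeViewer)(const std::string& contents);
    void (*getFileInfo)(SketchHandle h, SketchFileInfo& info);
    void (*getTextBlock)(SketchHandle h, SketchTextBlock& block);
    void (*gotoBookMark)(SketchHandle h, const void* bookmark);
    void (*destroyViewer)(SketchHandle h);
};

// Toy parser: serves an in-memory string in blocks of at most 4 bytes.
struct ToyParser { std::string data; std::size_t pos; };

SketchHandle toyMake(const std::string& contents) {
    return new ToyParser{contents, 0};
}
void toyInfo(SketchHandle h, SketchFileInfo& info) {
    ToyParser* p = static_cast<ToyParser*>(h);
    info.name = "toy.txt";
    info.size = static_cast<long>(p->data.size());
}
void toyBlock(SketchHandle h, SketchTextBlock& block) {
    ToyParser* p = static_cast<ToyParser*>(h);
    std::size_t n = std::min<std::size_t>(4, p->data.size() - p->pos);
    block.text = p->data.data() + p->pos;
    block.length = n;               // 0 once all text has been returned
    p->pos += n;
}
void toyGoto(SketchHandle h, const void* bookmark) {
    // A null bookmark means "rewind to the beginning of the file".
    if (bookmark == nullptr) static_cast<ToyParser*>(h)->pos = 0;
}
void toyDestroy(SketchHandle h) { delete static_cast<ToyParser*>(h); }

SketchViewerInfo makeToyViewerInfo() {
    return SketchViewerInfo{toyMake, toyInfo, toyBlock, toyGoto, toyDestroy};
}

// Engine side: drive one file through the parser and collect its text.
std::string indexOneFile(const SketchViewerInfo& vi, const std::string& contents) {
    std::string indexed;
    SketchHandle h = vi.makeViewer(contents);   // build the parser object
    SketchFileInfo info;
    vi.getFileInfo(h, info);                    // size, name, ...
    if (info.size != 0) {                       // size 0: file is ignored
        SketchTextBlock block;
        vi.getTextBlock(h, block);              // read one block, then
        vi.gotoBookMark(h, nullptr);            // rewind, as the Engine may do
        for (;;) {
            vi.getTextBlock(h, block);
            if (block.length == 0) break;       // end of text
            indexed.append(block.text, block.length);
        }
    }
    vi.destroyViewer(h);                        // always release the handle
    return indexed;
}
```

Calling indexOneFile(makeToyViewerInfo(), "hello world") collects the full text even though a block is read before the rewind, which illustrates why getTextBlock must restart from the beginning of the file after gotoBookMark is called with a bookmark pointer of 0.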

Group

C++ API

Topics

Topic	Description
Document File Parsers	A document file parser translates an input document into the requested output format.
Container File Parsers	A container file parser provides an interface to enumerate the documents within a container.
File Format Recognition	Each file parser provides a dtsViewerInfo that describes which documents that file parser should handle.
Output Formats	File parsers can return text in RTF or UTF-8.