dtSearch Text Retrieval Engine Programmer's Reference
File Parser API

Used to add support for custom file formats to the dtSearch Engine.

Remarks

The File Parser API makes it possible to add support for custom file formats to the dtSearch Engine. For information on integrating an external file parser DLL into dtSearch Desktop, see the dtsMakeViewerInfo help topic. The File Parser API is currently available from C/C++ only. 

Documents and Containers 

There are two types of parsers: Document parsers and Container parsers. A Document parser extracts text from a document. A Container parser enumerates and extracts Documents stored in a container file. A WordPerfect parser would be an example of a Document parser. A PKZIP parser would be an example of a Container parser. 

Overview of the File Parser API 

A parser is added to the dtSearch indexing engine through a call to dtsRegisterViewer. A dtsViewerInfo structure is passed to dtsRegisterViewer. A dtsViewerInfo contains information and function pointers telling the dtSearch engine how to determine if the parser should be used for a particular document, how to extract text from the document, and, for containers, function pointers to use for enumerating and extracting documents from the container. 

For an example of a complete file parser, see vw_rot13.cpp, included with the dtSearch Engine. This parser works with a file format in which the letters a-m are translated to n-z and the letters n-z are translated to a-m. 
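
The character translation described above can be sketched in a few lines. This is an illustration only and is not the code in vw_rot13.cpp; the function name rot13 is hypothetical, and leaving uppercase letters and other characters unchanged is an assumption, since the description mentions only lowercase letters.

```cpp
#include <string>

// Hypothetical helper, not taken from vw_rot13.cpp.
// Swaps a-m with n-z as described above. Every other character,
// including uppercase letters (an assumption), is left unchanged.
std::string rot13(const std::string& in) {
    std::string out = in;
    for (char& c : out) {
        if (c >= 'a' && c <= 'm')
            c = static_cast<char>(c + 13);
        else if (c >= 'n' && c <= 'z')
            c = static_cast<char>(c - 13);
    }
    return out;
}
```

Applying the function twice returns the original text, so the same routine can serve for both encoding and decoding.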

Everything dtSearch needs to know about a viewer is contained in the dtsViewerInfo structure. At startup, an external parser must create a dtsViewerInfo and register the dtsViewerInfo by calling dtsRegisterViewer. The calling application is responsible for making sure this happens. 

How the dtSearch Engine Works with File Parsers 

For each file that the dtSearch Engine indexes or searches, the following process is used:

  1. Identify the file parser that should process the file, based on information in the recognitionSignature element of the file parser's dtsViewerInfo and the requested output format (see Output Format below).
  2. Call the makeViewer function supplied in the dtsViewerInfo to build a file parser object.
  3. Using the handle returned by makeViewer, call the getFileInfo function to get basic information about the file (size, date, name). If the size is 0 the file will be ignored.
  4. Repeatedly call the readTextBlock function to get text to index until readTextBlock returns a block with a length of 0. The Engine may call gotoBookMark with a bookmark pointer of 0 to rewind the file. readTextBlock calls after a rewind should start returning text from the beginning of the file again.
  5. Call destroyViewer to destroy the object identified by the handle returned by makeViewer.
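
Steps 2 through 5 can be pictured from the Engine's side roughly as follows. This is a sketch under stated assumptions, not the Engine's actual code: every type here (SketchHandle, SketchFileInfo, SketchTextBlock, SketchViewerInfo) is a hypothetical stand-in for the real declarations in the dtSearch headers, the file is represented by a std::string rather than a dtsInputStream, and the text-block callback uses the name readTextBlock, as in the section on writing a document file parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for illustration only; the real types are
// declared in the dtSearch Engine headers and differ in detail.
using SketchHandle = void*;
struct SketchFileInfo  { std::string name; long size = 0; };
struct SketchTextBlock { std::string text; };

struct SketchViewerInfo {
    SketchHandle (*makeViewer)(const std::string& fileData);
    void (*getFileInfo)(SketchHandle, SketchFileInfo&);
    void (*readTextBlock)(SketchHandle, SketchTextBlock&);
    void (*destroyViewer)(SketchHandle);
};

// Steps 2 through 5 of the process above, seen from the caller's side.
std::string extractAllText(const SketchViewerInfo& vi, const std::string& fileData) {
    std::string all;
    SketchHandle h = vi.makeViewer(fileData);       // step 2
    if (h == nullptr) return all;
    SketchFileInfo info;
    vi.getFileInfo(h, info);                        // step 3
    if (info.size != 0) {                           // a size of 0 means the file is ignored
        for (;;) {                                  // step 4
            SketchTextBlock block;
            vi.readTextBlock(h, block);
            if (block.text.empty()) break;          // a zero-length block ends the file
            all += block.text;
        }
    }
    vi.destroyViewer(h);                            // step 5
    return all;
}

// A toy parser that returns its input four characters at a time.
struct ToyParser { std::string data; std::size_t pos = 0; };

SketchViewerInfo makeToyViewerInfo() {
    SketchViewerInfo vi;
    vi.makeViewer = [](const std::string& d) -> SketchHandle {
        ToyParser* p = new ToyParser;
        p->data = d;
        return p;
    };
    vi.getFileInfo = [](SketchHandle h, SketchFileInfo& fi) {
        fi.name = "toy";
        fi.size = static_cast<long>(static_cast<ToyParser*>(h)->data.size());
    };
    vi.readTextBlock = [](SketchHandle h, SketchTextBlock& b) {
        ToyParser* p = static_cast<ToyParser*>(h);
        b.text = p->data.substr(p->pos, 4);
        p->pos += b.text.size();
    };
    vi.destroyViewer = [](SketchHandle h) { delete static_cast<ToyParser*>(h); };
    return vi;
}
```

The loop relies only on the contract stated in the steps: the parser signals the end of the file with an empty block, and the caller always pairs makeViewer with destroyViewer.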

How to Write a Document File Parser

  1. Determine how the file parser will recognize file formats that it handles. The recommended way to recognize a file format is by a unique signature in the beginning of a file. Fill in the recognitionSignature and recognize members of the dtsViewerInfo with the information needed to recognize the file format that the parser will handle.
  2. Implement an object that parses the file format into text. The constructor for the object should take a dtsInputStream. The object should attach the dtsInputStream to a dtsInputStreamReader, which the object should use to read data from the document.
  3. Write static functions to handle each of the callbacks needed by the Engine.

makeViewer2 should create an instance of the file parser object and should return a ViewerHandle identifying the object. The easiest way to do this is to use new to create a file parser object and then cast the pointer to the object to a ViewerHandle. The other static functions will be passed this ViewerHandle and should cast it back to the object pointer. 

destroyViewer should delete whatever makeViewer created. 
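
The handle pattern described above (create the object with new, hand out the pointer as an opaque handle, cast it back in every other callback) might look like this in outline. SketchHandle and SketchParser are hypothetical names, and the constructor takes a std::string as a stand-in for the dtsInputStream that a real parser would receive; the live-object counter exists only to make the pairing of create and destroy visible.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Engine's opaque handle type.
using SketchHandle = void*;

class SketchParser {
public:
    explicit SketchParser(const std::string& input) : input_(input) { ++liveCount(); }
    ~SketchParser() { --liveCount(); }

    // Create the parser object and return it as an opaque handle.
    static SketchHandle makeViewer(const std::string& input) {
        return new SketchParser(input);
    }

    // Delete whatever makeViewer created.
    static void destroyViewer(SketchHandle h) {
        delete fromHandle(h);
    }

    // Example of another callback: cast the handle back, then use the object.
    static std::size_t inputLength(SketchHandle h) {
        return fromHandle(h)->input_.size();
    }

    // Number of parser objects currently alive (for demonstration only).
    static int& liveCount() {
        static int n = 0;
        return n;
    }

private:
    static SketchParser* fromHandle(SketchHandle h) {
        return static_cast<SketchParser*>(h);
    }
    std::string input_;
};
```

Keeping the cast in one private helper means the other static functions never repeat it, which makes a mismatched cast less likely.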

readTextBlock should read a block of text from the input and store it in a dtsTextBlock. The text should be stored using the output format and encoding specified by the file parser's ViewerInfoFlags and the outputFormat requested in the dtsMakeViewerParams. The size of the block read is up to the viewer. It must be less than the blockSize supplied in the viewer's dtsViewerInfo, since this is used to allocate the dtsTextBlock's buffer. When all of a file has been parsed, readTextBlock should return an empty text block (no text, zero length). 

gotoBookMark should reposition the parser's input pointer to the start of the text block identified by the dtsBookMark. If the dtsBookMark pointer is 0, rewind the input pointer to the start of the file. (Not needed for container parsers.) 
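
A minimal sketch of the block-reading and bookmark behavior described above follows. SketchTextBlock and SketchBookMark are hypothetical stand-ins, not the real dtsTextBlock and dtsBookMark, and the offset field is an assumption made for illustration; the contents of a real bookmark are defined by the dtSearch headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for dtsTextBlock and dtsBookMark.
struct SketchTextBlock { std::string text; };
struct SketchBookMark  { std::size_t offset; };   // assumed field, for illustration

class ChunkedParser {
public:
    ChunkedParser(const std::string& data, std::size_t blockSize)
        : data_(data), blockSize_(blockSize), pos_(0) {
        assert(blockSize_ >= 2);   // need room for at least one character per block
    }

    // Return the next block, always shorter than blockSize.
    // Once the input is exhausted, the returned block is empty.
    void readTextBlock(SketchTextBlock& block) {
        block.text = data_.substr(pos_, blockSize_ - 1);
        pos_ += block.text.size();
    }

    // A null bookmark rewinds to the start of the file; otherwise
    // reposition to the start of the bookmarked block.
    void gotoBookMark(const SketchBookMark* bookmark) {
        pos_ = bookmark ? std::min(bookmark->offset, data_.size()) : 0;
    }

private:
    std::string data_;
    std::size_t blockSize_;
    std::size_t pos_;
};
```

With a blockSize of 4, the input "abcdefg" comes back as "abc", "def", "g", and then an empty block.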

getFileInfo gets basic information about the document associated with handle. All of this will be available in the dtsInputStream. The parser can modify this information as needed. 

  4. Write a Register function to register the file parser, filling a dtsViewerInfo with recognition information and pointers to the static functions described above.
  5. Call the Register function after initializing the dtSearch Engine.

How to Write a Container File Parser 

A container file parser is essentially the same as a document file parser except that it must support the following additional functions: getCount, getInfoByName, getInfoByIndex, extractToMem, extractToFile. Containers should be able to extract either to memory or to a file, although only the latter is required. 

Container parsers should handle situations in which the information supplied is incorrect. For example, the name may designate a file that is no longer stored in the container. 

Container File Names 

When a document that is stored inside a container is retrieved in a search, the filename that is returned describes the path to the document through the containers in which it is found. The path consists of the name of the disk file where the container is stored followed by one or more strings identifying items to be extracted from a container. Each string consists of an ordinal (in hex), a comma, the type id of the container (also hex), a | delimiter, and a text identifier for the item. The strings are delimited with >. For example, if "docssmith.doc" is stored as the fourth item in "c:\zips\november.zip", the filename would be: 

c:\zips\november.zip>4,df|docssmith.doc 
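
The naming scheme above can be reproduced with a small helper. This is an illustrative sketch, not an Engine function: ContainerStep and buildContainerPath are hypothetical names, and the type id 0xdf is taken from the example rather than from any table of container types.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One hop into a container: which item, in what kind of container.
// Hypothetical type, for illustration only.
struct ContainerStep {
    unsigned ordinal;     // position of the item in the container
    unsigned typeId;      // type id of the container
    std::string name;     // text identifier for the item
};

// Builds "diskfile>ordinal,typeid|name>ordinal,typeid|name...",
// writing the ordinal and type id in hex as described above.
std::string buildContainerPath(const std::string& diskFile,
                               const std::vector<ContainerStep>& steps) {
    std::ostringstream out;
    out << diskFile << std::hex;
    for (const ContainerStep& s : steps)
        out << '>' << s.ordinal << ',' << s.typeId << '|' << s.name;
    return out.str();
}
```

Because the ordinal is written in hex, the twenty-sixth item in a container would appear as 1a rather than 26.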

Recognition of File Formats 

Each parser provides in its dtsViewerInfo at least one of the following:

  • A filename pattern
  • A "signature" providing text that identifies the file format.
  • A "recognize" function that, given a dtsInputStream, will identify whether the dtsInputStream has the file format.

The following is a detailed summary of how the dtSearch Engine uses information in the dtsViewerInfo:

  1. Create a dtsInputStream to access the document or container.
  2. If a type id was provided in the dtsInputStream, attempt to build the parser indicated by the type id and, if successful, return this parser.
  3. For each registered file parser, determine the confidence level for the match with the dtsInputStream. Confidence levels range from zero to 100 (dtsMaxConfidence). If the confidence level equals dtsMaxConfidence, try to build a parser right away. If the parser is built successfully, use it. If the confidence level is less than dtsMaxConfidence, record the parser and confidence level in a table of possible matches and continue.
  4. If a parser was not successfully built in the preceding steps, iterate through the table of possible matches, starting with the highest confidence level returned, and give each a chance to build a parser from the dtsInputStream. Use the first one that succeeds.
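
The selection procedure in steps 3 and 4 can be sketched as follows. This is not the Engine's implementation: SketchParserEntry and selectParser are hypothetical, the stream is represented by a std::string, the type id shortcut in step 2 is omitted, and treating a confidence of zero as "no match" is an assumption. Parsers are visited in reverse order of registration, as described later in this topic.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

const int kMaxConfidence = 100;   // mirrors dtsMaxConfidence as described above

// Hypothetical registry entry, for illustration only.
struct SketchParserEntry {
    std::string name;
    int  (*confidence)(const std::string& stream);  // 0 to kMaxConfidence
    bool (*tryBuild)(const std::string& stream);    // true if a parser was built
};

typedef std::pair<int, const SketchParserEntry*> Candidate;

// Returns the name of the parser chosen for the stream, or "" if none.
std::string selectParser(const std::vector<SketchParserEntry>& registered,
                         const std::string& stream) {
    std::vector<Candidate> candidates;
    // Step 3: visit parsers, most recently registered first.
    for (auto it = registered.rbegin(); it != registered.rend(); ++it) {
        int c = it->confidence(stream);
        if (c >= kMaxConfidence) {
            if (it->tryBuild(stream)) return it->name;   // certain match: build now
        } else if (c > 0) {
            candidates.push_back(Candidate(c, &*it));    // remember for step 4
        }
    }
    // Step 4: try the remaining candidates, highest confidence first.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.first > b.first;
                     });
    for (const Candidate& cand : candidates)
        if (cand.second->tryBuild(stream)) return cand.second->name;
    return std::string();
}
```

The stable sort keeps candidates with equal confidence in the order they were visited, so a later-registered parser still wins a tie.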

The confidence level for a match between a dtsInputStream and a parser is determined as follows:

  1. Check the filenamePattern, if it is not NULL. If the name does not match, skip the parser. If the name does match, and neither a signature nor a recognize() function is provided, use filenameConfidence as the confidence level. If the name matches and a signature or a recognize() function is provided, continue to step 2.
  2. Check the recognitionSignature, if it is not empty. If the signature does not match, skip the parser. If the signature does match, and recognize() is NULL, the confidence value in recognitionSignature is used. If the signature does match and recognize() is not NULL, continue to step 3.
  3. Execute the function pointed to by recognize. The return value of recognize is the confidence level. Parsers should provide signatures whenever possible because they will be faster than the recognize() function.
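
The three-step confidence calculation can be expressed as one function. The sketch below makes several simplifying assumptions that are not part of the API: SketchRecognition is a hypothetical stand-in for the recognition members of dtsViewerInfo, the filename pattern is reduced to a plain suffix test, the signature is compared only at the start of the data, and a skipped parser is reported as confidence 0.

```cpp
#include <string>

// Hypothetical stand-in for the recognition fields of a dtsViewerInfo.
struct SketchRecognition {
    const char* filenameSuffix;                 // simplified pattern; nullptr if none
    int filenameConfidence;
    std::string signature;                      // empty if none
    int signatureConfidence;
    int (*recognize)(const std::string& data);  // nullptr if none
};

const int kSkip = 0;   // assumption: a skipped parser is reported as 0

static bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int matchConfidence(const SketchRecognition& r,
                    const std::string& filename,
                    const std::string& data) {
    const bool hasSignature = !r.signature.empty();
    const bool hasRecognize = r.recognize != nullptr;

    // Step 1: filename pattern.
    if (r.filenameSuffix != nullptr) {
        if (!endsWith(filename, r.filenameSuffix)) return kSkip;
        if (!hasSignature && !hasRecognize) return r.filenameConfidence;
    }
    // Step 2: signature, checked here at offset 0 only.
    if (hasSignature) {
        if (data.compare(0, r.signature.size(), r.signature) != 0) return kSkip;
        if (!hasRecognize) return r.signatureConfidence;
    }
    // Step 3: the recognize function has the final say.
    if (hasRecognize) return r.recognize(data);
    return kSkip;   // nothing was provided to recognize the format
}
```

The cheap checks run first, so recognize is only called for files that have already passed the filename and signature tests.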

Parsers are checked in the reverse order of their registration. External parsers, which will be registered last, can override internal parsers. 

Output Format 

A file parser can return data in formats other than plain text, and a file parser that returns the requested format will have precedence over a file parser that returns a different format. 

To indicate the file formats your parser can return, set one or more of the following flags in dtsViewerInfo.flags: viReturnsHtml, viReturnsRtf, or viUtf8CharSet. When your parser's makeViewer function is called, the dtsMakeViewerParams struct will indicate the requested output format in dtsMakeViewerParams.outputFormat. 

To ensure that hit highlighting is consistent, the text and word breaks must be identical regardless of the format returned. 

The File Type Table 

The File Type Table is an XML file that end-users can modify using dtSearch Desktop's Options > Preferences > File Types dialog box. In the developer API, the location of the File Type Table is provided in the FileTypeTableFile member of the Options object. 

The table contains a series of rules, each specifying a file type and a set of filename filters. When a set of filters is provided for a file type, that set of filters is used to detect files of that type in all dtSearch operations. Rules can operate in either of two ways: 

(1) Default rules apply only where another file parser does not match a file type with confidence greater than 50. For example, a default rule specifying that *.doc should be treated as XML would not override the MS Word file parser, which would still identify MS Word documents based on their binary header with confidence dtsMaxConfidence.

(2) "Override" rules, indicated by the Flags field of a rule with a value of 1, override all other file parsers and apply unconditionally. For example, a rule specifying that *.html and *.htm should be indexed as ANSI text would prevent the HTML file parser from recognizing those files based on their header and would force the entire contents of HTML files to be indexed as plain text.
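
The difference between the two kinds of rule comes down to one comparison, sketched below. The function and constant names are hypothetical; the sketch assumes the filename filter has already matched and that the best confidence reported by any file parser is available as an integer.

```cpp
// Values taken from the description above; the names are hypothetical.
const int kOverrideFlags = 1;          // Flags value marking an "Override" rule
const int kDefaultRuleThreshold = 50;  // default rules yield above this confidence

// Decides whether a File Type Table rule whose filename filter already
// matched should determine the file type.
bool ruleApplies(int ruleFlags, int bestParserConfidence) {
    if (ruleFlags == kOverrideFlags)
        return true;                                        // applies unconditionally
    return bestParserConfidence <= kDefaultRuleThreshold;   // default rule
}
```

In the *.doc example, the MS Word parser reports dtsMaxConfidence (100), so a default rule does not apply, while an override rule with the same filter would.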

Copyright (c) 1995-2008 dtSearch Corp. All rights reserved.