dtSearch Text Retrieval Engine Programmer's Reference
Document File Parsers

A document file parser translates an input document into the requested output format.

How the dtSearch Engine Works with Document Parsers

For each file that the dtSearch Engine indexes or searches, the following process is used:

  1. Identify the file parser that should process the file, based on information in the recognitionSignature element of the file parser's dtsViewerInfo and the requested output format (see Output Formats below).
  2. Call the makeViewer function supplied in the dtsViewerInfo to build a file parser object.
  3. Using the handle returned by makeViewer, call the getFileInfo function to get basic information about the file (size, date, name). If the size is 0, the file is ignored.
  4. Repeatedly call the readTextBlock function to get text to index until readTextBlock returns a block with a length of 0. The Engine may call gotoBookMark with a bookmark pointer of 0 to rewind the file. After a rewind, readTextBlock calls should return text from the beginning of the file again.
  5. Call destroyViewer to destroy the object identified by the handle returned by makeViewer.
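
The call sequence in steps 2 through 5 can be sketched as a small driver loop. This is only an illustration under stated assumptions: SketchHandle, SketchFileInfo, SketchTextBlock, SketchViewerInfo, EchoParser and indexOneFile are hypothetical stand-ins, not the actual dtSearch declarations, and the text callback is named readTextBlock as in the callback descriptions in this document. File recognition (step 1) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for illustration; the real dtSearch structures
// and callback signatures differ.
using SketchHandle = void*;
struct SketchFileInfo { std::size_t size = 0; };
struct SketchTextBlock { std::string text; };

struct SketchViewerInfo {
    SketchHandle (*makeViewer)(const std::string& data);
    void (*destroyViewer)(SketchHandle h);
    void (*getFileInfo)(SketchHandle h, SketchFileInfo& info);
    void (*readTextBlock)(SketchHandle h, SketchTextBlock& block);
};

// A trivial parser that hands back its input four characters at a time.
struct EchoParser {
    explicit EchoParser(const std::string& d) : data(d), pos(0) {}
    std::string data;
    std::size_t pos;
};
SketchHandle echoMake(const std::string& d) { return new EchoParser(d); }
void echoDestroy(SketchHandle h) { delete static_cast<EchoParser*>(h); }
void echoInfo(SketchHandle h, SketchFileInfo& info) {
    info.size = static_cast<EchoParser*>(h)->data.size();
}
void echoRead(SketchHandle h, SketchTextBlock& block) {
    EchoParser* p = static_cast<EchoParser*>(h);
    block.text = p->data.substr(p->pos, 4);  // empty once pos reaches the end
    p->pos += block.text.size();
}

// Steps 2-5: build the parser, query file info, drain text, destroy.
std::string indexOneFile(const SketchViewerInfo& vi, const std::string& data) {
    assert(vi.makeViewer && vi.destroyViewer && vi.getFileInfo && vi.readTextBlock);
    std::string indexed;
    SketchHandle h = vi.makeViewer(data);
    SketchFileInfo info;
    vi.getFileInfo(h, info);
    if (info.size != 0) {                    // zero-size files are ignored
        for (;;) {
            SketchTextBlock block;
            vi.readTextBlock(h, block);
            if (block.text.empty()) break;   // a zero-length block ends the file
            indexed += block.text;
        }
    }
    vi.destroyViewer(h);
    return indexed;
}
```

The driver only ever sees the opaque handle and the function pointers, which is why each callback has to recover the parser object from the handle itself.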

How to Write a Document File Parser

  1. Determine how the file parser will recognize file formats that it handles. The recommended way to recognize a file format is by a unique signature in the beginning of a file. Fill in the recognitionSignature and recognize members of the dtsViewerInfo with the information needed to recognize the file format that the parser will handle.
  2. Implement an object that parses the file format into text. The constructor for the object should take a dtsInputStream. The object should attach the dtsInputStream to a dtsInputStreamReader, which the object should use to read data from the document.
  3. Write static functions to handle each of the callbacks needed by the Engine.
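
For step 1, the idea of recognizing a format by a signature at the beginning of the file can be sketched as a simple prefix comparison. The helper name startsWithSignature and the "MYFMT1" signature used with it are hypothetical; how the Engine actually interprets recognitionSignature is not described here, so this shows only the general technique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: true when the first bytes of the file match the
// format's signature exactly. An empty signature never matches.
bool startsWithSignature(const char* data, std::size_t len,
                         const std::string& signature) {
    assert(data != nullptr || len == 0);
    if (signature.empty() || len < signature.size()) {
        return false;  // nothing to match, or not enough data read yet
    }
    return std::memcmp(data, signature.data(), signature.size()) == 0;
}
```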

makeViewer2 should create an instance of the file parser object and should return a ViewerHandle identifying the object. The easiest way to do this is to use new to create a file parser object and then cast the pointer to the object to a ViewerHandle. The other static functions will be passed this ViewerHandle and should cast it back to the object pointer. 

destroyViewer should delete whatever makeViewer created. 
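
The handle round trip described above (create with new, cast to a handle, cast back, delete) might look like the following minimal sketch. It assumes the handle is an opaque pointer type; SketchHandle, MyParser, sketchMakeViewer, fromHandle and sketchDestroyViewer are hypothetical names, and the real makeViewer2 takes different parameters.

```cpp
#include <cassert>
#include <string>

using SketchHandle = void*;  // assumption: the handle is an opaque pointer

class MyParser {
public:
    explicit MyParser(const std::string& name) : name_(name) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Create the parser object and hand it out as an opaque handle.
SketchHandle sketchMakeViewer(const std::string& name) {
    return static_cast<SketchHandle>(new MyParser(name));
}

// Every other callback casts the handle back to the object pointer.
MyParser* fromHandle(SketchHandle h) {
    assert(h != nullptr);
    return static_cast<MyParser*>(h);
}

// Delete exactly what sketchMakeViewer created.
void sketchDestroyViewer(SketchHandle h) {
    delete fromHandle(h);
}
```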

readTextBlock should read a block of text from the input and store it in a dtsTextBlock. The text should be stored using the output format and encoding specified by the file parser's ViewerInfoFlags and the outputFormat requested in the dtsMakeViewerParams. The parser chooses the size of each block, but it must be less than the blockSize supplied in the viewer's dtsViewerInfo, because blockSize is used to allocate the dtsTextBlock's buffer. When all of a file has been parsed, readTextBlock should return an empty text block (no text, zero length).
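
The block-size rule and the end-of-file convention can be sketched as follows. SketchBufferBlock is a hypothetical stand-in for dtsTextBlock (the real structure has different members), and ChunkedParser is an illustrative class, not part of the API. Output format and encoding handling are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for dtsTextBlock: a buffer allocated by the caller.
struct SketchBufferBlock {
    char* buf;
    std::size_t bufSize;  // allocated from the registered block size
    std::size_t textLen;  // number of bytes the parser stored
};

class ChunkedParser {
public:
    explicit ChunkedParser(std::string text) : text_(std::move(text)), pos_(0) {}

    void readTextBlock(SketchBufferBlock& block) {
        assert(block.buf != nullptr && block.bufSize > 1);
        // Stay strictly below the buffer size, as the rule above requires.
        std::size_t room = block.bufSize - 1;
        std::size_t n = std::min(room, text_.size() - pos_);
        std::memcpy(block.buf, text_.data() + pos_, n);
        block.textLen = n;  // becomes 0 once all input has been returned
        pos_ += n;
    }

private:
    std::string text_;
    std::size_t pos_;
};
```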

gotoBookMark should reposition the parser's input pointer to the start of the text block identified by the dtsBookMark. If the dtsBookMark pointer is 0, rewind the input pointer to the start of the file. (Not needed for container parsers.) 
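
A minimal sketch of the two gotoBookMark cases, assuming a bookmark that records only a byte offset. SketchBookMark and SeekableParser are hypothetical; the fields of the real dtsBookMark are not described in this section.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical bookmark: the offset at which a text block starts.
struct SketchBookMark { std::size_t offset; };

class SeekableParser {
public:
    explicit SeekableParser(const std::string& text) : text_(text), pos_(0) {}

    // Bookmark for the block that the next readBlock call will return.
    SketchBookMark nextBlockBookMark() const { return SketchBookMark{pos_}; }

    std::string readBlock(std::size_t maxLen) {
        assert(maxLen > 0);
        std::string out = text_.substr(pos_, maxLen);
        pos_ += out.size();
        return out;
    }

    // Null pointer: rewind to the start of the file.
    // Otherwise: reposition to the start of the bookmarked block.
    void gotoBookMark(const SketchBookMark* bm) {
        pos_ = bm ? std::min(bm->offset, text_.size()) : 0;
    }

private:
    std::string text_;
    std::size_t pos_;
};
```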

getFileInfo should return basic information (size, date, name) about the document associated with the handle. All of this information is available in the dtsInputStream, and the parser can modify it as needed.

  4. Write a Register function to register the file parser, filling a dtsViewerInfo with recognition information and pointers to the static functions described above.
  5. Call the Register function after initializing the dtSearch Engine.
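
The shape of a Register function might resemble the sketch below. The Engine's actual registration call is not named in this section, so a local vector (sketchRegistry) stands in for it. SketchRegistration is a hypothetical, reduced stand-in for dtsViewerInfo, and DemoParser, the "DEMO1" signature and the callback signatures are illustrative assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

using SketchHandle = void*;  // assumption: the handle is an opaque pointer

// Hypothetical, reduced stand-in for dtsViewerInfo.
struct SketchRegistration {
    std::string name;
    std::string recognitionSignature;
    SketchHandle (*makeViewer)(const std::string& data);
    void (*destroyViewer)(SketchHandle h);
};

// Stand-in for the Engine's list of registered parsers.
std::vector<SketchRegistration>& sketchRegistry() {
    static std::vector<SketchRegistration> registry;
    return registry;
}

class DemoParser {
public:
    explicit DemoParser(const std::string& data) : data_(data) {}
private:
    std::string data_;
};

// Static callbacks that the registration record points at.
static SketchHandle demoMake(const std::string& data) { return new DemoParser(data); }
static void demoDestroy(SketchHandle h) { delete static_cast<DemoParser*>(h); }

// Fill in recognition information and callback pointers, then register.
void registerDemoParser() {
    SketchRegistration info;
    info.name = "Demo parser";
    info.recognitionSignature = "DEMO1";
    info.makeViewer = &demoMake;
    info.destroyViewer = &demoDestroy;
    assert(info.makeViewer && info.destroyViewer);
    sketchRegistry().push_back(info);
}
```

In a real program the equivalent of registerDemoParser would run once, after the Engine has been initialized, as step 5 states.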
Copyright (c) 1995-2021 dtSearch Corp. All rights reserved.