dtSearch Text Retrieval Engine Programmer's Reference
File Format Recognition

Each file parser provides a dtsViewerInfo that describes which documents that file parser should handle.

Each parser provides in its dtsViewerInfo at least one of the following:

  • A filename pattern.
  • A "signature" providing text that identifies the file format.
  • A "recognize" function that, given a dtsInputStream, determines whether the stream has the file format.

The following is a detailed summary of how the dtSearch Engine uses information in the dtsViewerInfo:

  1. Create a dtsInputStream to access the document or container.
  2. If a type id was provided in the dtsInputStream, attempt to build the parser indicated by the type id and, if successful, return this parser.
  3. For each registered file parser, determine the confidence level for the match with the dtsInputStream. Confidence levels range from zero to 100 (dtsMaxConfidence). If the confidence level equals dtsMaxConfidence, try to build a parser right away. If the parser is built successfully, use it. If the confidence level is less than dtsMaxConfidence, record the parser and confidence level in a table of possible matches and continue.
  4. If no parser was successfully built in step 3, iterate through the table of possible matches, starting with the highest confidence level returned, and give each a chance to build a parser from the dtsInputStream. Use the first one that succeeds.
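
The selection steps above can be sketched in C++ as follows. This is only an illustration under stated assumptions: Stream, Parser, Registration, tryBuild and selectParser are hypothetical stand-ins, not dtSearch Engine types or functions, and kMaxConfidence plays the role of dtsMaxConfidence. The sketch also assumes that a confidence of zero means "skip this parser" and that a failed build at maximum confidence simply lets the scan continue, neither of which is spelled out above.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are dtSearch types.
constexpr int kMaxConfidence = 100;  // plays the role of dtsMaxConfidence

struct Stream {
    std::string name;
    std::string data;
    int typeId;  // assumption: 0 means "no type id supplied"
};

struct Parser {
    std::string label;
};

struct Registration {
    int typeId;
    std::string label;
    std::function<int(const Stream&)> confidence;  // 0..kMaxConfidence
    std::function<bool(const Stream&)> canBuild;   // false simulates a failed build
};

// Attempts to build a parser; returns nullptr when the build fails.
std::unique_ptr<Parser> tryBuild(const Registration& r, const Stream& s) {
    if (!r.canBuild(s)) return nullptr;
    return std::make_unique<Parser>(Parser{r.label});
}

std::unique_ptr<Parser> selectParser(const std::vector<Registration>& registry,
                                     const Stream& s) {
    // Step 2: honor an explicit type id if one was supplied.
    if (s.typeId != 0) {
        for (const auto& r : registry) {
            if (r.typeId == s.typeId) {
                if (auto p = tryBuild(r, s)) return p;
                break;  // build failed: fall through to recognition
            }
        }
    }

    // Step 3: score every parser, newest registration first.
    std::vector<std::pair<int, const Registration*>> candidates;
    for (auto it = registry.rbegin(); it != registry.rend(); ++it) {
        const int c = it->confidence(s);
        if (c >= kMaxConfidence) {
            if (auto p = tryBuild(*it, s)) return p;  // certain match: build now
        } else if (c > 0) {
            candidates.emplace_back(c, &*it);  // possible match: remember it
        }
    }

    // Step 4: try the remaining candidates, highest confidence first.
    // stable_sort keeps the newest-first order among equal confidences.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const std::pair<int, const Registration*>& a,
                        const std::pair<int, const Registration*>& b) {
                         return a.first > b.first;
                     });
    for (const auto& entry : candidates) {
        if (auto p = tryBuild(*entry.second, s)) return p;
    }
    return nullptr;  // no registered parser could handle the stream
}
```

Calling selectParser with a registry of three entries, one of which reports maximum confidence but fails to build, falls back to the highest-scoring remaining candidate rather than giving up.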

The confidence level for a match between a dtsInputStream and a parser is determined as follows:

  1. Check the filenamePattern, if it is not NULL. If the name does not match, skip the parser. If the name does match, and neither a signature nor a recognize() function is provided, use filenameConfidence as the confidence level. If the name matches and a signature or a recognize() function is provided, continue to step 2.
  2. Check the recognitionSignature, if it is not empty. If the signature does not match, skip the parser. If the signature does match, and recognize() is NULL, the confidence value in recognitionSignature is used. If the signature does match and recognize() is not NULL, continue to step 3.
  3. Execute the function pointed to by recognize. The return value of recognize is the confidence level. Parsers should provide signatures whenever possible because they will be faster than the recognize() function.
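
The three checks above can be read as a cascade from cheapest to most expensive. The sketch below illustrates that cascade and is not the real dtsViewerInfo layout: ViewerInfoSketch, Signature, endsWith and matchConfidence are hypothetical names, the filename pattern is simplified to a case-sensitive filename suffix, the signature is assumed to sit at the start of the data, and a return value of zero stands for "skip the parser".

```cpp
#include <string>

// Hypothetical, simplified stand-ins; not the real dtsViewerInfo layout.
struct Signature {
    std::string text;  // empty means "no signature provided"
    int confidence;    // confidence to report when the signature matches
};

struct ViewerInfoSketch {
    const char* filenamePattern;  // nullptr means "no pattern"; treated as a suffix here
    int filenameConfidence;
    Signature recognitionSignature;
    int (*recognize)(const std::string& data);  // nullptr means "no function"
};

// Returns true when s ends with suffix (case-sensitive).
bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the confidence level, or 0 when the parser should be skipped.
int matchConfidence(const ViewerInfoSketch& info,
                    const std::string& filename,
                    const std::string& data) {
    const std::string& sig = info.recognitionSignature.text;
    const bool hasSignature = !sig.empty();
    const bool hasRecognize = info.recognize != nullptr;

    // Step 1: the filename pattern, when present, must match.
    if (info.filenamePattern != nullptr) {
        if (!endsWith(filename, info.filenamePattern)) return 0;
        if (!hasSignature && !hasRecognize) return info.filenameConfidence;
    }

    // Step 2: the signature, when present, must match.
    if (hasSignature) {
        if (data.compare(0, sig.size(), sig) != 0) return 0;
        if (!hasRecognize) return info.recognitionSignature.confidence;
    }

    // Step 3: the recognize function has the final say.
    return hasRecognize ? info.recognize(data) : 0;
}
```

Because the signature check runs before recognize, a parser that supplies both pays for the function call only on streams that already passed the cheap test, which is the reason signatures are recommended above.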

Parsers are checked in the reverse order of their registration. External parsers, which will be registered last, can override internal parsers.
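
A small sketch of why reverse-order checking lets a later registration win: when two entries both report maximum confidence, the scan from newest to oldest reaches the later one first. Entry and firstFullMatch are hypothetical names used only for this illustration, and the value 100 stands in for dtsMaxConfidence.

```cpp
#include <string>
#include <vector>

// Hypothetical registry entry; confidence is fixed here to keep the sketch short.
struct Entry {
    std::string label;
    int confidence;
};

// Scans from the most recently registered entry back to the first and returns
// the label of the first entry reporting maximum confidence ("" if none does).
std::string firstFullMatch(const std::vector<Entry>& registry,
                           int maxConfidence = 100) {
    for (auto it = registry.rbegin(); it != registry.rend(); ++it) {
        if (it->confidence >= maxConfidence) return it->label;
    }
    return "";
}
```

With an internal entry registered first and an external entry registered second, both at maximum confidence, firstFullMatch returns the external entry's label.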

Copyright (c) 1995-2021 dtSearch Corp. All rights reserved.