dtSearch Text Retrieval Engine Programmer's Reference
dtsContainer Class

API for enumeration and extraction from container file formats.

File: ContainerApi.h

Syntax
C++
class dtsContainer;

The Container API provides access to the dtSearch Engine's container support for content extraction. 

Identifying items in a container 

Container file parsers generate a name and "index" for each item in a container. Both must be unique within the container. The index can be any 64-bit integer that can be used to efficiently retrieve an item from the container, such as a byte offset or record identifier. 

Zero-based container item ordinals are generated by the wrapper class as a convenience. It is possible to retrieve an item from a container in a way that leaves the ordinal unknown (e.g., by name or index), in which case the ordinal value will be set to -1. An ordinal is only guaranteed to be known when the item is retrieved through getFirst/getNext, or when readDirectory has been called. 

The item count can also be -1 if it is not yet known. 
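
As a rough illustration of the identification rules above, the sketch below shows how calling code might guard against unknown ordinals and counts. ItemInfo, kUnknown, ordinalKnown, and describe are hypothetical names invented for this example; they are not part of ContainerApi.h, and the actual dtsContainer types may look quite different.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for per-item information; NOT a dtSearch type.
struct ItemInfo {
    std::string name;      // unique within the container
    std::int64_t index;    // parser-defined key, e.g. a byte offset or record id
    std::int64_t ordinal;  // zero-based position, or -1 when unknown
};

// Assumed sentinel, matching the -1 convention described in the text.
constexpr std::int64_t kUnknown = -1;

inline bool ordinalKnown(const ItemInfo& item) {
    return item.ordinal != kUnknown;
}

// Builds a display label, substituting "?" for an unknown ordinal or count
// instead of printing the -1 sentinel.
inline std::string describe(const ItemInfo& item, std::int64_t count) {
    std::string label = "#";
    label += ordinalKnown(item) ? std::to_string(item.ordinal) : "?";
    label += " of ";
    label += (count != kUnknown) ? std::to_string(count) : "?";
    label += ": ";
    label += item.name;
    return label;
}
```

The point of the sketch is only that the name and index are always usable as keys, while the ordinal and count should be checked against -1 before they are relied on.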

Navigating in a container 

These methods navigate in the container, leaving the seek position ready for an extraction operation: getFirst, getNext, getInfoByIndex, and getInfoByName.

Enumeration can only be done through getFirst/getNext calls, which read items sequentially from the container. The getInfoByIndex and getInfoByName methods can be used to seek directly to an item in the container, but they do not leave the container in a state that is ready for a getNext call. 
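
The interaction between sequential enumeration and direct seeks can be sketched with a small in-memory mock. MockContainer, Item, and listNames are hypothetical, and the method signatures are assumptions made for illustration; only the method names come from the text above. In particular, having getNext return false after a direct seek is this mock's choice for modeling "not ready for a getNext call" and is not documented dtsContainer behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical item record; NOT a dtSearch type.
struct Item {
    std::string name;
    std::int64_t index;
    std::int64_t ordinal;
};

// Hypothetical in-memory mock. Signatures are invented for this sketch.
class MockContainer {
public:
    explicit MockContainer(std::vector<Item> items) : items_(std::move(items)) {}

    // Starts a sequential pass; ordinals are known on this path.
    bool getFirst(Item& out) {
        pos_ = 0;
        enumerating_ = true;
        return emit(out);
    }

    bool getNext(Item& out) {
        if (!enumerating_) return false;  // a direct seek broke the sequence
        ++pos_;
        return emit(out);
    }

    // Direct seek by name: finds the item but leaves its ordinal unknown
    // and ends any enumeration in progress.
    bool getInfoByName(const std::string& name, Item& out) {
        enumerating_ = false;
        for (const Item& it : items_) {
            if (it.name == name) {
                out = it;
                out.ordinal = -1;
                return true;
            }
        }
        return false;
    }

private:
    bool emit(Item& out) {
        if (pos_ >= items_.size()) {
            enumerating_ = false;
            return false;
        }
        out = items_[pos_];
        out.ordinal = static_cast<std::int64_t>(pos_);
        return true;
    }

    std::vector<Item> items_;
    std::size_t pos_ = 0;
    bool enumerating_ = false;
};

// Typical enumeration loop: getFirst once, then getNext until it fails.
inline std::vector<std::string> listNames(MockContainer& c) {
    std::vector<std::string> names;
    Item item{};
    for (bool ok = c.getFirst(item); ok; ok = c.getNext(item)) {
        names.push_back(item.name);
    }
    return names;
}
```

A caller that needs to resume enumeration after a direct seek would, under this model, restart with getFirst rather than continue with getNext.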

Directory methods 

The readDirectory method reads the entire directory of a container, if it is not already known, by making getFirst/getNext calls. Callers should not assume anything about the container's seek position afterward. 

Once the directory is read, you can then retrieve items using findInDirectoryOrdinal and findInDirectoryByName (which does an exact match), without affecting the seek status of the container. 
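
The directory pattern described above amounts to paying the enumeration cost once and then answering lookups from a cached table. The following is a minimal sketch of that idea, assuming a hypothetical MockDirectory class; the signatures, the pointer return values, and the case-sensitive comparison are assumptions of this example and do not reflect the declarations in ContainerApi.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical directory entry; NOT a dtSearch type.
struct Entry {
    std::string name;
    std::int64_t index;
    std::int64_t ordinal;
};

// Hypothetical cache that models the directory methods.
class MockDirectory {
public:
    // Stands in for readDirectory: `scanned` plays the role of the items a
    // getFirst/getNext pass would yield. A second call is a no-op.
    void readDirectory(const std::vector<Entry>& scanned) {
        if (loaded_) return;
        for (const Entry& e : scanned) {
            Entry copy = e;
            copy.ordinal = static_cast<std::int64_t>(entries_.size());
            byName_[copy.name] = entries_.size();
            entries_.push_back(copy);
        }
        loaded_ = true;
    }

    // Count is -1 until the directory has been read.
    std::int64_t count() const {
        return loaded_ ? static_cast<std::int64_t>(entries_.size()) : -1;
    }

    // Lookups read only the cached table, so no seek state is touched.
    const Entry* findInDirectoryOrdinal(std::int64_t ordinal) const {
        if (!loaded_ || ordinal < 0 || ordinal >= count()) return nullptr;
        return &entries_[static_cast<std::size_t>(ordinal)];
    }

    // Exact match only (case-sensitive in this mock).
    const Entry* findInDirectoryByName(const std::string& name) const {
        auto it = byName_.find(name);
        return it == byName_.end() ? nullptr : &entries_[it->second];
    }

private:
    std::vector<Entry> entries_;
    std::unordered_map<std::string, std::size_t> byName_;
    bool loaded_ = false;
};
```

Because every entry receives an ordinal while the table is built, ordinals are always known for items found this way, which is consistent with the guarantee stated for readDirectory.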

Performance costs 

The cost of enumerating a container, either through getFirst/getNext calls or by calling readDirectory, can vary greatly depending on the container type. 

Two factors affect the cost: (1) whether the container has an efficient internal directory, and (2) how much work is required to extract the properties of each item in the container. 

PST files are costly to open because they are databases that require reading large dispersed tables to access anything. They are costly to enumerate because each message item is generated from a set of database records and blocks of data rather than existing as a single block of data. 

MBOX, CSV, DBF, ACCDB, and MDB files are quick to open but costly to enumerate because there is no internal directory, so the text of each item must be parsed to determine its properties. 

ZIP, RAR, and TAR files are quick to open and enumerate because they have an efficient internal directory, and item properties can be read from the directory without parsing the entire item. 

Sample code 

See the dtextract sample in examples\cpp\dtextract for sample code.
