dtsContainer Class

API for enumeration and extraction from container file formats.

File

File: ContainerApi.h

dtsContainer Constructor

Syntax

C++

class dtsContainer;

Group

Classes

Members

Methods

Method	Description
close	Closes the currently open container
dtsContainer(const dtsContainer&)	Non-copyable
dtsContainer(dtsContainer&&)	Movable
extractToFile	Extracts the current item to a file
extractToMemory	Extracts the current item to a memory buffer
findInDirectoryByName	Find an item in the directory by name without seeking in the container.
findInDirectoryByOrdinal	Find an item in the directory by zero-based ordinal position without seeking in the container.
getContainerTypeId	Returns the TypeId of the container
getContainerTypeName	Returns the name of the container type (e.g., "ZIP", "RAR", "PST")
getCount	Returns the number of items in the container
getErrorCode	Returns the error code from the last operation, or 0 if no error
getErrorMessage	Returns the error message from the last operation, or nullptr if no error
getFirst	Seeks to the first item in the container
getInfoByIndex	Seeks to an item in the container by index.
getInfoByName	Seeks to an item in the container by name.
getNext	Seeks to the next item in the container
good	Checks if the container is in a valid state
openFile	Opens a container file
openStream	Opens a container from a dtsInputStream
readDirectory	Iterate over the entire container and read the directory of items.

Remarks

The Container API provides access to the dtSearch Engine's container support for content extraction.

Identifying items in a container

Container file parsers generate a name and "index" for each item in a container. Both must be unique within the container. The index can be any 64-bit integer that can be used to efficiently retrieve an item from the container, such as a byte offset or record identifier.

Zero-based container item ordinals are generated by the wrapper class as a convenience. It is possible to retrieve an item from a container in a way that makes the ordinal unknown (e.g., by name or index), in which case the ordinal value will be set to -1. Ordinals are only guaranteed to be known when retrieved through getFirst/getNext, or when readDirectory has been called.

The item count returned by getCount can also be -1 if it is not yet known.
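The -1 convention for unknown ordinals and counts can be sketched as follows. This is only an illustration: ItemInfo, hasKnownOrdinal, and isCountKnown are hypothetical names invented for this example and are not part of ContainerApi.h.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for per-item information; not a dtSearch type.
struct ItemInfo {
    std::string name;       // unique within the container
    std::int64_t index;     // parser-defined 64-bit key, e.g. a byte offset
    std::int64_t ordinal;   // zero-based position, or -1 when unknown
};

// True when the item's position in enumeration order is known.
inline bool hasKnownOrdinal(const ItemInfo& item) {
    return item.ordinal >= 0;
}

// The count follows the same convention: -1 means "not yet known".
inline bool isCountKnown(std::int64_t count) {
    return count >= 0;
}
```

Code that needs an ordinal or a count should check for -1 first, and fall back to enumerating or calling readDirectory when the value is unknown.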

Navigating in a container

These methods navigate in the container, leaving the seek position ready for an extraction operation: getFirst, getNext, getInfoByIndex, getInfoByName.

Enumeration can only be done through getFirst/getNext calls, which read items sequentially from the container. The getInfoByIndex and getInfoByName methods can be used to seek directly to an item in the container, but they do not leave the container in a state that is ready for a getNext call.
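The navigation rules above can be modeled with a small in-memory stand-in. MockContainer and listNames are hypothetical and exist only for this sketch. The method names mirror the ones listed on this page, but the signatures and return types are assumptions, and the real dtsContainer may report a getNext call after a direct seek differently than this mock does.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in that models the navigation rules
// described above. It is not the dtsContainer interface.
class MockContainer {
public:
    explicit MockContainer(std::vector<std::string> names)
        : names_(std::move(names)) {}

    // Start sequential enumeration at the first item.
    bool getFirst() {
        pos_ = 0;
        sequential_ = !names_.empty();
        return sequential_;
    }

    // Advance to the next item; only valid during a getFirst/getNext sequence.
    bool getNext() {
        if (!sequential_ || pos_ + 1 >= names_.size()) {
            sequential_ = false;
            return false;
        }
        ++pos_;
        return true;
    }

    // Direct seek by name: ready for extraction, but enumeration ends here.
    bool getInfoByName(const std::string& name) {
        sequential_ = false;
        for (std::size_t i = 0; i < names_.size(); ++i) {
            if (names_[i] == name) {
                pos_ = i;
                return true;
            }
        }
        return false;
    }

    // Name of the item at the current seek position.
    // Call only after a navigation method has returned true.
    const std::string& currentName() const { return names_[pos_]; }

private:
    std::vector<std::string> names_;
    std::size_t pos_ = 0;
    bool sequential_ = false;
};

// Enumerate every item name with the getFirst/getNext pattern.
inline std::vector<std::string> listNames(MockContainer& c) {
    std::vector<std::string> out;
    for (bool ok = c.getFirst(); ok; ok = c.getNext()) {
        out.push_back(c.currentName());
    }
    return out;
}
```

In this model, resuming enumeration after a direct seek requires starting over with getFirst.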

Directory methods

The readDirectory method reads the entire directory of a container, if it is not already known, by making getFirst/getNext calls. Callers should not assume anything about the container's seek position afterward.

Once the directory is read, you can retrieve items using findInDirectoryByOrdinal and findInDirectoryByName (which does an exact match), without affecting the seek position of the container.
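One way to picture the directory lookups is as a cache filled during enumeration and then queried without touching the container. DirEntry and DirectoryCache are hypothetical types written for this sketch, and the byte-for-byte name comparison is an assumption about what "exact match" means.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical directory entry and cache; these are not dtSearch types.
struct DirEntry {
    std::string name;
    std::int64_t index;   // parser-defined key, e.g. a byte offset
};

class DirectoryCache {
public:
    // Record an entry in enumeration order; its ordinal is its position.
    void add(const DirEntry& e) {
        byName_[e.name] = entries_.size();
        entries_.push_back(e);
    }

    // Zero-based lookup; returns nullptr when the ordinal is out of range.
    // Returned pointers are invalidated by later calls to add().
    const DirEntry* findByOrdinal(std::int64_t ordinal) const {
        if (ordinal < 0 ||
            static_cast<std::size_t>(ordinal) >= entries_.size()) {
            return nullptr;
        }
        return &entries_[static_cast<std::size_t>(ordinal)];
    }

    // Exact match (byte for byte in this sketch); nullptr when not found.
    const DirEntry* findByName(const std::string& name) const {
        auto it = byName_.find(name);
        return it == byName_.end() ? nullptr : &entries_[it->second];
    }

private:
    std::vector<DirEntry> entries_;
    std::unordered_map<std::string, std::size_t> byName_;
};
```

Both lookups are const and read only the cached entries, which is why they cannot disturb a seek position.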

Performance costs

The cost of enumerating a container, either through getFirst/getNext calls or by calling readDirectory, can vary greatly depending on the container type.

Two factors affect the cost: (1) whether the container has an efficient internal directory, and (2) how much work is needed to extract the properties of each item in the container.

PST files are costly to open because they are databases that require reading large dispersed tables to access anything. They are costly to enumerate because each message item is generated from a set of database records and blocks of data rather than existing as a single block of data.

MBOX, CSV, DBF, ACCDB, and MDB files are quick to open but costly to enumerate because each item requires parsing the text of the item to determine its properties, and there is no internal directory.

ZIP, RAR, and TAR files are quick to open and enumerate because they have an efficient internal directory, and item properties can be read from the directory without parsing the entire item.
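The cost notes above can be condensed into a small lookup that a caller might consult before deciding whether to enumerate a container eagerly. CostProfile and costProfileFor are hypothetical helpers, not part of the API. The sketch also assumes the type name strings match the spellings used on this page, which is confirmed here only for "ZIP", "RAR", and "PST".

```cpp
#include <string>

// Hypothetical summary of the cost notes above; not part of the dtSearch API.
struct CostProfile {
    bool costlyToOpen;
    bool costlyToEnumerate;
    bool known;   // false when the type is not covered by the notes above
};

// Map a container type name to the cost characteristics described above.
inline CostProfile costProfileFor(const std::string& typeName) {
    if (typeName == "PST") {
        return {true, true, true};      // database: slow to open and enumerate
    }
    if (typeName == "MBOX" || typeName == "CSV" || typeName == "DBF" ||
        typeName == "ACCDB" || typeName == "MDB") {
        return {false, true, true};     // no internal directory
    }
    if (typeName == "ZIP" || typeName == "RAR" || typeName == "TAR") {
        return {false, false, true};    // efficient internal directory
    }
    return {false, false, false};       // unknown type: make no claim
}
```

For example, a caller could skip an up-front readDirectory call when costlyToEnumerate is true and the items are only needed one at a time.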

Sample code

See the dtextract sample in examples\cpp\dtextract for sample code.

Class Hierarchy

dtsContainer