API for enumeration and extraction from container file formats.
File: ContainerApi.h
| Method | Description |
|---|---|
| | Closes the currently open container |
| | Non-copyable |
| | Movable |
| | Extracts the current item to a file |
| | Extracts the current item to a memory buffer |
| findInDirectoryByName | Finds an item in the directory by name without seeking in the container |
| findInDirectoryOrdinal | Finds an item in the directory by zero-based ordinal position without seeking in the container |
| | Returns the TypeId of the container |
| | Returns the name of the container type (e.g., "ZIP", "RAR", "PST") |
| | Returns the number of items in the container |
| | Returns the error code from the last operation, or 0 if no error |
| | Returns the error message from the last operation, or nullptr if no error |
| getFirst | Seeks to the first item in the container |
| getInfoByIndex | Seeks to an item in the container by index |
| getInfoByName | Seeks to an item in the container by name |
| getNext | Seeks to the next item in the container |
| | Checks if the container is in a valid state |
| | Opens a container file |
| | Opens a container from a dtsInputStream |
| readDirectory | Iterates over the entire container and reads the directory of items |
The Container API provides access to the dtSearch Engine's container support for content extraction.
Identifying items in a container
Container file parsers generate a name and "index" for each item in a container. Both must be unique within the container. The index can be any 64-bit integer that can be used to efficiently retrieve an item from the container, such as a byte offset or record identifier.
Zero-based container item ordinals are generated by the wrapper class as a convenience. It is possible to retrieve an item from a container in a way that makes the ordinal unknown (e.g., by name or index), in which case the ordinal value will be set to -1. Ordinals are only guaranteed to be known when retrieved through getFirst/getNext, or when readDirectory has been called.
The item count can also be -1 if it is not yet known.
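The ordinal and count rules above can be illustrated with a small in-memory stand-in. This is only a sketch: `Entry`, `ItemInfo`, `MockContainer`, and every signature shown are hypothetical and are not the declarations in ContainerApi.h. The mock keeps a list of (name, index) pairs, reports ordinals only for sequential reads, and reports a count of -1 until a sequential pass reaches the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; NOT the ContainerApi.h declarations.
struct Entry {
    std::string name;      // unique within the container
    std::int64_t index;    // parser-defined key, e.g. a byte offset
};

struct ItemInfo {
    std::string name;
    std::int64_t index = 0;
    std::int64_t ordinal = -1;  // -1 means "ordinal unknown"
};

class MockContainer {
public:
    explicit MockContainer(std::vector<Entry> entries) : entries_(std::move(entries)) {
        // Names and indexes must both be unique within a container.
        for (std::size_t i = 0; i < entries_.size(); ++i)
            for (std::size_t j = i + 1; j < entries_.size(); ++j)
                assert(entries_[i].name != entries_[j].name &&
                       entries_[i].index != entries_[j].index);
    }

    // Sequential reads know the ordinal because they count from the start.
    bool getFirst(ItemInfo& out) { pos_ = 0; return readAt(out); }
    bool getNext(ItemInfo& out) {
        if (pos_ < entries_.size()) ++pos_;
        return readAt(out);
    }

    // A direct seek by index finds the item but cannot report its ordinal.
    bool getInfoByIndex(std::int64_t index, ItemInfo& out) const {
        for (const Entry& e : entries_) {
            if (e.index == index) {
                out.name = e.name;
                out.index = e.index;
                out.ordinal = -1;
                return true;
            }
        }
        return false;
    }

    // -1 until a sequential pass has run off the end of the container.
    std::int64_t count() const { return count_; }

private:
    bool readAt(ItemInfo& out) {
        if (pos_ >= entries_.size()) {
            count_ = static_cast<std::int64_t>(entries_.size());
            return false;
        }
        out.name = entries_[pos_].name;
        out.index = entries_[pos_].index;
        out.ordinal = static_cast<std::int64_t>(pos_);
        return true;
    }

    std::vector<Entry> entries_;
    std::size_t pos_ = 0;
    std::int64_t count_ = -1;
};
```

With two entries at indexes 100 and 250, a direct `getInfoByIndex(250, info)` call on this mock fills in the name but leaves `info.ordinal` at -1, while a `getFirst`/`getNext` pass reports ordinals 0 and 1 and makes `count()` return 2 once the pass ends.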
Navigating in a container
These methods navigate in the container, leaving the seek position ready for an extraction operation: getFirst, getNext, getInfoByIndex, getInfoByName.
Enumeration can only be done through getFirst/getNext calls, which read items sequentially from the container. The getInfoByIndex and getInfoByName methods can be used to seek directly to an item in the container, but they do not leave the container in a state that is ready for a getNext call.
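A minimal sketch of the navigation rules above, using a hypothetical in-memory mock rather than the real class: `Item`, `MockContainer`, `extractCurrent`, `currentName`, and all signatures are assumptions made for illustration. The mock models "not ready for a getNext call" by having `getNext` return false after a direct seek until `getFirst` restarts the pass; how the actual library responds in that situation is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the ContainerApi.h declarations.
struct Item {
    std::string name;
    std::string data;
};

class MockContainer {
public:
    explicit MockContainer(std::vector<Item> items) : items_(std::move(items)) {}

    // Start a sequential pass at the first item.
    bool getFirst() {
        sequential_ = true;
        return seekTo(0);
    }

    // Advance the pass. In this mock, a pass interrupted by a direct seek
    // must be restarted with getFirst().
    bool getNext() { return sequential_ && seekTo(pos_ + 1); }

    // Seek directly to an item by exact name; this ends any sequential pass.
    bool getInfoByName(const std::string& name) {
        sequential_ = false;
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i].name == name) return seekTo(i);
        }
        valid_ = false;
        return false;
    }

    // Extract the item selected by the last successful navigation call.
    bool extractCurrent(std::string& out) const {
        if (!valid_) return false;
        assert(pos_ < items_.size());
        out = items_[pos_].data;
        return true;
    }

    std::string currentName() const {
        return valid_ ? items_[pos_].name : std::string();
    }

private:
    bool seekTo(std::size_t i) {
        valid_ = i < items_.size();
        if (valid_) pos_ = i;
        return valid_;
    }

    std::vector<Item> items_;
    std::size_t pos_ = 0;
    bool valid_ = false;
    bool sequential_ = false;
};

// Enumeration: the only way to visit every item is getFirst, then getNext.
std::vector<std::string> enumerateNames(MockContainer& c) {
    std::vector<std::string> names;
    for (bool ok = c.getFirst(); ok; ok = c.getNext()) names.push_back(c.currentName());
    return names;
}
```

In this sketch every successful navigation call leaves an item selected for extraction, but only `getFirst` and `getNext` keep the sequential pass alive. A caller that mixes direct seeks into an enumeration would restart the pass with `getFirst`.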
Directory methods
The readDirectory method reads the entire directory of a container, if it is not already known, by making getFirst/getNext calls. Callers should not assume anything about the seek position of the container afterwards.
Once the directory is read, you can look up items using findInDirectoryOrdinal and findInDirectoryByName (which does an exact match) without affecting the seek position of the container.
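The directory workflow can be sketched with another hypothetical mock. The method names follow the text, but `DirEntry`, `MockContainer`, `seekPosition`, the pointer return types, and all signatures are assumptions, and the exact match is taken here to be case-sensitive. The mock builds its directory with one sequential pass, then answers lookups from the cached directory without moving the seek position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the ContainerApi.h declarations.
struct DirEntry {
    std::string name;
    std::int64_t ordinal;
};

class MockContainer {
public:
    explicit MockContainer(std::vector<std::string> names) : names_(std::move(names)) {}

    bool getFirst() { pos_ = 0; return pos_ < names_.size(); }
    bool getNext() {
        if (pos_ < names_.size()) ++pos_;
        return pos_ < names_.size();
    }
    std::size_t seekPosition() const { return pos_; }

    // Build the directory with one sequential pass, unless it is already known.
    // The pass moves the seek position, so callers must re-seek afterwards.
    void readDirectory() {
        if (haveDirectory_) return;
        std::int64_t ordinal = 0;
        for (bool ok = getFirst(); ok; ok = getNext()) {
            byName_[names_[pos_]] = ordinal;
            directory_.push_back(DirEntry{names_[pos_], ordinal});
            ++ordinal;
        }
        assert(directory_.size() == names_.size());
        haveDirectory_ = true;
    }

    // Directory lookups never touch pos_. Both return nullptr when nothing matches.
    const DirEntry* findInDirectoryOrdinal(std::int64_t ordinal) const {
        if (ordinal < 0 || static_cast<std::size_t>(ordinal) >= directory_.size())
            return nullptr;
        return &directory_[static_cast<std::size_t>(ordinal)];
    }

    // Exact, case-sensitive name match (an assumption of this mock).
    const DirEntry* findInDirectoryByName(const std::string& name) const {
        auto it = byName_.find(name);
        return it == byName_.end() ? nullptr : findInDirectoryOrdinal(it->second);
    }

private:
    std::vector<std::string> names_;
    std::vector<DirEntry> directory_;
    std::map<std::string, std::int64_t> byName_;
    std::size_t pos_ = 0;
    bool haveDirectory_ = false;
};
```

A caller would typically call `readDirectory()` once, re-seek with `getFirst()` or a direct seek before extracting anything, and then use the two find methods freely, since in this sketch they read only the cached directory.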
Performance costs
The cost of enumerating a container, either through getFirst/getNext calls or by calling readDirectory, can vary greatly depending on the container type.
Two factors affect the cost: (1) whether the container has an efficient internal directory, and (2) how much work is needed to extract the properties of each item in the container.
PST files are costly to open because they are databases that require reading large dispersed tables to access anything. They are costly to enumerate because each message item is generated from a set of database records and blocks of data rather than existing as a single block of data.
MBOX, CSV, DBF, ACCDB, and MDB files are quick to open but costly to enumerate because each item requires parsing the text of the item to determine its properties, and there is no internal directory.
ZIP, RAR, and TAR files are quick to open and enumerate because they have an efficient internal directory, and item properties can be read from the directory without parsing the entire item.
Sample code
See the dtextract sample in examples\cpp\dtextract for sample code.