dtSearch Text Retrieval Engine Programmer's Reference
Container File Types

dtSearch indexes some file types, such as ZIP archives (.zip) and Microsoft Access databases (*.mdb), as containers, generating multiple documents for each file.

Container Filenames

When a search retrieves a document stored inside a container, the returned filename describes the path to the document through the containers that hold it. The path consists of the name of the disk file where the container is stored, followed by one or more strings, each identifying an item to be extracted from a container. Each string consists of an ordinal (in hex), a comma, the type id of the container (also in hex), a | delimiter, and a text identifier for the item. The strings are delimited with >. For example, if "smith.doc" is stored as the fourth item in "c:\zips\", the filename would be:

c:\zips\>4,df|smith.doc

Nested containers can result in multiple levels of container expressions in a filename.
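
The filename layout described above can be taken apart with ordinary string handling. The following is a minimal sketch in Python, not part of the dtSearch API: the names ContainerStep and parse_container_filename are hypothetical, and the code assumes that neither > nor | occurs inside an item's text identifier before the delimiters it relies on.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ContainerStep:
    """One container expression from a filename (hypothetical helper type)."""
    ordinal: int      # ordinal of the item, parsed from hex
    type_id: int      # type id of the container, parsed from hex
    identifier: str   # text identifier for the item, e.g. "smith.doc"


def parse_container_filename(name: str) -> Tuple[str, List[ContainerStep]]:
    """Split a container filename into the disk file and its container steps.

    Assumption: '>' separates the disk file and each expression, and the
    first ',' and first '|' in an expression are the format's delimiters.
    """
    disk_file, *parts = name.split(">")
    steps = []
    for part in parts:
        header, bar, identifier = part.partition("|")
        ordinal_hex, comma, type_hex = header.partition(",")
        if not bar or not comma:
            raise ValueError(f"malformed container expression: {part!r}")
        steps.append(
            ContainerStep(int(ordinal_hex, 16), int(type_hex, 16), identifier)
        )
    return disk_file, steps
```

Applied to c:\zips\>4,df|smith.doc, this yields the disk file part c:\zips\ and a single step with ordinal 4, type id 0xdf, and identifier smith.doc. A filename with nested containers yields one step per level, in order from the outermost container inward.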

Container File Types

The file formats that dtSearch treats as containers include:

Microsoft Access (MDB and ACCDB)

MBOX message archives 

Outlook Express DBX 

Outlook PST 

Additionally, files indexed using the Unicode Filtering algorithm, which extracts segments of text from data files in unrecognized binary formats, can be treated as containers if they are longer than the Options.UnicodeFilterBlockSize setting.

Processing Containers with FileConverter

The FileConverter object knows how to extract items from a container, so if you pass in a container filename such as c:\zips\>4,df|smith.doc as the InputFile, FileConverter will extract the file from the ZIP and then apply the conversion to the extracted file. 

You can also use FileConverter to recursively unpack and convert all items in a container, using the dtsConvertInlineContainer flag. This option generates a single output stream from a container file, including items that may be nested many layers within the container, such as a document inside a ZIP file that is inside another ZIP file.
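
To illustrate the idea of flattening nested containers into a single output stream, here is a standard-library analogy in Python. It does not call dtSearch or FileConverter and is not how dtsConvertInlineContainer is implemented. The function names walk_nested_zip and flatten_to_stream, the separator lines, and the simplified container>item path notation are all assumptions made for this sketch, and only ZIP containers are handled.

```python
import io
import zipfile
from typing import Iterator, Tuple


def walk_nested_zip(data: bytes, prefix: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for each non-ZIP item, descending into nested ZIPs.

    The path uses a simplified 'container>item' notation for readability;
    it is not the ordinal,type|name format that dtSearch returns.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            path = f"{prefix}>{info.filename}"
            if zipfile.is_zipfile(io.BytesIO(content)):
                # The item is itself a ZIP: recurse instead of emitting it.
                yield from walk_nested_zip(content, path)
            else:
                yield path, content


def flatten_to_stream(data: bytes, name: str) -> bytes:
    """Concatenate every nested item into one byte stream with separators."""
    out = io.BytesIO()
    for path, content in walk_nested_zip(data, name):
        out.write(f"--- {path} ---\n".encode("utf-8"))
        out.write(content)
        out.write(b"\n")
    return out.getvalue()
```

Calling flatten_to_stream on the bytes of a ZIP file that contains another ZIP file produces one stream in which the inner archive's documents appear alongside the outer archive's documents, each preceded by a separator line naming its path.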