The Unicode Filtering algorithm in the dtSearch Engine can be used to improve text extraction from binary files.
Binary files are files that dtSearch does not recognize as documents. Examples include executable programs, fragments of documents recovered through an undelete process, and blocks of unallocated or recovered data obtained through computer forensics. Content in these files may appear in a variety of formats, such as plain text, Unicode text, or fragments of .doc or .xls files, and many fragments with different encodings may be present in the same binary file.
Indexing such a file as if it were a simple text file would miss most of the content. In contrast to a simple text scan, the dtSearch filtering algorithm scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm can detect sequences of text with different encodings or formats in the same file, so as to better extract text from recovered or corrupt data.
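To make this concrete, the following C# snippet (illustrative only; the file name and fragment contents are hypothetical) builds a small binary file that mixes non-text bytes with a single-byte text fragment and a UTF-16 (Unicode) fragment, the kind of mixed-encoding data the filtering algorithm is designed to handle:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class MixedEncodingSample
{
    static void Main()
    {
        // Non-text filler bytes, standing in for binary or corrupt data.
        var random = new Random(42);
        byte[] garbage = new byte[64];
        random.NextBytes(garbage);

        // A single-byte (ASCII) text fragment and a UTF-16LE ("Unicode") fragment,
        // each surrounded by non-text data in the same file.
        byte[] asciiFragment = Encoding.ASCII.GetBytes("John Smith secret1");
        byte[] unicodeFragment = Encoding.Unicode.GetBytes("Managing Search");

        byte[] fileData = garbage
            .Concat(asciiFragment)
            .Concat(garbage)
            .Concat(unicodeFragment)
            .Concat(garbage)
            .ToArray();

        // sample.bin is a hypothetical file name used only for this illustration.
        File.WriteAllBytes("sample.bin", fileData);
    }
}
```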
In forensic applications, where complete and accurate results are critical, investigators may be reluctant to enable a "filtering" feature out of concern that it will cause them to miss something, even if leaving it off makes indexing slower. In reality, filtering improves completeness and accuracy; without it, investigators will probably miss much of the useful data in the files they are searching.
For example, this is a hex view of how some text might appear in a fragment of a recovered Word document:
All of the useful text actually present is broken up or embedded in garbage data, effectively making it unsearchable. An unfiltered attempt to index this data would find the following words:
The dtSearch filtering algorithm would analyze the data more intelligently, enabling it to
- extract the word secret1 embedded in a long sequence of non-text characters,
- extract and separate the names John and Smith, and
- recognize that the data starting at offset 9C58 looks like Unicode, enabling it to identify the words Managing, Search, etc.
The dtSearch filtering algorithm works by analyzing the patterns of characters in the data. It makes no attempt to analyze the meaning of the language present, so it works as well with Arabic or Russian text, for example, as with English.
To enable the filtering algorithm for all unrecognized file types, set Options.BinaryFiles to dtsoFilterBinaryUnicode.
Files larger than Options.AutoFilterSizeMB will be indexed using the filtering algorithm unconditionally, on the assumption that very large files are likely to be non-document data such as forensically-recovered disk images or slack space. The default value for AutoFilterSizeMB is 2048 (2 gigabytes), which is also the maximum value for this setting.
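As a minimal sketch, assuming the dtSearch .NET API's Options class (with a Save() call to apply the settings) and a BinaryFilesSettings enumeration containing dtsoFilterBinaryUnicode — check both names against your version of the API — enabling the filtering algorithm might look like this:

```csharp
using dtSearch.Engine;

class EnableUnicodeFiltering
{
    static void Main()
    {
        // Options exposes engine-wide settings; Save() applies the changes.
        Options options = new Options();

        // Apply the filtering algorithm to all unrecognized (binary) file types.
        options.BinaryFiles = BinaryFilesSettings.dtsoFilterBinaryUnicode;

        // Files larger than this many megabytes are always indexed with the
        // filtering algorithm; 2048 (2 GB) is the default and the maximum.
        options.AutoFilterSizeMB = 2048;

        options.Save();
    }
}
```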
The following options control the behavior of the filtering algorithm; a combined configuration sketch follows the table. The dtsoUf* values are members of the UnicodeFilterFlags enumeration, which is used in Options.UnicodeFilterFlags.
| Option | Purpose |
|--------|---------|
| UnicodeFilterBlockSize | Large files are divided into chunks of this size during indexing. |
| UnicodeFilterMinTextSize | The number of text characters that must occur consecutively for a block of text to be included. At the default value of 6, a series of 5 text characters surrounded by non-text data would be filtered out. |
| UnicodeFilterRanges | The Unicode subranges that the filtering algorithm should look for. For example, if UnicodeFilterRanges is set to 1 and 8, the filtering algorithm will look for characters from U+0100-U+01FF and U+0800-U+08FF. This helps the algorithm distinguish text from non-text data, but it is only a hint: if the text extraction algorithm detects text in another language with a sufficient level of confidence, it will return that text even if the language was not selected. In .NET and COM, UnicodeFilterRanges is a comma-separated list of integers, each from 0 to 255, indicating the subranges to look for (example: "1,8"). In the C++ API, the ranges are specified as a 256-byte array, with each byte set to a nonzero value to indicate that the corresponding range should be included. |
| UnicodeFilterWordOverlapAmount | Unicode Filtering can automatically break long runs of letters into words each time more than Options.MaxWordLength consecutive letters are found. By default, a word break is inserted and the next word starts with the following character. Set UnicodeFilterWordOverlapAmount, and also set the dtsoUfAutoWordBreakOverlapWords flag in UnicodeFilterFlags, to start the next word before the end of the previous word. For example, suppose the maximum word length is 8 and the run of letters aaaaahiddenaaaaa is found. By default this would be indexed as aaaaahid and denaaaaa, so a search for *hidden* would not find it. With a word overlap of 4, it would be indexed as aaaaahid, ahiddena, denaaaaa, allowing the embedded word "hidden" to be found by a search for *hidden*. |
| dtsoUfExtractAsHtml | Extracting blocks as HTML has no effect on the text that is extracted, but it adds additional information in HTML comments to each extracted block. |
| dtsoUfOverlapBlocks | Overlapping blocks prevents text that crosses a block boundary from being missed in the filtering process. |
| dtsoUfAutoWordBreakByLength | Automatically insert a word break in long sequences of letters. |
| dtsoUfAutoWordBreakByCase | Automatically insert a word break when the letter case changes in a long sequence of letters. |
| dtsoUfAutoWordBreakOnDigit | Automatically insert a word break when a digit follows letters. |
| dtsoUfAutoWordBreakOverlapWords | When a word break is automatically inserted due to dtsoUfAutoWordBreakByLength, overlap the two words generated by the word break. |
| dtsoUfFilterFailedDocs | When a document cannot be indexed due to file corruption or encryption, apply the filtering algorithm to extract text from the file. |
| dtsoUfFilterAllDocs | Ignore file format information and apply Unicode Filtering to all documents. |
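Continuing the same .NET assumptions as the sketch above (and treating UnicodeFilterFlags as a combinable flags enumeration), the options in the table might be combined as follows; the values shown are illustrative rather than recommendations:

```csharp
using dtSearch.Engine;

class ConfigureUnicodeFiltering
{
    static void Main()
    {
        Options options = new Options();

        // Apply Unicode Filtering to unrecognized file types.
        options.BinaryFiles = BinaryFilesSettings.dtsoFilterBinaryUnicode;

        options.UnicodeFilterFlags =
            UnicodeFilterFlags.dtsoUfOverlapBlocks |              // don't lose text at block boundaries
            UnicodeFilterFlags.dtsoUfAutoWordBreakByLength |      // break up long runs of letters
            UnicodeFilterFlags.dtsoUfAutoWordBreakOverlapWords |  // overlap the words generated by the break
            UnicodeFilterFlags.dtsoUfFilterFailedDocs;            // also filter corrupt or encrypted documents

        // Require at least 6 consecutive text characters (the default) before a
        // run of text is included in the output.
        options.UnicodeFilterMinTextSize = 6;

        // Hint that Unicode subranges 1 (U+0100-U+01FF) and 8 (U+0800-U+08FF) are
        // expected; other text is still returned if detected with confidence.
        options.UnicodeFilterRanges = "1,8";

        // With MaxWordLength of 8 and an overlap of 4, the run aaaaahiddenaaaaa from
        // the table above is indexed as aaaaahid, ahiddena, denaaaaa, so a search
        // for *hidden* will match.
        options.MaxWordLength = 8;
        options.UnicodeFilterWordOverlapAmount = 4;

        options.Save();
    }
}
```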
Each block extracted from a file is given a filename based on the original document, the block number, the range of bytes in the file, and the language settings. Example:
This name identifies the 16th block extracted from sample.bin, covering the range of data from offsets 4194303 to 4456704 in the input file. The numbers in parentheses encode the language settings used to extract the text from this block.