dtSearch Text Retrieval Engine Programmer's Reference
Filtering Options

The Unicode Filtering algorithm in the dtSearch Engine can be used to improve text extraction from binary files.

Overview

Binary files are files that dtSearch does not recognize as documents. Examples of binary files include executable programs, fragments of documents recovered through an undelete process, or blocks of unallocated or recovered data obtained through computer forensics. Content in these files may appear in a variety of formats, such as plain text, Unicode text, or fragments of .doc or .xls files. Many different fragments with different encodings may be present in the same binary file. 

Indexing such a file as if it were a simple text file would miss most of the content. In contrast to a simple text scan, the dtSearch filtering algorithm scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm can detect sequences of text with different encodings or formats in the same file, so as to better extract text from recovered or corrupt data. 

In forensic applications, when complete and accurate results are critical, investigators may be reluctant to enable a "filtering" feature out of concern that they will miss something, even if disabling filtering makes indexing slower. In reality, filtering improves completeness and accuracy, and without it investigators will probably miss much of the useful data in the files they are searching. 

For example, this is a hex view of how some text might appear in a fragment of a recovered Word document:

Offset     0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
00009C00  FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF  ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00009C10  FF FF FF FF 73 65 63 72 65 74 31 FF FF FF FF FF  ÿÿÿÿsecret1ÿÿÿÿÿ
00009C20  FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 7F  ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ.
00009C30  FF FF FF 7F EC 37 93 00 00 00 00 00 B2 00 00 00  ÿÿÿ.ì7“.....²...
00009C40  00 00 FF FF FF 4A 6F 68 6E 53 6D 69 74 68 FF FF  ..ÿÿÿJohnSmithÿÿ
00009C50  FF FF 00 00 00 00 00 00 28 00 4D 00 61 00 6E 00  ÿÿ......(.M.a.n.
00009C60  61 00 67 00 69 00 6E 00 67 00 20 00 61 00 6E 00  a.g.i.n.g. .a.n.
00009C70  64 00 20 00 53 00 65 00 61 00 72 00 63 00 68 00  d. .S.e.a.r.c.h.
00009C80  69 00 6E 00 67 00 20 00 54 00 65 00 72 00 61 00  i.n.g. .T.e.r.a.
00009C90  62 00 79 00 74 00 65 00 73 00 20 00 6F 00 66 00  b.y.t.e.s. .o.f.
00009CA0  20 00 54 00 65 00 78 00 74 00 00 00 00 00 00 00   .T.e.x.t.......

All of the useful text actually present is broken up or embedded in garbage data, effectively making it unsearchable. An unfiltered attempt to index this data would find the following words:

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿsecret1ÿÿÿÿÿÿÿÿ, ì7, ÿÿÿJohnSmithÿÿÿÿ, M, a, n, a, …

The dtSearch filtering algorithm would analyze the data more intelligently, enabling it to

  • extract the word secret1 embedded in a long sequence of non-text characters,
  • extract and separate the names John and Smith, and
  • recognize that the data starting at offset 9C58 looks like Unicode, enabling it to identify the words Managing, Search, etc.

The dtSearch filtering algorithm works by analyzing the patterns of characters in the data. It makes no attempt to analyze the meaning of the language present, so it works with Arabic or Russian text, for example, as well as English.
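The idea of scanning for character patterns in more than one encoding can be illustrated with a short Python sketch. This is not the dtSearch algorithm, whose internals are not described here; it is a simplified model that assumes only two encodings (single-byte ASCII and UTF-16LE), a hypothetical minimum run length of 4, and a word break where a capital letter follows lowercase letters. The names extract_words, MIN_RUN, and sample are invented for the illustration, and the sample bytes are abbreviated from the hex view above.

```python
import re

MIN_RUN = 4  # hypothetical threshold for this sketch, not a dtSearch default

# Runs of single-byte letters and digits.
ASCII_RUN = re.compile(rb"[A-Za-z0-9]{%d,}" % MIN_RUN)
# Runs of printable ASCII characters stored as UTF-16LE (character, then 0x00).
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_RUN)
# A word: optional capitals followed by lowercase/digits, or capitals alone.
WORD = re.compile(r"[A-Z]*[a-z0-9]+|[A-Z]+")


def extract_words(data: bytes) -> list:
    """Return words found in data, in file order, from both encodings."""
    runs = []
    for m in ASCII_RUN.finditer(data):
        runs.append((m.start(), m.group().decode("ascii")))
    for m in UTF16LE_RUN.finditer(data):
        runs.append((m.start(), m.group().decode("utf-16-le")))
    words = []
    for _, text in sorted(runs):
        words.extend(WORD.findall(text))
    return words


# Abbreviated version of the bytes shown in the hex view.
sample = (
    b"\xff" * 20 + b"secret1" + b"\xff" * 8
    + b"\x7f\xec\x37\x93" + b"\x00" * 5 + b"\xb2" + b"\x00" * 5
    + b"\xff" * 3 + b"JohnSmith" + b"\xff" * 4
    + b"\x00" * 6 + b"\x28\x00"
    + "Managing and Searching Terabytes of Text".encode("utf-16-le")
    + b"\x00" * 6
)

print(extract_words(sample))
# ['secret1', 'John', 'Smith', 'Managing', 'and', 'Searching',
#  'Terabytes', 'of', 'Text']
```

Because the sketch matches byte patterns rather than dictionary words, the same approach extends to other scripts by widening the character classes, which is the property the paragraph above describes.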

Options

To enable the filtering algorithm for all unrecognized file types, set Options.BinaryFiles to dtsoFilterBinaryUnicode. 

Files larger than Options.AutoFilterSizeMB will be indexed using the filtering algorithm unconditionally, on the assumption that very large files are likely to be non-document data such as forensically-recovered disk images or slack space. The default value for AutoFilterSizeMB is 2048 (2 gigabytes), which is also the maximum value for this setting. 
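How these two settings interact can be modeled in a few lines of Python. This is a reading of the two paragraphs above, not dtSearch code: the function uses_filtering and its parameters are hypothetical names, and treating one megabyte as 1024 * 1024 bytes is an assumption suggested by the "2048 (2 gigabytes)" default.

```python
def uses_filtering(size_bytes: int,
                   recognized_format: bool,
                   filter_binary_unicode: bool,
                   auto_filter_size_mb: int = 2048) -> bool:
    """Model of when a file would be indexed with the filtering algorithm.

    filter_binary_unicode stands in for Options.BinaryFiles being set to
    dtsoFilterBinaryUnicode; auto_filter_size_mb stands in for
    Options.AutoFilterSizeMB.
    """
    size_limit = auto_filter_size_mb * 1024 * 1024  # assumed MB definition
    if size_bytes > size_limit:
        # Very large files are filtered unconditionally.
        return True
    # Otherwise only unrecognized (binary) files are filtered, and only
    # when filtering has been enabled for them.
    return filter_binary_unicode and not recognized_format


# A 3 GB disk image is filtered regardless of the other settings.
print(uses_filtering(3 * 1024**3, recognized_format=True,
                     filter_binary_unicode=False))   # True
# A small unrecognized file is filtered only when the option is enabled.
print(uses_filtering(4096, recognized_format=False,
                     filter_binary_unicode=True))    # True
print(uses_filtering(4096, recognized_format=False,
                     filter_binary_unicode=False))   # False
```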

The following options can be used to control the behavior of the filtering algorithm. The dtsoUf* values are members of the UnicodeFilterFlags enumeration, which is used in Options.UnicodeFilterFlags.

Option
  Purpose

UnicodeFilterBlockSize
  Large files are divided into chunks of this size during indexing.

UnicodeFilterMinTextSize
  This option specifies how many text characters must occur consecutively for a block of text to be included. At the default value, 6, a series of 5 text characters surrounded by non-text data would be filtered out.

UnicodeFilterRanges
  UnicodeFilterRanges indicates the Unicode subranges that the filtering algorithm should look for. For example, if UnicodeFilterRanges is set to 1 and 8, then the filtering algorithm will look for characters from U+0100-U+01FF and U+0800-U+08FF.
  This is used to help the filtering algorithm distinguish text from non-text data. It is only used as a hint in the algorithm, so if the text extraction algorithm detects text in another language with a sufficient level of confidence, it will return that text even if the language was not selected.
  In .NET and COM, UnicodeFilterRanges is a comma-separated list of integers, each from 0 to 255, indicating the Unicode subranges that the filtering algorithm should look for. Example: "1,8"
  In the C++ API, a 256-byte array is used to specify the ranges, with each byte set to a nonzero value to indicate that the corresponding range should be included.

UnicodeFilterWordOverlapAmount
  Unicode Filtering can automatically break long runs of letters into words each time more than Options.MaxWordLength consecutive letters are found. By default, a word break is inserted and the next word starts with the following character. Set UnicodeFilterWordOverlapAmount and also set the dtsoUfAutoWordBreakOverlapWords flag in UnicodeFilterFlags to start the next word before the end of the previous word.
  For example, suppose the maximum word length is set to 8, and the following run of letters is found: aaaaahiddenaaaaa. By default, this would be indexed as aaaaahid and denaaaaa, which means that a search for *hidden* would not find it. With a word overlap of 4, this would be indexed as aaaaahid, ahiddena, and denaaaaa, which would allow the embedded word "hidden" to be found in a search for *hidden*.

dtsoUfExtractAsHtml
  Extracting blocks as HTML has no effect on the text that is extracted, but it adds additional information in HTML comments to each extracted block.

dtsoUfOverlapBlocks
  Overlapping blocks prevents text that crosses a block boundary from being missed in the filtering process.

dtsoUfAutoWordBreakByLength
  Automatically insert a word break in long sequences of letters.

dtsoUfAutoWordBreakByCase
  Automatically insert a word break when a capital letter follows lowercase letters (for example, separating JohnSmith into John and Smith).

dtsoUfAutoWordBreakOnDigit
  Automatically insert a word break when a digit follows letters.

dtsoUfAutoWordBreakOverlapWords
  When a word break is automatically inserted due to dtsoUfAutoWordBreakByLength, overlap the two words generated by the word break.

dtsoUfFilterFailedDocs
  When a document cannot be indexed due to file corruption or encryption, apply the filtering algorithm to extract text from the file.

dtsoUfFilterAllDocs
  Ignore file format information and apply Unicode Filtering to all documents.
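
Two of the options above involve small calculations that are easier to follow in code: the mapping from a UnicodeFilterRanges value to a block of code points, and the effect of UnicodeFilterWordOverlapAmount on a long run of letters. The Python below is an illustrative model only. The function names (range_bounds, parse_ranges, split_long_run) are hypothetical, and the overlap rule (each word starts overlap characters before the end of the previous one) is inferred from the aaaaahiddenaaaaa example rather than taken from dtSearch source.

```python
def range_bounds(index: int) -> tuple:
    """Code point range selected by one UnicodeFilterRanges value (0-255)."""
    if not 0 <= index <= 255:
        raise ValueError("range index must be between 0 and 255")
    return index * 0x100, index * 0x100 + 0xFF


def parse_ranges(spec: str) -> bytearray:
    """Convert a comma-separated list such as "1,8" into a 256-byte table,
    with a nonzero byte marking each selected range."""
    table = bytearray(256)
    for part in spec.split(","):
        index = int(part)
        range_bounds(index)  # validates the index
        table[index] = 1
    return table


def split_long_run(run: str, max_word_length: int, overlap: int = 0) -> list:
    """Break a run of letters into words of at most max_word_length
    characters, optionally overlapping consecutive words."""
    if len(run) <= max_word_length:
        return [run]
    step = max_word_length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_word_length")
    words = []
    start = 0
    while True:
        words.append(run[start:start + max_word_length])
        if start + max_word_length >= len(run):
            break
        start += step
    return words


print([f"U+{lo:04X}-U+{hi:04X}" for lo, hi in map(range_bounds, (1, 8))])
# ['U+0100-U+01FF', 'U+0800-U+08FF']
print(split_long_run("aaaaahiddenaaaaa", 8))
# ['aaaaahid', 'denaaaaa']
print(split_long_run("aaaaahiddenaaaaa", 8, overlap=4))
# ['aaaaahid', 'ahiddena', 'denaaaaa']
```

With the overlap applied, the middle word ahiddena contains the whole of "hidden", which is why a search for *hidden* succeeds in that case.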
Generated filenames

Each block extracted from a file is given a filename based on the original document, the block number, the range of bytes in the file, and the language settings. Example:

sample.bin #16 @4194303 - 4456704 (0, 1, 2)

This name identifies the 16th block extracted from sample.bin, covering the range of data from offsets 4194303 to 4456704 in the input file. The numbers in parentheses encode the language settings used to extract the text from this block.
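
An application that needs to map a search hit back to a byte range in the original file could split such a name into its parts. The Python sketch below assumes that every generated name follows exactly the layout of the single example above; that layout is inferred from the example rather than from a format specification, and the names BlockName and parse_block_name are hypothetical.

```python
import re
from typing import NamedTuple


class BlockName(NamedTuple):
    source: str               # original document name
    block: int                # block number
    start: int                # first byte offset in the input file
    end: int                  # last byte offset in the input file
    language_settings: tuple  # numbers shown in parentheses


# Layout inferred from: "sample.bin #16 @4194303 - 4456704 (0, 1, 2)"
_PATTERN = re.compile(
    r"^(?P<source>.+) #(?P<block>\d+) @(?P<start>\d+) - (?P<end>\d+)"
    r" \((?P<settings>[\d, ]*)\)$"
)


def parse_block_name(name: str) -> BlockName:
    """Split a generated block filename into its components."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a generated block name: {name!r}")
    settings = tuple(
        int(item) for item in m["settings"].split(",") if item.strip()
    )
    return BlockName(
        m["source"], int(m["block"]), int(m["start"]), int(m["end"]), settings
    )


info = parse_block_name("sample.bin #16 @4194303 - 4456704 (0, 1, 2)")
print(info.source, info.block, info.start, info.end, info.language_settings)
# sample.bin 16 4194303 4456704 (0, 1, 2)
```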