Menu option: Options > Preferences > Filtering options

Binary Files
A binary file is a file that
has a format dtSearch cannot recognize and that does not appear to be
a plain text file. Use the "Binary files" setting to specify
whether you want dtSearch to index these files as plain text, skip them
entirely, or filter them so that only their text is indexed. See
"Advanced Filtering Options" below for information on how filtering
is done.
Exclude filter list for new indexes
When an index is created, dtSearch will use this option setting to initialize
the list of filename filters to be excluded
from the index.
Advanced Filtering Options
Binary files are files that dtSearch does not recognize as documents. Examples of binary files include executable programs, fragments of documents recovered through an "undelete" process, or blocks of unallocated or recovered data obtained through computer forensics. Content in these files may be stored in a variety of formats, such as plain text, Unicode text, or fragments of .DOC or .XLS files. Many different fragments with different encodings may be present in the same binary file. Indexing such a file as if it were a simple text file would miss most of the content.
The dtSearch filtering algorithm scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm can detect sequences of text with different encodings or formats in the same file, so it is much better able to extract content from recovered or corrupt data than a simple text scan. Input files can be larger than 4 GB in size. The filtering algorithm is the same one used in the dtSearch ExText utility.
Each binary file is first divided into blocks, and then the text is extracted from each block using the "Advanced filtering options" settings. Each block is given a filename based on the original document, the block number, the range of bytes in the file, and the language settings. Example:
sample.bin #16 @4194303 - 4456704 (0, 1, 2)
This name identifies the 16th block extracted from sample.bin, covering the range of data from offsets 4194303 to 4456704 in the input file. The numbers in parentheses encode the language settings used to extract the text from this block.
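If block names need to be processed by a script, for example to map a search hit back to a byte range in the original file, the naming pattern in the example above can be parsed with a regular expression. The following is a minimal sketch based only on that one example; the BlockName type and parse_block_name function are hypothetical helpers, not part of dtSearch, and the pattern may need adjusting if other names are formatted differently.

```python
import re
from typing import NamedTuple, Tuple


class BlockName(NamedTuple):
    """Fields of a block name (hypothetical helper type, not a dtSearch API)."""
    source: str                      # original file name, e.g. "sample.bin"
    block_number: int                # number after '#'
    start: int                       # first byte offset of the range
    end: int                         # last byte offset of the range
    language_codes: Tuple[int, ...]  # numbers inside the parentheses


# Pattern inferred from the single documented example:
#   sample.bin #16 @4194303 - 4456704 (0, 1, 2)
_BLOCK_NAME = re.compile(
    r"^(?P<source>.+?) #(?P<num>\d+) @(?P<start>\d+) - (?P<end>\d+)"
    r" \((?P<langs>[^)]*)\)$"
)


def parse_block_name(name: str) -> BlockName:
    """Split a block name into its parts, or raise ValueError."""
    match = _BLOCK_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognizable block name: {name!r}")
    langs = tuple(
        int(code) for code in match.group("langs").split(",") if code.strip()
    )
    return BlockName(
        match.group("source"),
        int(match.group("num")),
        int(match.group("start")),
        int(match.group("end")),
        langs,
    )


block = parse_block_name("sample.bin #16 @4194303 - 4456704 (0, 1, 2)")
print(block.source, block.block_number, block.start, block.end)
```

The meaning of the individual language codes is not documented here, so the sketch returns them as plain integers.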
Languages to include
The Languages to include setting helps the filtering algorithm
distinguish text from non-text data. It
is used only as a hint, so if the text extraction algorithm
detects text in another language with a sufficient level of confidence,
it will return that text even if the language was not selected.
Block size
The Block size
setting specifies how each input file is divided into blocks before being
filtered. For
example, if you specify a block size of 100 kilobytes, then a 1000 kilobyte
file would be indexed as 10 separate blocks.
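The arithmetic behind this example can be sketched as follows. The block_count function is a hypothetical helper, and the sketch assumes that a final partial block is indexed as its own, shorter block, which is not stated above.

```python
def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to cover file_size bytes.

    Assumption: a trailing partial block counts as one more block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    # Ceiling division using integers only, so very large files stay exact.
    return -(-file_size // block_size)


KB = 1024
print(block_count(1000 * KB, 100 * KB))  # the 1000 KB file from the example
print(block_count(1050 * KB, 100 * KB))  # a file that ends in a partial block
```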
Overlap blocks
Overlapping blocks prevents text that crosses a block boundary from being
missed in the filtering process. With
overlapping enabled, each block extends 256 characters past the start
of the next block.
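To illustrate why overlapping helps, the sketch below computes byte ranges for a file with and without an overlap, so that text straddling a boundary falls entirely inside at least one block. It is an illustration only: block_ranges is a hypothetical helper, the overlap is measured in bytes rather than characters, and the exact boundaries dtSearch uses may differ.

```python
from typing import Iterator, Tuple


def block_ranges(
    file_size: int, block_size: int, overlap: int = 256
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte ranges, end exclusive.

    Each range runs `overlap` bytes beyond its nominal end, clipped to the
    end of the file. Assumption: the overlap is counted in bytes.
    """
    if block_size <= 0 or overlap < 0:
        raise ValueError("block_size must be positive, overlap non-negative")
    for start in range(0, file_size, block_size):
        yield start, min(start + block_size + overlap, file_size)


# Small numbers keep the effect easy to see.
print(list(block_ranges(1000, 400, overlap=0)))
print(list(block_ranges(1000, 400, overlap=256)))
```

In the second call, a word occupying bytes 395 to 405 lies wholly inside the first range, whereas without overlap it would be split between the first and second blocks.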
Extract blocks as HTML
Extracting blocks as HTML has no effect on the text that is extracted,
but it adds additional information in HTML comments to each extracted
block. The
HTML comments identify the starting byte offset and encoding of each piece
of text extracted from a file. To see the comments, right-click anywhere
in the text of a block that was retrieved in a search and select "View
source".
Minimum text segment size
The minimum text segment size specifies how many text characters must occur
consecutively for a block of text to be included. At
the default value, 6, a series of 5 text characters surrounded by non-text
data would be filtered out.
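The effect of this threshold can be demonstrated with a much simpler scan than the one dtSearch performs. The sketch below treats only printable ASCII bytes as text characters, which is an assumption made for illustration; the actual algorithm detects multiple encodings. The text_segments function is a hypothetical helper.

```python
import re
from typing import List


def text_segments(data: bytes, min_size: int = 6) -> List[str]:
    """Return runs of at least min_size consecutive printable ASCII bytes.

    Simplification: only bytes 0x20 to 0x7E count as text characters.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_size)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


sample = b"\x00hello\x00\x01search\x02"
print(text_segments(sample))              # 5-character run "hello" is dropped
print(text_segments(sample, min_size=5))  # lower threshold keeps both runs
```

A lower threshold recovers more short fragments but also admits more random byte sequences that happen to look like text.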
Allow filter to insert word breaks
The filter can automatically insert word breaks where appropriate (for
example, where a lower-case letter is followed by a capital letter)
and can break up very long consecutive streams of letters.
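The two kinds of break described above can be approximated with regular expressions. This is a sketch of the idea, not the dtSearch implementation: insert_word_breaks is a hypothetical helper, it handles ASCII letters only, and the max_run limit of 32 letters is an arbitrary value chosen for illustration because the actual threshold is not documented here.

```python
import re


def insert_word_breaks(text: str, max_run: int = 32) -> str:
    """Insert spaces at lower-to-upper case changes and inside long letter runs.

    max_run is a hypothetical limit, not a documented dtSearch value.
    """
    if max_run < 1:
        raise ValueError("max_run must be at least 1")

    # 1. Break where a lower-case letter is followed by a capital letter.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

    # 2. Split any remaining run of more than max_run letters into chunks.
    def split_run(match: "re.Match[str]") -> str:
        run = match.group()
        return " ".join(run[i:i + max_run] for i in range(0, len(run), max_run))

    return re.sub(r"[A-Za-z]{%d,}" % (max_run + 1), split_run, text)


print(insert_word_breaks("fileNameHere"))
print(insert_word_breaks("a" * 10, max_run=4))
```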
Use filtering to index corrupt or encrypted documents
Apply the filtering algorithm to attempt to recover text from corrupt or
encrypted documents, instead of just skipping these files during indexing.
(By
default, dtSearch will skip documents that are corrupt or encrypted, and
will report a list of these files in the index update log.)
Use filtering to index all documents
Apply the filtering algorithm to index all documents, whether or not they
appear to have a recognizable file format. This
option is not recommended for most users. It
will cause dtSearch to scan all files for segments of recognizable text,
using the filtering algorithm only. This
type of scan can find data that was intentionally hidden or accidentally
left in documents, such as text in unused streams in Microsoft Word or
Excel files. However,
this type of scan will miss data that is only accessible through a file
format-aware scan of a document, such as compressed data in a PDF file.
Therefore,
this option should only be used in combination with a standard file format-aware
index.
Recognition of Binary Files
dtSearch will apply the binary filtering algorithm to a file that (a) does not match any of the document formats that dtSearch recognizes, and (b) does not appear to be a plain text file. Using the File types settings, you can specify that other files must also be indexed using the binary filtering algorithm. To do this:
1. Click Options > Preferences > File types.
2. Click New... to create a new file type rule, and provide a name for the rule.
3. Under File type, select Filtered Binary.
4. Under Filename filters, enter a filename filter to identify which files the rule will apply to.
5. Check the Override all other file type detection methods for these files box. This will make the rule apply to all files covered by the filename filter, even if they appear to have a recognized format.