Filtering Options

Menu option: Options > Preferences > Filtering options

Binary files
A binary file is a file that has a format dtSearch cannot recognize and that does not appear to be a plain text file. Use the Binary files setting to specify whether you want dtSearch to index these files as plain text, skip them entirely, filter out only the text, or index only the filenames.

Exclude filter list for new indexes
When an index is created, dtSearch uses this setting to initialize the list of filename filters to exclude from the index.

Advanced Filtering Options

Binary files are files that dtSearch does not recognize as documents. Examples of binary files include executable programs, fragments of documents recovered through an "undelete" process, or blocks of unallocated or recovered data obtained through computer forensics.  Content in these files may be stored in a variety of formats, such as plain text, Unicode text, or fragments of .DOC or .XLS files.  Many different fragments with different encodings may be present in the same binary file.  Indexing such a file as if it were a simple text file would miss most of the content.

The dtSearch filtering algorithm scans a binary file for anything that looks like text using multiple encoding detection methods.  The algorithm can detect sequences of text with different encodings or formats in the same file, so it is much better able to extract content from recovered or corrupt data than a simple text scan.
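The idea of finding text with different encodings in the same file can be illustrated with a small sketch. This is not the dtSearch algorithm, whose internals are not described here. It is a simplified stand-in that assumes only two encodings (single-byte ASCII and UTF-16 little-endian) and treats printable ASCII characters as "text". The function name scan_for_text and the run length of 6 are illustrative choices.

```python
import re
from typing import List, Tuple

# Runs of at least 6 printable ASCII bytes.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{6,}")
# Runs of at least 6 printable ASCII characters stored as UTF-16LE
# (each character is followed by a zero byte).
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){6,}")


def scan_for_text(data: bytes) -> List[Tuple[int, str, str]]:
    """Return (byte offset, encoding, text) for each run found, ordered by offset."""
    found = []
    for match in ASCII_RUN.finditer(data):
        found.append((match.start(), "ascii", match.group().decode("ascii")))
    for match in UTF16LE_RUN.finditer(data):
        found.append((match.start(), "utf-16-le", match.group().decode("utf-16-le")))
    return sorted(found)


sample = (
    b"\x00\x01plain text\xff\xfe"
    + "unicode text".encode("utf-16-le")
    + b"\x00\x00\x03"
)
for offset, encoding, text in scan_for_text(sample):
    print(offset, encoding, text)
```

A scan that assumed a single encoding would recover only one of the two fragments in the sample, which is the limitation the paragraph above describes.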

Each binary file is first divided into blocks, and then the text is extracted from each block using the Advanced filtering options settings.  Each block is given a filename based on the original document, the block number, the range of bytes in the file, and the language settings.  Example:

sample.bin #16 @4194303 - 4456704 (0, 1, 2)
 

This name identifies the 16th block extracted from sample.bin, covering the range of data from offsets 4194303 to 4456704 in the input file.  The numbers in parentheses encode the language settings used to extract the text from this block.
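If block names need to be processed outside dtSearch (for example, to locate the original byte range in the source file), a name like the one above can be split into its parts. The sketch below infers the pattern from the single example shown, so other name layouts may not match. The names BlockName and parse_block_name are hypothetical, and the meaning of the individual language codes is not covered here.

```python
import re
from typing import NamedTuple, Tuple


class BlockName(NamedTuple):
    source: str                      # original file name
    block_number: int                # number after '#'
    start_offset: int                # first offset after '@'
    end_offset: int                  # second offset, after '-'
    language_codes: Tuple[int, ...]  # numbers in parentheses


# Pattern inferred from the example "sample.bin #16 @4194303 - 4456704 (0, 1, 2)".
_BLOCK_NAME = re.compile(
    r"^(?P<source>.+?)\s+#(?P<block>\d+)\s+@(?P<start>\d+)\s*-\s*(?P<end>\d+)"
    r"\s+\((?P<langs>[^)]*)\)\s*$"
)


def parse_block_name(name: str) -> BlockName:
    match = _BLOCK_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized block name: {name!r}")
    langs = tuple(
        int(part) for part in match.group("langs").split(",") if part.strip()
    )
    return BlockName(
        match.group("source"),
        int(match.group("block")),
        int(match.group("start")),
        int(match.group("end")),
        langs,
    )


print(parse_block_name("sample.bin #16 @4194303 - 4456704 (0, 1, 2)"))
```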

The options described below apply only to text that is indexed as binary data using the filtering algorithm.  These options have no effect on indexing text in recognized document formats such as Word, Excel, PDF, etc.

Languages to include
The Languages to include setting helps the filtering algorithm distinguish text from non-text data.  It is used only as a hint, so if the text extraction algorithm detects text in another language with a sufficient level of confidence, it will return that text even if the language was not selected.

Block size
The Block size setting specifies how each input file is divided into blocks before being filtered.  For example, if you specify a block size of 100 kilobytes, then a 1000 kilobyte file would be indexed as 10 separate blocks.  Very large block sizes can make extraction of documents slower after a search (because more data has to be extracted to view a block), so block sizes over 1 MB are not recommended.
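The arithmetic in the example can be checked with a short sketch. It assumes 1 kilobyte = 1024 bytes and that a partial final block still counts as a block. Both function names are illustrative only.

```python
from typing import Iterator, Tuple


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to cover file_size bytes (rounding up)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return -(-file_size // block_size)  # ceiling division on integers


def block_ranges(file_size: int, block_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte ranges, end exclusive, without overlap."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, file_size, block_size):
        yield start, min(start + block_size, file_size)


KB = 1024
print(block_count(1000 * KB, 100 * KB))            # 10
print(list(block_ranges(1000 * KB, 100 * KB))[:2])
```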

Overlap blocks
Overlapping blocks prevents text that crosses a block boundary from being missed in the filtering process.  With overlapping enabled, each block extends 256 characters past the start of the next block.
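The effect of overlapping can be seen in a small sketch. It assumes the overlap is a fixed number of bytes added to the end of each block, so that the block runs into its neighbor, and it measures the overlap in bytes rather than characters. The function name overlapping_ranges is hypothetical, and the demonstration uses a tiny block size and overlap for readability.

```python
from typing import List, Tuple


def overlapping_ranges(
    file_size: int, block_size: int, overlap: int = 256
) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges, end exclusive, each extended by `overlap`."""
    if block_size <= 0 or overlap < 0:
        raise ValueError("block_size must be positive and overlap non-negative")
    ranges = []
    for start in range(0, file_size, block_size):
        # Extend the block past its nominal end, but never past the end of the file.
        end = min(start + block_size + overlap, file_size)
        ranges.append((start, end))
    return ranges


# "keyword" straddles the boundary at byte 100 (it occupies bytes 96-102).
data = b"." * 96 + b"keyword" + b"." * 97

plain = [data[s:e] for s, e in overlapping_ranges(len(data), 100, overlap=0)]
extended = [data[s:e] for s, e in overlapping_ranges(len(data), 100, overlap=16)]

print(any(b"keyword" in block for block in plain))     # False: split across blocks
print(any(b"keyword" in block for block in extended))  # True: whole in block 0
```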

Extract blocks as HTML
Extracting blocks as HTML has no effect on the text that is extracted, but it adds information in HTML comments to each extracted block.  The HTML comments identify the starting byte offset and encoding of each piece of text extracted from a file. To see the comments, right-click anywhere in the text of a block that was retrieved in a search and select "View source".

Minimum size of text segments
The minimum text segment size specifies how many text characters must occur consecutively for a block of text to be included. At the default value of 6, a series of 5 text characters surrounded by non-text data would be filtered out.
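The threshold behaves much like the minimum length in a "strings"-style scan. The sketch below is a simplified illustration, not the dtSearch filter: it assumes that "text characters" means printable ASCII, and the function name text_segments is hypothetical.

```python
import re
from typing import List


def text_segments(data: bytes, min_length: int = 6) -> List[str]:
    """Return runs of printable ASCII that are at least min_length characters long."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


raw = b"\x00\x01hello\x00\x02search\xff"
print(text_segments(raw))                # ['search']  ("hello" has only 5 characters)
print(text_segments(raw, min_length=5))  # ['hello', 'search']
```

Lowering the minimum recovers shorter fragments but also admits more random byte sequences that happen to look like text.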

Allow filter to insert word breaks
The filter can automatically insert word breaks where appropriate (for example, where there is a lower-case letter followed by a capital letter) and break up very long consecutive streams of letters.
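The two cases mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the filter's actual rules: it handles only unaccented Latin letters, and the max_run threshold of 32 is a made-up value, since the length at which long streams are broken is not given here.

```python
import re


def insert_word_breaks(text: str, max_run: int = 32) -> str:
    """Insert spaces at lower-to-upper case transitions and inside very long letter runs."""
    if max_run < 1:
        raise ValueError("max_run must be at least 1")

    # Case 1: a lower-case letter immediately followed by a capital letter.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

    # Case 2: a run of letters longer than max_run is cut into max_run-sized pieces.
    def split_run(match: "re.Match[str]") -> str:
        run = match.group()
        return " ".join(run[i:i + max_run] for i in range(0, len(run), max_run))

    return re.sub(r"[A-Za-z]{%d,}" % (max_run + 1), split_run, text)


print(insert_word_breaks("invoiceTotalAmount"))  # invoice Total Amount
print(insert_word_breaks("a" * 10, max_run=4))   # aaaa aaaa aa
```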

Use filtering to index corrupt or encrypted documents
This option applies the filtering algorithm to attempt to recover text from corrupt or encrypted documents, instead of just skipping these files during indexing.   (By default, dtSearch will skip documents that are corrupt or encrypted, and will report a list of these files in the index update log.  Only unencrypted text will be recovered from encrypted documents.)  

Use filtering to index all documents
This option applies the filtering algorithm to index all documents, whether or not they appear to have a recognizable file format.  This option is not recommended for most users.  It will cause dtSearch to scan all files for segments of recognizable text, using the filtering algorithm only.  This type of scan can find data that was intentionally hidden or accidentally left in documents, such as text in unused streams in Microsoft Word or Excel files.  However, this type of scan will miss data that is only accessible through a file format-aware scan of a document, such as compressed data in a PDF file.  Therefore, this option should only be used in combination with a standard file format-aware index.