Extext - dtSearch Text Extraction Utility

 

Extext.exe is a tool for extracting text from large binary files, such as undeleted data recovered from a hard disk.  It converts the input data to a series of small Unicode text or HTML files containing extracted text.  Extext assumes that the input data consists of fragments of files rather than a single complete document and so looks for sequences of data that appear to be text, Unicode text, or UTF-8 text.
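
As a rough illustration of what "looking for sequences of data that appear to be text" can mean, the following Python sketch finds runs of printable ASCII bytes in a binary buffer. It is not Extext's algorithm; the function name find_ascii_runs and the minimum run length of 6 are assumptions chosen for the example.

```python
import re

# Runs of at least 6 printable ASCII bytes (space through tilde).
# The threshold of 6 is an arbitrary choice for this sketch.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7E]{6,}")

def find_ascii_runs(data: bytes):
    """Return (byte_offset, text) pairs for each printable ASCII run."""
    return [(m.start(), m.group().decode("ascii"))
            for m in PRINTABLE_RUN.finditer(data)]

sample = b"\x00\x01Hello, world\xff\xfe\x00abc\x00"
print(find_ascii_runs(sample))  # [(2, 'Hello, world')]
```

Short runs such as "abc" are skipped because random binary data often contains a few printable bytes in a row.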

Input Files
Use Add Files to select one or more binary files to process.  Add folder will add all files in a folder tree.  You can also drag and drop files onto the Extext dialog box.  Each file can be up to 2 GB in size.  There is no limit on the total size of the input files.

 

Output folder for extracted text files
Extracted text files will be written to this folder.  Each output file will be named after the input file, with a number appended to the end.

 

Input chunk size (KB)
The input chunk size controls how many files will be created from each input file.  For example, if the input is a single 500 MB binary file, and the input chunk size is 1024 KB (1 MB), then 500 output files will be created, one for each megabyte of the input.
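
The arithmetic in this example can be sketched in Python as follows. The function name output_file_count is hypothetical, and the sketch assumes that every chunk produces one output file and that a partial final chunk counts as one more file; this page does not say how Extext handles either case.

```python
def output_file_count(input_size_bytes: int, chunk_size_kb: int) -> int:
    """Estimate the number of output files for one input file.

    Assumes one output file per chunk, rounding up for a partial last chunk.
    """
    chunk_bytes = chunk_size_kb * 1024
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-input_size_bytes // chunk_bytes)

# A 500 MB input with a 1024 KB chunk size.
print(output_file_count(500 * 1024 * 1024, 1024))  # 500
```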

 

Type of output to create
Filtered text can be written either as Unicode Text files or as HTML files. Both formats can hold Unicode data.  HTML files include, in front of each extracted sequence of text, an HTML comment identifying where in the input file the data was found, and how it was stored in the original.  Example:

 

<!-- @00072a5c Unicode--> New Zealand

 

This comment indicates that the text "New Zealand" was stored as Unicode and was found at hexadecimal byte offset 72a5c in the original data.  Because this information is stored in a comment, it is not visible when you open the HTML file in a browser, and it will not affect indexing or searching.  To see the HTML comments, open the HTML file in a text editor such as Notepad.
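
If you need to read these comments programmatically, a short Python sketch such as the following may help. The pattern is inferred only from the single example shown on this page, so the exact comment layout and the set of possible storage labels are assumptions; parse_extext_comments is a hypothetical helper name.

```python
import re

# Inferred from the example "<!-- @00072a5c Unicode-->": an "@", a hexadecimal
# offset, then a label describing how the text was stored.
COMMENT_RE = re.compile(r"<!--\s*@([0-9A-Fa-f]+)\s+(.*?)\s*-->")

def parse_extext_comments(html: str):
    """Return (byte_offset, storage_label) pairs for each matching comment."""
    return [(int(m.group(1), 16), m.group(2))
            for m in COMMENT_RE.finditer(html)]

sample = "<!-- @00072a5c Unicode--> New Zealand"
print(parse_extext_comments(sample))  # [(469596, 'Unicode')]
```

The offset is converted from hexadecimal, so 00072a5c becomes 469596 in decimal.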

 

The "No filtering" option lets you use Extext as a simple file splitter.  It will break each file into smaller chunks according to the input chunk size, without modifying the data in any way.
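
For comparison, the behavior described here resembles the following minimal Python file splitter. This is a sketch only, not how Extext is implemented; the function name split_file and the output naming pattern (input name plus a four-digit number) are assumptions made for the example.

```python
from pathlib import Path

def split_file(src: Path, out_dir: Path, chunk_size_kb: int = 1024) -> list[Path]:
    """Split src into chunk-sized pieces without modifying the data."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_bytes = chunk_size_kb * 1024
    written = []
    with src.open("rb") as f:
        index = 1
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:  # end of input
                break
            # Hypothetical naming scheme: input file name plus a number.
            target = out_dir / f"{src.name}.{index:04d}"
            target.write_bytes(chunk)
            written.append(target)
            index += 1
    return written
```

Concatenating the pieces in order reproduces the original file byte for byte.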

 

Languages to include
Selecting languages to include in the filtering helps Extext to separate valid Unicode text from random binary data.  For example, if you select "Arabic", Extext will look for sequences of Unicode characters in the 0x0600-0x06ff range.  The language selections guide the search for Unicode text, but Extext may still report some text in other languages that appears to be present in the input data.
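
The idea of using a character range to pick out Unicode text can be sketched in Python as below. This is an illustration under stated assumptions, not Extext's method: it assumes the text is stored as little-endian UTF-16, tries both byte alignments because a fragment may start at an odd offset, and uses a hypothetical function name and a minimum run length of 4.

```python
def find_utf16le_runs(data: bytes, lo: int, hi: int, min_chars: int = 4):
    """Return sorted (byte_offset, text) pairs for runs of UTF-16LE
    code units whose values fall in the range lo..hi."""
    runs = []
    for align in (0, 1):  # fragments may begin at an even or odd offset
        start, chars = None, []
        for pos in range(align, len(data) - 1, 2):
            unit = data[pos] | (data[pos + 1] << 8)  # little-endian 16-bit value
            if lo <= unit <= hi:
                if start is None:
                    start = pos
                chars.append(chr(unit))
            else:
                if len(chars) >= min_chars:
                    runs.append((start, "".join(chars)))
                start, chars = None, []
        if len(chars) >= min_chars:  # run that reaches the end of the data
            runs.append((start, "".join(chars)))
    return sorted(runs)

# Five Arabic letters embedded in binary data at an odd offset.
word = "\u0645\u0631\u062d\u0628\u0627"
blob = b"\x00\x01\xff" + word.encode("utf-16-le") + b"\x00\x00"
print(find_utf16le_runs(blob, 0x0600, 0x06FF) == [(3, word)])  # True
```

Restricting the search to a selected range reduces false matches, because arbitrary byte pairs rarely form long runs inside one narrow block of code points.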