File Segmentation

Menu option: Options > Preferences > File segmentation

The File Segmentation Rules dialog box lets you tell dtSearch to index certain text files as many subdocuments instead of treating each file as a single large document.

The maximum supported size for each individual segment is 16 MB.

You can set up any number of rules specifying how groups of files will be subdivided.  Each rule includes the following elements:

Rule name
The name of a rule is used only to identify it in the File Segmentation Rules dialog box.

New document starts at
This is a marker that indicates where a new document begins.  For email message files, this is often part of a message header such as "Date:" or "From:".  To avoid incorrectly splitting a message, choose a marker that is as distinctive as possible, so that it is unlikely to appear in the body of a document.

How to check for document boundaries in text
Each line of a file will be compared against the marker under New document starts at.  Three types of comparison are available:

Require exact match -- The entire line must exactly match the marker.

Match start of line -- The start of the line must match the marker.

Match regular expression -- The marker is interpreted as a regular expression.  A document boundary occurs when the marker is found anywhere in a line.  To require a marker to begin at the start of a line, precede it with the ^ character.  

Ignore case
Match a document boundary even if the capitalization does not match.
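
The comparison types and the Ignore case option can be illustrated with a minimal Python sketch.  This is not dtSearch code: the function name is_boundary and the mode names "exact", "start", and "regex" are hypothetical, and Python's re module stands in for the regular expression syntax dtSearch uses, which may differ.

```python
import re


def is_boundary(line: str, marker: str, mode: str = "start",
                ignore_case: bool = False) -> bool:
    """Return True if `line` would start a new subdocument.

    Hypothetical helper for illustration only; `mode` is one of
    "exact", "start", or "regex".
    """
    # Compare against the line without its trailing newline.
    text = line.rstrip("\r\n")

    if mode == "regex":
        # The marker is a pattern and may match anywhere in the line,
        # unless the pattern itself is anchored with ^.
        flags = re.IGNORECASE if ignore_case else 0
        return re.search(marker, text, flags) is not None

    if ignore_case:
        text, marker = text.casefold(), marker.casefold()

    if mode == "exact":
        # The entire line must equal the marker.
        return text == marker
    if mode == "start":
        # Only the beginning of the line must equal the marker.
        return text.startswith(marker)
    raise ValueError(f"unknown mode: {mode!r}")
```

In this sketch, the marker From: with the "regex" mode would also match a line such as "Reply From: x", while the pattern ^From: would not, which is why anchoring with ^ helps avoid false boundaries.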

First segment in a file is header for other segments
Check this box to have dtSearch repeat the first segment of a file in every following segment.  This option is useful when segmenting XML or HTML files, because it allows the HTML or XML header to be repeated for each segment.
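
The effect of this option can be sketched in Python as follows.  This is an illustration under stated assumptions, not the dtSearch implementation: the function names split_segments and with_header are hypothetical, text before the first marker is assumed to form the first segment, and the header segment is assumed to remain a segment of its own.

```python
from typing import Callable, Iterable, List


def split_segments(lines: Iterable[str],
                   is_start: Callable[[str], bool]) -> List[List[str]]:
    """Split lines into segments, starting a new one at each boundary line."""
    segments: List[List[str]] = [[]]
    for line in lines:
        # Open a new segment at a boundary, unless the current one is empty
        # (this avoids an empty first segment when the file starts with a marker).
        if is_start(line) and segments[-1]:
            segments.append([])
        segments[-1].append(line)
    return [seg for seg in segments if seg]


def with_header(segments: List[List[str]]) -> List[List[str]]:
    """Prepend the first segment to every following segment."""
    if not segments:
        return []
    header, rest = segments[0], segments[1:]
    # Assumption: the header is kept as its own segment as well.
    return [header] + [header + seg for seg in rest]
```

For an HTML file whose records each begin with a marker line, split_segments would yield the page header followed by one segment per record, and with_header would give each record its own copy of that header.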

Filename filters
For each rule, a filename filter determines which files the rule applies to.  If more than one rule could apply to a particular file, the first one to match the filename is the one applied.
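
The first-match behavior can be sketched in Python as follows.  The Rule type and pick_rule function are hypothetical, and the sketch assumes that filename filters are wildcard patterns such as *.mbx and that matching ignores case; neither detail is confirmed by this page.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    """Hypothetical stand-in for one file segmentation rule."""
    name: str
    filename_filter: str  # assumed to be a wildcard pattern, e.g. "*.mbx"
    marker: str


def pick_rule(filename: str, rules: Sequence[Rule]) -> Optional[Rule]:
    """Return the first rule whose filter matches `filename`, or None."""
    for rule in rules:
        # Lowercasing both sides makes the comparison case-insensitive
        # (an assumption, not documented behavior).
        if fnmatchcase(filename.lower(), rule.filename_filter.lower()):
            return rule
    return None
```

Because the first matching rule wins, a narrow filter such as archive*.mbx would need to be listed before a broader filter such as *.mbx to take effect.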

Documents processed with file segmentation must be text, XML, or HTML files.  If you use file segmentation with XML or HTML files, check the First segment in a file is header for other segments box to make sure that the XML or HTML header is repeated for each segment.

In search results, each subdocument in a segmented document will have a name that identifies the location of the subdocument in its disk file.


Copyright © 1991-2021 dtSearch Corp. All Rights Reserved.