dtSearch Support
Last Reviewed: November 14, 2007
Article: DTS0170
Applies to: dtSearch 6.04 or later
Long text files often consist of many subsections that can be considered to be separate documents. For example, some email programs keep messages in a single long text file, with a marker such as "Date:" or a line of dashes separating the messages. Also, XML files with data converted from a database table will typically contain a series of records. The File Segmentation Rules feature in dtSearch provides a way to tell dtSearch to index each message or record as a separate document, without breaking up the original text file into numerous tiny files.
File Segmentation Rules work with text files, HTML, and XML. PDF files and word processor files such as Word or WordPerfect documents cannot be segmented using this method.
To set up a file segmentation rule, click Options > Preferences in dtSearch and select the "File Segmentation Rules" tab. Each rule has the following parts:
Name
The name of a rule is used only to identify it in the File Segmentation Rules dialog box.
New document starts at
This is a marker that indicates where a new document begins. For email message files, this is often part of a message header such as "Date:" or "From:". To avoid incorrectly splitting a message, choose a marker that is as distinctive as possible, so that it is unlikely to appear anywhere other than at a document boundary.
How to check for document boundaries in text
Each line of the files a rule applies to will be compared against the marker under New document starts at. Three types of comparison are available:
Require exact match: The entire line must exactly match the marker.
Match start of line: The start of the line must match the marker.
Match regular expression: The marker is interpreted as a regular expression. A document boundary occurs when the marker is found anywhere in a line. To require a marker to begin at the start of a line, precede it with the ^ character.
Ignore case
Match a document boundary even if the capitalization does not match.
First segment in a file is header for other segments
Check this box to have dtSearch insert the first segment in a file at the start of every following segment. This option is useful when segmenting XML or HTML files, because it allows the HTML or XML header to be repeated for each segment.
Filename filters
For each rule, a filename filter determines which files the rule applies to. If more than one rule could apply to a particular file, the first one to match the filename is the one applied. A rule that does not have a filter will be ignored.
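The rule parts described above can be illustrated with a short Python sketch. This is not dtSearch code and does not call the dtSearch Engine; it only models the behavior described in this article. The function names is_boundary and select_rule, the mode names "exact", "start", and "regex", the dictionary layout of a rule, and the use of shell-style wildcards for filename filters are all assumptions made for illustration.

```python
import re
from fnmatch import fnmatch


def is_boundary(line, marker, mode, ignore_case=False):
    """Return True if `line` would start a new document.

    mode is a hypothetical label for the three comparison types:
    "exact" (Require exact match), "start" (Match start of line),
    "regex" (Match regular expression).
    """
    text = line.rstrip("\r\n")  # compare without the line terminator
    if mode == "regex":
        flags = re.IGNORECASE if ignore_case else 0
        # A regular-expression marker may be found anywhere in the line,
        # unless the expression itself begins with ^.
        return re.search(marker, text, flags) is not None
    if ignore_case:
        text, marker = text.lower(), marker.lower()
    if mode == "exact":
        return text == marker
    if mode == "start":
        return text.startswith(marker)
    raise ValueError(f"unknown comparison mode: {mode}")


def select_rule(filename, rules):
    """Return the first rule whose filter matches `filename`, or None.

    Each rule is a dict with "name" and "filter" keys (an assumed layout).
    Rules with an empty filter are skipped, mirroring "a rule that does
    not have a filter will be ignored".
    """
    for rule in rules:
        pattern = rule.get("filter")
        if pattern and fnmatch(filename.lower(), pattern.lower()):
            return rule
    return None
```

For example, is_boundary("Sent Date: Monday", "Date:", "regex") is true because the marker appears somewhere in the line, while the marker "^Date:" would not match that line. This is why a regular-expression marker intended for message headers usually needs the ^ prefix.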
If you use a file segmentation rule with XML or HTML files, check the First segment in a file is header for other segments box to make sure that the XML or HTML header is repeated for each segment. Otherwise, each segment after the first will lack the <html> or <?xml> header that is necessary for correct identification of the file type.
Segment markers for XML and HTML are based on the raw text in the file, including tags and comments. For example, suppose you have an HTML file that looks like this:
<html>
<body>
<h3>Sample File</h3>
<!-- Segment -->
<p>This is the first segment</p>
<!-- Segment -->
<p>This is the second segment</p>
<!-- Segment -->
<p>This is the third segment</p>
You could use "<!-- Segment -->" as the marker separating segments, even though these comments are not visible when you open an HTML file in dtSearch. With this marker, and with the option set to make the first segment a header for the other segments, dtSearch would index the HTML file as three separate HTML documents:
<html>
<body>
<h3>Sample File</h3>
<!-- Segment -->
<p>This is the first segment</p>
<html>
<body>
<h3>Sample File</h3>
<!-- Segment -->
<p>This is the second segment</p>
<html>
<body>
<h3>Sample File</h3>
<!-- Segment -->
<p>This is the third segment</p>
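The splitting shown above can be reproduced with a small Python sketch. It is only a model of the behavior described in this article, not the dtSearch implementation, and it assumes the "Require exact match" comparison. The names segment_lines and SAMPLE are hypothetical.

```python
def segment_lines(lines, marker, first_is_header=False):
    """Split `lines` into segments, starting a new one at each marker line.

    If first_is_header is True, the first segment is prepended to every
    later segment and is not returned as a document of its own.
    """
    segments = []
    current = []
    for line in lines:
        # An exact match on the whole line begins a new segment.
        if line == marker and current:
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)

    if first_is_header and len(segments) > 1:
        header, rest = segments[0], segments[1:]
        return [header + seg for seg in rest]
    return segments


# The sample HTML file from this article, one list item per line.
SAMPLE = [
    "<html>",
    "<body>",
    "<h3>Sample File</h3>",
    "<!-- Segment -->",
    "<p>This is the first segment</p>",
    "<!-- Segment -->",
    "<p>This is the second segment</p>",
    "<!-- Segment -->",
    "<p>This is the third segment</p>",
]

documents = segment_lines(SAMPLE, "<!-- Segment -->", first_is_header=True)
for doc in documents:
    print("\n".join(doc))
    print()
```

Without the header option, the same call yields four segments, and the three that follow the header contain no <html> tag. That is the situation the header option is meant to avoid.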
File segmentation rules are stored in an XML file named fileseg.xml. You can create this file using the dtSearch Desktop GUI or by writing the XML in a program. To tell the dtSearch Engine to use file segmentation rules in your application, use the SegmentationRulesFile property of the Options object (.NET, COM, or Java), or the segmentationRulesFile member of dtsOptions (C++).
To obtain the text of a segment, you can use the FileConverter object or DFileConvertJob and pass in the name of a segment, as retrieved in a search, as the filename.