How to segment long text, HTML, or XML files during indexing

Last Reviewed: August 5, 2013

Article: DTS0170

 

Applies to: dtSearch Desktop; dtSearch Engine

Long text files often consist of many subsections that can be considered to be separate documents, such as text report output with many pages in a standardized format.  XML files with data converted from a database table will typically contain a series of records. The File Segmentation Rules feature in dtSearch provides a way to tell dtSearch to index each item or record as a separate document, without breaking up the original text file into numerous tiny files.

File Segmentation Rules work with text files, HTML, and XML.  PDF files and word processor files such as Word or WordPerfect documents cannot be segmented using this method.

dtSearch automatically recognizes MBOX email archives, so File Segmentation Rules are not needed for MBOX archives.

Creating File Segmentation Rules

To set up a file segmentation rule, click Options > Preferences in dtSearch and select the "File Segmentation Rules" tab. Each rule has the following parts:

Name
The name of a rule is used only to identify it in the File Segmentation Rules dialog box.

New document starts at
This is a marker that indicates where a new document begins. To avoid incorrectly splitting a document, choose a marker that is as distinctive as possible, so that it is unlikely to appear anywhere except at a document boundary.

How to check for document boundaries in text
Each line of a file that the rule applies to is compared against the marker entered under New document starts at. Three types of comparison are available:

Ignore case
Match a document boundary even if the capitalization does not match.

First segment in a file is header for other segments
Check this box to have dtSearch insert the first segment in a file in every following segment. This option is useful when segmenting XML or HTML files, because it allows the HTML or XML header to be repeated for each segment.

Filename filters
For each rule, a filename filter determines which files the rule applies to. If more than one rule could apply to a particular file, the first one to match the filename is the one applied. A rule that does not have a filter will be ignored.
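
The rule-selection behavior described above can be illustrated with a short Python sketch. This is not dtSearch code. The rule dictionaries, the pick_rule function, the wildcard filter syntax, and the case-insensitive filename comparison are all assumptions made for illustration. Only the first-match and ignore-if-no-filter behavior comes from the description above.

```python
from fnmatch import fnmatch


def pick_rule(filename, rules):
    """Return the first rule whose filter matches filename, or None.

    Hypothetical model: each rule is a dict with a "name" and a list of
    wildcard "filters". Rules are checked in order, and the first match wins.
    """
    for rule in rules:
        filters = rule.get("filters", [])
        if not filters:
            continue  # a rule without a filter is ignored
        # Assumption: filters are compared without regard to case.
        if any(fnmatch(filename.lower(), p.lower()) for p in filters):
            return rule
    return None


rules = [
    {"name": "Reports", "filters": ["report*.txt"]},
    {"name": "All text", "filters": ["*.txt"]},
    {"name": "No filter", "filters": []},
]

print(pick_rule("report_2013.txt", rules)["name"])  # Reports
print(pick_rule("notes.txt", rules)["name"])        # All text
print(pick_rule("data.xml", rules))                 # None
```

Because the first matching rule is applied, a more specific filter such as report*.txt only takes effect if its rule is listed before a broader one such as *.txt.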

Segmenting XML and HTML

If you use a File Segmentation Rule with XML or HTML files, check the First segment in a file is header for other segments box so that the XML or HTML header is included in each segment. Otherwise, each segment will lack the <html> or <?xml> header that is necessary for correct identification of the file type.

Segment markers for XML and HTML are based on the raw text in the file, including tags and comments. For example, suppose you have an HTML file that looks like this:

<html>

<body>

<h3>Sample File</h3>

<!-- Segment -->

<p>This is the first segment</p>

<!-- Segment -->

<p>This is the second segment</p>

<!-- Segment -->

<p>This is the third segment</p>

You could use "<!-- Segment -->" as the marker separating segments, even though these comments are not visible when you open an HTML file in dtSearch. Using this marker, and setting the option to make the first segment a header for the other segments, dtSearch would index the HTML file as three separate HTML documents:

 

<html>

<body>

<h3>Sample File</h3>

<!-- Segment -->

<p>This is the first segment</p>

 

<html>

<body>

<h3>Sample File</h3>

<!-- Segment -->

<p>This is the second segment</p>

 

<html>

<body>

<h3>Sample File</h3>

<!-- Segment -->

<p>This is the third segment</p>
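
The splitting shown above can be modeled with a minimal Python sketch. It is not dtSearch's implementation, and the segment function and its parameters are hypothetical names. It assumes that a line counts as a boundary when it contains the marker, which may differ from the comparison types dtSearch offers. It also assumes that when the first segment is used as a header, it is prepended to each later segment rather than indexed on its own, as in the example above.

```python
def segment(lines, marker, ignore_case=False, first_is_header=False):
    """Split lines into segments, each starting at a line containing marker."""

    def is_boundary(line):
        # Assumption: a boundary is any line that contains the marker text.
        if ignore_case:
            return marker.lower() in line.lower()
        return marker in line

    segments = [[]]
    for line in lines:
        # Start a new segment at each boundary, unless nothing precedes it.
        if is_boundary(line) and segments[-1]:
            segments.append([])
        segments[-1].append(line)
    segments = [s for s in segments if s]

    if first_is_header and len(segments) > 1:
        # Repeat the first segment (the header) in front of every other one.
        header, rest = segments[0], segments[1:]
        return [header + s for s in rest]
    return segments


sample = """<html>
<body>
<h3>Sample File</h3>
<!-- Segment -->
<p>This is the first segment</p>
<!-- Segment -->
<p>This is the second segment</p>
<!-- Segment -->
<p>This is the third segment</p>""".splitlines()

docs = segment(sample, "<!-- segment -->", ignore_case=True, first_is_header=True)
print(len(docs))  # 3
print("\n".join(docs[1]))
```

In this sketch the marker is given in lower case, so the boundaries are found only because ignore_case is set. With a case-sensitive comparison the same call would return the whole file as a single segment.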

Developer Information

File segmentation rules are stored in an XML file named fileseg.xml.  You can create this file using the dtSearch Desktop GUI or by writing the XML in a program. To tell the dtSearch Engine to use file segmentation rules in your application, use the SegmentationRulesFile property of the Options object (.NET, COM, or Java), or the segmentationRulesFile member of dtsOptions (C++).

To obtain the text of a segment, you can use the FileConverter object or DFileConvertJob and pass in the name of a segment, as retrieved in a search, as the filename.