How to index attachments separately from email messages

Article: dts0219

Applies to: dtSearch 7.52 and later

Normally, dtSearch indexes each .eml file and each .msg file as a single document.  Attachments are recursively unpacked and appended to the message body, so no matter how many attachments there are, a single document is indexed for each message.  Using the File Types table, you can set up rules to require each message to be treated as a container, with the message body and attachments each indexed as a separate document in the container.

Indexing attachments separately using dtSearch Desktop

To access the File Types table in dtSearch Desktop, click Options > Preferences > File Types.

To index .eml files as containers:

(1) Run dtSearch Desktop

(2) Click Options > Preferences > File Types

(3) Set up a rule defining *.eml as having the type "MIME Container"

(4) Check the box to "Override all other file type detection methods for these files"

(5) Index the files

To index .msg files as containers:

(1) Run dtSearch Desktop

(2) Click Options > Preferences > File Types

(3) Set up a rule defining *.msg as having the type "Outlook MSG Container"

(4) Check the box to "Override all other file type detection methods for these files"

(5) Index the files

Indexing attachments separately using the dtSearch Engine API

To index attachments separately through the dtSearch Engine API, first follow the steps above in dtSearch Desktop to create a filetype.xml file.  This file will be stored in your dtSearch user data folder.

In your application, set Options.FileTypeTableFile to the location of the filetype.xml file to use, then call Options.Save to save the settings change.

Filtering message bodies and attachments by extension

Include filters apply to items inside a container only if they have this format:

*.eml>*.doc

*.msg>*.doc

Message bodies have the extension .body.

For example, to index only .doc files and message bodies, use this filename filter in IndexJob:

*.eml *.eml>*.doc *.eml>*.body

The first *.eml selects the files that are indexed, and the other two expressions select what is indexed inside each .eml container.
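
The filter string above can also be assembled programmatically.  The following Python sketch only illustrates the string format described in this article.  The function name build_container_filter and its parameters are hypothetical and are not part of the dtSearch Engine API, and the sketch assumes that filters are separated by spaces, as in the example above.

```python
def build_container_filter(container_ext, inner_exts):
    """Build an include-filter string for one container type.

    container_ext: container extension without the dot, e.g. "eml" or "msg"
    inner_exts: extensions to index inside each container, e.g. ["doc", "body"]
    (Hypothetical helper for illustration; not a dtSearch API.)
    """
    container = "*." + container_ext
    # The first entry selects the container files themselves.
    parts = [container]
    # Each following entry selects items inside the container (container>item).
    for ext in inner_exts:
        parts.append(container + ">*." + ext)
    # Assumption: filters are separated by a single space.
    return " ".join(parts)


print(build_container_filter("eml", ["doc", "body"]))
# prints: *.eml *.eml>*.doc *.eml>*.body
```

The resulting string is what you would supply as the filename filter in IndexJob.  Calling the helper with "msg" instead of "eml" produces the equivalent filter for .msg containers.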

How to determine where a hit was found in a message

The filename returned in search results indicates whether the item is a message body (.body extension) or an attachment (any other extension).
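
As a rough illustration, this check can be made on the extension alone.  The Python sketch below relies only on the rule stated above (message bodies end in .body) and applies to items returned from messages indexed as containers.  The function name hit_location is hypothetical, the sample filenames are invented for illustration, and the case-insensitive comparison is an assumption.  Nothing is assumed about the rest of the filename that dtSearch returns.

```python
def hit_location(filename):
    """Return "body" if the hit is in a message body, else "attachment".

    Hypothetical helper for illustration; not a dtSearch API.
    """
    # Assumption: the extension is compared case-insensitively.
    if filename.lower().endswith(".body"):
        return "body"
    # Any other extension is treated as an attachment.
    return "attachment"


# Invented sample filenames; the real names come from your search results.
print(hit_location("sample.eml>message.body"))  # prints: body
print(hit_location("sample.eml>report.doc"))    # prints: attachment
```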

Attachments inherit the properties of the message they were sent with, so a search on a particular subject, sender, or recipient will find both the message body and the attachment.  However, a search on text in the message body will find only the message body.