What file formats does dtSearch support?

Last Reviewed: January 5, 2017

Article: DTS0103

Applies to: dtSearch 7.85 and later

Supported file formats

Automatically-detected fields

Other file formats

Image file formats

Embedded object and attachment extraction

dtSearch can automatically recognize, index, search and display documents, including graphic marking of hits and multiple hit and file navigation options, in the file formats listed below.  HTML and PDF documents appear with all formatting and embedded images and links intact, exactly as in the original document.  PDF files are displayed using Adobe Reader and a dtSearch plug-in to enable hit highlighting.  dtSearch developer products can display XML files with XSL formatting.  dtSearch converts other file types to HTML for display with highlighted hits.  dtSearch uses its own built-in document filters for document parsing and display, unless otherwise noted.  All file formats are supported through the current release versions, unless otherwise noted.

dtSearch file format support is included with all dtSearch products and can also be licensed separately -- for information, please contact dtSearch.

File type identification  
dtSearch generally detects file formats by examining the actual file contents, not the extension or reported MIME type, so it is not affected by misleading filenames.  For example, a Word document named "sample.exe" would still be identified as a Word document.  In some ambiguous cases, such as distinguishing XML and HTML files, the extension is used as a clue.  
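The general idea of content-based detection can be sketched in a few lines of Python. This is only an illustration of the technique, not dtSearch's actual detection code: the signature table, the returned labels, and the function name sniff_format are assumptions made for the example. It checks well-known leading byte signatures first and falls back to the extension only for the ambiguous XML/HTML case.

```python
from pathlib import Path

# A few well-known leading byte signatures (illustrative subset only).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),    # also the outer wrapper of DOCX, XLSX, PPTX, ODF
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, PPT, MSG
    (b"\x1f\x8b", "gzip"),
    (b"{\\rtf", "rtf"),
]


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess a format from file contents; the name is only a tie-breaker."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label

    head = data[:512].lstrip().lower()
    if head.startswith((b"<?xml", b"<!doctype html", b"<html")):
        # Ambiguous markup: use the extension as a clue, as described above.
        ext = Path(filename).suffix.lower()
        if ext in (".htm", ".html"):
            return "html"
        if ext == ".xml":
            return "xml"
        return "html" if b"<html" in head else "xml"
    return "unknown"
```

With this sketch, a legacy Word file renamed to "sample.exe" is still reported as "ole2", because the extension is never consulted when a signature matches.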

File size limits
A single dtSearch index can hold up to 1 terabyte of text.  dtSearch does not limit the number of indexes you can create.

Container file formats such as ZIP, MBX, PST, and CSV have no specific size limit.  dtSearch can index files larger than 4 GB in these formats.

Individual documents can be up to 2 GB in size and will be indexed fully.  dtSearch uses efficient memory management to handle even very large files.  If a file is too large to be processed using available memory, the file will be skipped and its name recorded in the log of indexing errors.  

If a single file is larger than 2 GB and does not appear to be in a recognized container format, dtSearch will handle it as an unrecognized binary file and use the binary file filtering algorithm to extract text from the file.  
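The size rules above amount to a small decision table, sketched here in Python. The function name, the returned labels, and the reading of "2 GB" as 2 x 1024^3 bytes are assumptions for illustration. The sketch does not model the case where a file is skipped because it cannot be processed in available memory.

```python
TWO_GB = 2 * 1024**3  # assumption: "2 GB" is read as 2 GiB


def handling_for(size_bytes: int, recognized_container: bool) -> str:
    """Map a file to the handling described above (labels are hypothetical)."""
    if recognized_container:
        # ZIP, MBX, PST, CSV and similar containers: no specific size limit.
        return "container: no specific size limit"
    if size_bytes <= TWO_GB:
        return "document: indexed fully"
    # Larger than 2 GB and not a recognized container.
    return "unrecognized binary: text extracted by filtering"
```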

Related Topics

Document filters overview:

See "Document Filters and Supported Data"

International language support:

dtSearch supports all languages through Unicode support. See "Unicode Support" and "International Language Support".

SQL databases:

See "How to index databases with the dtSearch Engine."

Dynamically-generated content from ASP.NET, CMS, SharePoint and similar products (*.jsp, *.asp, *.aspx, *.php, etc.):

See "How to use dtSearch Web with dynamically-generated web sites".

GroupWise, Lotus Notes, and other message archive formats:

See "Email conversion tools".

To use IFilters to add support for unsupported formats:

See "How to use dtSearch with IFilters".

For scanned document data that requires OCR:

See "How to use dtSearch or dtSearch Web with OCR"

Supported file formats

Adobe FrameMaker MIF (*.mif)

Adobe Photoshop images (metadata only) (*.psd)

Ami Pro (*.sam)

ANSI Text (*.txt)

Apple iWork Keynote 2009 (*.key)

Apple iWork Numbers 2009 (*.numbers)

Apple iWork Pages 2009 (*.pages)

ASCII Text

ASF media files (metadata only) (*.asf)

CSV (Comma-separated values) (*.csv)

DBF (*.dbf)

EBCDIC

EML (emails saved by Outlook Express) (*.eml)

Enhanced Metafile Format (*.emf)

EMF Spool (*.spl)

Eudora MBX message files (*.mbx)

Flash (*.swf)

GZIP (*.gz)

HTML (*.htm, *.html)

iCalendar (*.ics)

Ichitaro (versions 5 and later) (*.jtd, *.jbw)

JPEG (*.jpg)

Lotus 1-2-3 (*.123, *.wk?)

MBOX email archives such as Thunderbird, including attachments (see note 5) (*.mbx)

MHT archives (HTML archives saved by Internet Explorer) (*.mht)

MIME messages, including attachments (see note 5)

MSG (emails saved by Outlook), including attachments (see note 5) (*.msg)

Microsoft Access 95, 97, 2000, 2003, 2007, 2010, 2013, and 2016 MDB (see note 1) (*.mdb, *.accdb)

Microsoft Document Imaging (*.mdi)

Microsoft Excel for Mac 2.2, 3, 4, 5, 98, 2001, X, 2004, 2008, 2011

Microsoft Excel for Windows 2, 3, 4, 5

Microsoft Excel 95, 97, 2000, XP, 2003, 2007, 2010, 2013, 2016  (*.xls)

Microsoft Excel 2003 XML (*.xml)

Microsoft Excel Office Open XML 2007, 2010, 2013, and 2016 (*.xlsx)

Microsoft OneNote 2007, 2010, 2013, and 2016 (*.one)

Microsoft Outlook 97, 2000, 2003, 2007, 2010, 2013, and 2016 data files, including attachments (see note 5) (*.PST, *.OST)

Microsoft Outlook/Exchange Messages, Notes, Contacts, Appointments, and Tasks (see note 2)

Microsoft Outlook Express 5 and 6 (*.dbx) message stores

Microsoft PowerPoint 3, 4, 95, 97, 98, 2000, 2001, 2002, 2003, 2004, 2007, 2008, 2010, 2011, 2013, 2016 (*.ppt)

Microsoft PowerPoint Office Open XML  2007, 2010, 2013, and 2016 (*.pptx)

Microsoft Rich Text Format (*.rtf)

Microsoft Searchable TIFF (*.tiff)

Microsoft Word for DOS 1, 2, 3, 4, 5, 6 (*.doc)

Microsoft Word for Mac 1, 3, 4, 5, 6, 98, 2001, X, 2004, 2008, 2011

Microsoft Word for Windows 1, 2, 6 (*.doc)

Microsoft Word 95, 97, 98, 2000, 2002, 2003, 2007, 2010, 2013, 2016 (*.doc)

Microsoft Word 2003 XML (*.xml)

Microsoft Word Office Open XML 2007, 2010, 2013, 2016 (*.docx)

Microsoft Works WP (*.wks)

MP3 (metadata only) (*.mp3)

Multimate Advantage II (*.dox)

Multimate version 4 (*.doc)

OpenOffice/LibreOffice versions 1, 2, 3, 4, and 5 documents, spreadsheets, and presentations (*.sxc, *.sxd, *.sxi, *.sxw, *.sxg, *.stc, *.sti, *.stw, *.stm, *.odt, *.ott, *.odg, *.otg, *.odp, *.otp, *.ods, *.ots, *.odf) (includes OASIS Open Document Format for Office Applications)

PDF files (*.pdf) (see note 6)

PDF Portfolio files (*.pdf), including embedded non-PDF documents.

Quattro Pro (*.wb1, *.wb2, *.wb3, *.qpw)

QuickTime (*.mov, *.m4a, *.m4v)

RAR (*.rar) (see note 4)

TAR (*.tar)

TIFF (metadata only) (*.tif)

TNEF (winmail.dat)

TreePad HJT files (*.hjt)

Unicode (UCS16, Mac or Windows byte order, or UTF-8)

Visio XML files (*.vdx)

Windows Metafile Format (*.wmf)

WMA media files (metadata only) (*.wma)

WMV video files (metadata only) (*.wmv)

WordPerfect 4.2 (*.wpd, *.wpf)

WordPerfect (5.0 and later) (*.wpd, *.wpf)

WordStar version 1, 2, 3 (*.ws)

WordStar versions 4, 5, 6 (*.ws)

WordStar 2000

Write (*.wri)

XBase (including FoxPro, dBase, and other XBase-compatible formats) (*.dbf)

XML (*.xml)

XML Paper Specification (*.xps)

XSL

XyWrite

ZIP (*.zip) (PKZIP 2.0-compatible)

Notes

[1] Databases. Beginning with version 7.54, dtSearch no longer uses ODBC or any Microsoft database drivers to index Microsoft Access files.  Earlier versions relied on ODBC to parse Access files.  Each record of a database is indexed as a separate document.  For information on indexing SQL databases, see "How to index databases with the dtSearch Engine."

[2] Outlook and Exchange.  dtSearch Desktop/Network can index Outlook and Exchange message stores using MAPI.  For more information, see How to index Outlook and Exchange messages with dtSearch. dtSearch versions 7.77 and later can also index Outlook PST and OST files directly, without using Outlook or MAPI.

[3] Web Sites. dtSearch products include a spider that can index and search dynamically-generated content or static content on web sites.  For more information, see the entry on dynamically-generated content under Related Topics above.

[4] RAR Support. RAR support currently applies to the Windows and Linux versions of dtSearch only.

[5] Attachments.  In all supported email formats, attachments, including nested attachments (for example, a .doc inside a ZIP attached to an email), are indexed as part of the main document by default.  For options to index attachments separately, see How to index attachments separately from email messages.

[6] PDF Support.  Encrypted PDF files cannot be indexed, unless the PDF file can be opened without a password and the PDF file permissions allow for text extraction.  For more information, see Security passwords on PDF files.

Automatically-detected fields

The dtSearch Engine automatically detects fields in the following file formats.  Each entry gives the file format, followed by the fields detected:

Email files (Outlook Express, Eudora, MBOX, EML):  To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Attachments

Outlook items and .MSG files:  To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Sent Date, Delivered Date, Attachments, contact fields (StreetAddress, CompanyName, etc.)

Microsoft Word, Excel, PowerPoint:  Document summary information fields

OpenOffice/Open Document Format:  Document properties fields

HTML:  META tags; if enabled in Options.FieldFlags, <TITLE> is indexed as the HtmlTitle field; <H1>, <H2>, <H3> are indexed as HtmlH1, HtmlH2, HtmlH3, etc.

XML:  All fields

DBF:  All fields

CSV:  All fields (CSV, or comma-separated values, files must have a .csv extension, a list of field names in the first line, and must use tab, comma, or semicolon delimiters)

PDF:  Document Properties

WordPerfect:  Document summary information fields

MP3:  All metadata fields

JPG, TIFF:  EXIF and IPTC metadata fields; XMP (Vista) metadata supported in version 7.40

ASF, WMA, WMV:  All metadata fields
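The three CSV conditions listed above (a .csv extension, field names in the first line, and a tab, comma, or semicolon delimiter) can be checked before indexing with a short Python sketch. The function name csv_field_names and the rule of choosing the most frequent allowed delimiter in the header line are assumptions for illustration. They do not describe how dtSearch itself parses CSV files.

```python
import csv
import io
from pathlib import Path
from typing import List, Optional

ALLOWED_DELIMITERS = "\t,;"  # tab, comma, semicolon


def csv_field_names(filename: str, text: str) -> Optional[List[str]]:
    """Return the header field names, or None if the basic conditions fail."""
    if Path(filename).suffix.lower() != ".csv":
        return None  # wrong extension

    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.strip():
        return None  # no header line to take field names from

    # Assumption: the delimiter is whichever allowed character occurs most
    # often in the header line (ties go to the earlier one in the list).
    delimiter = max(ALLOWED_DELIMITERS, key=first_line.count)
    row = next(csv.reader(io.StringIO(first_line), delimiter=delimiter))
    return [name.strip() for name in row]
```

For example, a file named "people.csv" whose first line is "name;email;city" yields the three field names, while the same text saved as "people.txt" yields None.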

 

Other File Formats

dtSearch will still index, search, and display other file formats, but they will be treated as binary file types. In other words, binary codes and other non-text content will be displayed along with the text. dtSearch can also use a proprietary binary file filtering algorithm to clean up these file formats. For more information, see Indexing Options in the dtSearch help file.
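The dtSearch filtering algorithm is proprietary, so the following Python sketch shows only the generic technique of pulling readable runs out of binary data, similar to the Unix "strings" utility. The function name, the ASCII-only character range, and the minimum run length of 4 are assumptions for illustration.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII characters."""
    # \x20-\x7e covers space through tilde; everything else is treated as noise.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

Short fragments such as two-character sequences between binary codes are dropped, which is the "clean up" effect the paragraph above refers to in general terms.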

For legacy file types in which multiple messages or log entries are stored in one very large text file, use the dtSearch File Segmentation Rules feature to tell dtSearch how to break up the file into multiple logical subdocuments. For more information, see File Segmentation Rules in the dtSearch help file.
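Conceptually, segmentation means finding the lines that mark where each message or log entry begins and cutting the file at those points. The Python sketch below illustrates that idea with a regular expression. It does not use the File Segmentation Rules syntax, and the function name and the example boundary pattern are assumptions for illustration.

```python
import re
from typing import List


def split_into_subdocuments(text: str, boundary_pattern: str) -> List[str]:
    """Cut text into pieces, each starting where boundary_pattern matches."""
    boundary = re.compile(boundary_pattern, re.MULTILINE)
    starts = [match.start() for match in boundary.finditer(text)]
    if not starts:
        return [text]  # no boundaries found: keep the file as one document
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first boundary
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]
```

For a mail-style log in which every entry begins with a line starting "From ", the pattern r"^From " splits the file into one piece per entry.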

Image Formats

dtSearch products can extract and display embedded images in these document formats:  Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), Access (*.mdb/*.accdb), RTF, email files including Thunderbird (mbox/*.eml) and Outlook (*.pst/*.msg) files, and OneNote 2007 through OneNote 2016 (*.one).  Images are displayed using the HTML <img> tag and are not converted, so only images such as *.jpg and *.png that can be displayed in a browser will appear.

Additionally, dtSearch can display HTML and PDF files with hits highlighted, including embedded images.

Embedded Object and Attachment Extraction

Embedded objects and attachments are indexed as part of the document that contains them.  For example, a spreadsheet object embedded in a PowerPoint presentation would be treated as part of the PowerPoint presentation.  

For applications that require direct access to embedded objects and attachments, the ExtractionOptions API provides a way to extract embedded objects and attachments from a document into a folder tree.  For information on this API, see ExtractionOptions (.NET and Java) or dtsExtractionOptions (C++).

Extraction of embedded objects and attachments is supported in these formats:  attachments in MIME emails (mbox/*.eml), Outlook messages (*.pst/*.msg), Outlook Express (*.dbx), TNEF (winmail.dat), PDF, Access (*.mdb/*.accdb), and OneNote 2007 through OneNote 2016 (*.one); objects in Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), and RTF.