Last Reviewed: February 10, 2019
Applies to: dtSearch 7.85 and later
Supported file formats
Other file formats
Image file formats
Embedded object and attachment extraction
dtSearch can automatically recognize, index, search and display documents, including graphic marking of hits and multiple hit and file navigation options, in the file formats listed below. HTML and PDF documents appear with all formatting and embedded images and links intact, exactly as in the original document. PDF files are displayed using Adobe Reader and a dtSearch plug-in to enable hit highlighting. dtSearch developer products can display XML files with XSL formatting. dtSearch converts other file types to HTML for display with highlighted hits. dtSearch uses its own built-in document filters for document parsing and display, unless otherwise noted. All file formats are supported through the current release versions, unless otherwise noted.
dtSearch file format support is included with all dtSearch products and can also be licensed separately -- for information, please contact dtSearch.
File type identification
dtSearch generally detects file formats by examining the actual file contents, not the extension or reported MIME type, so it is not affected by misleading filenames. For example, a Word document named "sample.exe" would still be identified as a Word document. In some ambiguous cases, such as distinguishing XML and HTML files, the extension is used as a clue.
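The idea of identifying a file by its contents rather than its name can be illustrated with a small sketch. This is not dtSearch's detection logic, which is not documented here. The signature table, the `sniff_format` function and the returned labels are all hypothetical, and only a handful of widely known leading byte signatures are checked. The extension is consulted only in the ambiguous markup case, mirroring the XML versus HTML example above.

```python
from pathlib import Path

# Illustrative signature table (an assumption for this sketch, not dtSearch's).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # also DOCX, XLSX, ODT, ...
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy DOC, XLS, PPT, MSG
    (b"{\\rtf", "rtf"),
    (b"Rar!\x1a\x07", "rar"),
]


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess a format from leading bytes; use the extension only as a tie-breaker."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name

    # Markup is ambiguous from content alone, so the extension serves as a clue.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<"):
        ext = Path(filename).suffix.lower()
        if ext in (".htm", ".html"):
            return "html"
        if ext == ".xml":
            return "xml"
        return "xml" if head.startswith(b"<?xml") else "html"

    return "unknown"
```

With this sketch, a file whose bytes begin with the OLE2 signature is reported as `ole2` even if it is named "sample.exe".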
File size limits
A single dtSearch index can hold up to 1 terabyte of text. dtSearch does not limit the number of indexes you can create.
Container file formats such as ZIP, MBX, PST, and CSV have no specific size limit. dtSearch can index files larger than 4 GB in these formats.
Individual documents can be up to 2 GB in size and will be indexed fully. dtSearch uses efficient memory management to handle even very large files. If a file is too large to be processed using available memory, the file will be skipped and its name recorded in the log of indexing errors.
If a single file is larger than 2 GB and does not appear to be in a recognized container format, dtSearch will handle it as an unrecognized binary file and use the filtering algorithm to extract text from the file.
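The size rules above amount to a small decision procedure, sketched below for illustration only. The function name, its parameters and the returned labels are hypothetical. The sketch also assumes that "2 GB" means 2 × 1024³ bytes and that the checks are applied in the order shown, neither of which the text specifies.

```python
TWO_GB = 2 * 1024**3  # assumption: binary gigabytes


def plan_for_file(size_bytes: int, is_container: bool, fits_in_memory: bool = True) -> str:
    """Return a label describing how a file of this size would be handled."""
    if is_container:
        # ZIP, MBX, PST, CSV and similar: no specific size limit.
        return "index container contents"
    if size_bytes > TWO_GB:
        # Too large for a single document and not a recognized container.
        return "filter as unrecognized binary"
    if not fits_in_memory:
        # Cannot be processed with available memory.
        return "skip and log"
    return "index fully"
```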
Document filters overview:
"Filters and Supported Data"
International language support:
dtSearch supports all languages through Unicode support. See "Unicode Support" and "International Language Support".
See "How to index databases with the dtSearch Engine."
Dynamically generated content produced by ASP.NET, CMS, SharePoint and similar products (*.jsp, *.asp, *.aspx, *.php, etc.):
See "How to use dtSearch Web with dynamically-generated web sites".
GroupWise, Lotus Notes, and other message archive formats:
See "Email conversion tools".
To use IFilters to add support for unsupported formats:
See "How to use dtSearch with IFilters".
For scanned document data that requires OCR:
See "How to use dtSearch or dtSearch Web with OCR".
Adobe FrameMaker MIF (*.mif)
Adobe Photoshop images (metadata only) (*.psd)
Ami Pro (*.sam)
ANSI Text (*.txt)
Apple iWork Keynote 2009 (*.key)
Apple iWork Numbers 2009 (*.numbers)
Apple iWork Pages 2009 (*.pages)
ASF media files (metadata only) (*.asf)
CSV (Comma-separated values) (*.csv)
EML (emails saved by Outlook Express) (*.eml)
Enhanced Metafile Format (*.emf)
EMF Spool (*.spl)
Eudora MBX message files (*.mbx)
Hancom Hanword (*.hwp)
Hancom Hanword 97 (*.hwp)
HTML (*.htm, *.html)
Ichitaro (versions 5 and later) (*.jtd, *.jbw)
Lotus 1-2-3 (*.123, *.wk?)
MBOX email archives such as Thunderbird, including attachments (see note 5) (*.mbx)
MHT archives (web pages saved by Internet Explorer in the "Web archive, single file" format) (*.mht)
MIME messages, including attachments (see note 5)
MSG (emails saved by Outlook), including attachments (see note 5) (*.msg)
Microsoft Access 95, 97, 2000, 2003, 2007, 2010, 2013, and 2016 MDB (see note 1) (*.mdb, *.accdb)
Microsoft Document Imaging (*.mdi)
Microsoft Excel for Mac 2.2, 3, 4, 5, 98, 2001, X, 2004, 2008, 2011
Microsoft Excel for Windows 2, 3, 4, 5
Microsoft Excel 95, 97, 2000, XP, 2003, 2007, 2010, 2013, 2016 (*.xls)
Microsoft Excel 2003 XML (*.xml)
Microsoft Excel Office Open XML 2007, 2010, 2013, and 2016 (*.xlsx)
Microsoft OneNote 2007, 2010, 2013, and 2016 (*.one)
Microsoft Outlook 97, 2000, 2003, 2007, 2010, 2013, and 2016 data files, including attachments (see note 5) (*.PST, *.OST)
Microsoft Outlook/Exchange Messages, Notes, Contacts, Appointments, and Tasks (see note 2)
Microsoft Outlook Express 5 and 6 (*.dbx) message stores
Microsoft PowerPoint 3, 4, 95, 97, 98, 2000, 2001, 2002, 2003, 2004, 2007, 2008, 2010, 2011, 2013, 2016 (*.ppt)
Microsoft PowerPoint Office Open XML 2007, 2010, 2013, and 2016 (*.pptx)
Microsoft Rich Text Format (*.rtf)
Microsoft Searchable TIFF (*.tiff)
Microsoft Word for DOS 1, 2, 3, 4, 5, 6 (*.doc)
Microsoft Word for Mac 1, 3, 4, 5, 6, 98, 2001, X, 2004, 2008, 2011
Microsoft Word for Windows 1, 2, 6 (*.doc)
Microsoft Word 95, 97, 98, 2000, 2002, 2003, 2007, 2010, 2013, 2016 (*.doc)
Microsoft Word 2003 XML (*.xml)
Microsoft Word Office Open XML 2007, 2010, 2013, 2016 (*.docx)
Microsoft Works WP (*.wks)
MP3 (metadata only) (*.mp3)
Multimate Advantage II (*.dox)
Multimate version 4 (*.doc)
OpenOffice/LibreOffice versions 1, 2, 3, 4, and 5 documents, spreadsheets, and presentations (*.sxc, *.sxd, *.sxi, *.sxw, *.sxg, *.stc, *.sti, *.stw, *.stm, *.odt, *.ott, *.odg, *.otg, *.odp, *.otp, *.ods, *.ots, *.odf) (includes OASIS Open Document Format for Office Applications)
PDF 1.x files (*.pdf) (see note 6)
PDF 2.x files (*.pdf) (see note 7)
PDF Portfolio files (*.pdf), including embedded non-PDF documents.
Quattro Pro (*.wb1, *.wb2, *.wb3, *.qpw)
QuickTime (*.mov, *.m4a, *.m4v)
RAR (*.rar) (see note 4)
TIFF (metadata only) (*.tif)
Treepad HJT files (*.hjt)
Unicode (UCS16, Mac or Windows byte order, or UTF-8)
Visio XML files (*.vdx)
Windows Metafile Format (*.wmf)
WMA media files (metadata only) (*.wma)
WMV video files (metadata only) (*.wmv)
WordPerfect 4.2 (*.wpd, *.wpf)
WordPerfect (5.0 and later) (*.wpd, *.wpf)
WordStar version 1, 2, 3 (*.ws)
WordStar versions 4, 5, 6 (*.ws)
XBase (including FoxPro, dBase, and other XBase-compatible formats) (*.dbf)
XML Paper Specification (*.xps)
ZIP (*.zip) (PKZIP 2.0-compatible)
1. Databases. Beginning with version 7.54, dtSearch no longer uses ODBC or any Microsoft database drivers to index Microsoft Access files. Earlier versions relied on ODBC to parse Access files. Each record of a database is indexed as a separate document. For information on indexing SQL databases, see "How to index databases with the dtSearch Engine".
2. Outlook and Exchange. dtSearch Desktop/Network can index Outlook and Exchange message stores using MAPI. For more information, see "How to index Outlook and Exchange messages with dtSearch". dtSearch versions 7.77 and later can also index Outlook PST and OST files directly, without using Outlook or MAPI.
3. Web Sites. dtSearch products include a spider that can index and search dynamically generated or static content on web sites. For more information, see "How to use dtSearch Web with dynamically-generated web sites".
4. RAR Support. RAR support currently applies to the Windows and Linux versions of dtSearch only.
5. Attachments. In all supported email formats, attachments, including nested attachments (for example, a .doc inside a ZIP attached to an email), are indexed as part of the main document by default. For options to index attachments separately, see "How to index attachments separately from email messages".
6. PDF Support. Encrypted PDF files cannot be indexed unless the PDF file can be opened without a password and the PDF file permissions allow text extraction. For more information, see "Security passwords on PDF files".
7. PDF 2.0 Support. dtSearch 7.93 adds preliminary support for the new PDF 2.0 file format. PDF 2.0 is the first major change in the PDF file format since PDF 1.0 in 1993. Support is "preliminary" because, although some tools can already open PDF 2.0 files, end-user commercial software products have not yet started to generate PDF 2.0 output, so there is almost no data available for testing. Because this new PDF version changes the header information, dtSearch versions before 7.93 will not recognize the PDF 2.0 file format and will miss all content in these files. It is therefore essential to use dtSearch 7.93 or later before attempting to index and search PDF 2.0 files.
8. Office 365. Supported Microsoft Office formats are also supported when saved from Office 365.
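Because the PDF 2.0 note turns on the version recorded in the file header, it can be useful to find PDF 2.x files in a collection before indexing with an older dtSearch release. The sketch below is a hypothetical helper, not part of dtSearch. It relies only on the fact that a PDF file begins with a header of the form `%PDF-x.y`, and it reads nothing beyond that header.

```python
import re

# A PDF header looks like "%PDF-1.7" or "%PDF-2.0" near the start of the file.
_PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    match = _PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pdf_2x(data: bytes) -> bool:
    """True when the header declares PDF version 2.0 or later."""
    version = pdf_header_version(data)
    return version is not None and version[0] >= 2
```

For example, passing the first kilobyte of each *.pdf file to `is_pdf_2x` would flag the files that need dtSearch 7.93 or later.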
The dtSearch Engine automatically detects fields in the following file formats:
Email files (Outlook Express, Eudora, MBOX, EML)
To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Attachments
Outlook items and .MSG files
To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Sent Date, Delivered Date, Attachments, contact fields (StreetAddress, CompanyName, etc.)
Microsoft Word, Excel, PowerPoint
Document summary information fields
OpenOffice/Open Document Format
Document properties fields
META tags. If enabled in Options.FieldFlags, <TITLE> is indexed as the HtmlTitle field, and <H1>, <H2>, <H3> are indexed as HtmlH1, HtmlH2, HtmlH3, etc.
All fields (CSV, or comma-separated values, files must have a .csv extension, a list of field names in the first line, and must use tab, comma, or semicolon delimiters)
Document summary information fields
All metadata fields
EXIF and IPTC metadata fields; XMP (Vista) metadata supported in version 7.40
ASF, WMA, WMV
All metadata fields
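The CSV requirements listed among the field rules above (a .csv extension, field names in the first line, and tab, comma, or semicolon delimiters) can be checked before indexing. The sketch below is only an illustration. The function name is hypothetical, and guessing the delimiter by counting occurrences in the first line is an assumption of this example, not a description of how dtSearch parses CSV files.

```python
import csv
from pathlib import Path

ALLOWED_DELIMITERS = ("\t", ",", ";")


def csv_field_names(filename: str, text: str):
    """Return the field names from the first line, or None if the file does not qualify."""
    if Path(filename).suffix.lower() != ".csv":
        return None

    lines = text.splitlines()
    if not lines or not lines[0].strip():
        return None

    first_line = lines[0]
    # Pick whichever allowed delimiter occurs most often in the header line.
    delimiter = max(ALLOWED_DELIMITERS, key=first_line.count)
    row = next(csv.reader([first_line], delimiter=delimiter))
    names = [name.strip() for name in row]

    # Every column needs a non-empty field name.
    return names if all(names) else None
```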
dtSearch will still index, search, and display other file formats, but they will be treated as binary file types. In other words, binary codes and similar non-text data will be displayed along with the text. dtSearch can also use a proprietary binary file filtering algorithm to clean up these file formats. For more information, see Indexing Options in the dtSearch help file.
For legacy file types in which multiple messages or log entries are stored in one very large text file, use the dtSearch File Segmentation Rules feature to tell dtSearch how to break up the file into multiple logical subdocuments. For more information, see File Segmentation Rules in the dtSearch help file.
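To make the idea of segmentation concrete, the sketch below splits one large text into logical subdocuments wherever a line matches a "start of entry" pattern. It does not use or imitate the dtSearch File Segmentation Rules syntax, which is described in the help file. The function name and the regular-expression approach are assumptions made for illustration.

```python
import re


def split_into_subdocuments(text: str, start_pattern: str):
    """Split text into segments, each beginning where start_pattern matches."""
    starts = [m.start() for m in re.finditer(start_pattern, text, flags=re.MULTILINE)]
    if not starts:
        return [text]  # no boundaries found: the whole file is one document

    segments = []
    if starts[0] > 0:
        segments.append(text[:starts[0]])  # keep any preamble before the first entry

    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        segments.append(text[begin:end])
    return segments
```

For a mailbox-style text file, a pattern such as `r"^From "` would treat each line beginning with "From " as the start of a new message.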
dtSearch products can extract and display embedded images in these document formats: Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), Access (*.mdb/*.accdb), RTF, email files including Thunderbird (mbox/*.eml) and Outlook (*.pst/*.msg) files, and OneNote 2007 through OneNote 2016 (*.one). Images are displayed using the HTML <img> tag and are not converted, so only images that a browser can display, such as *.jpg and *.png, will appear.
Additionally, dtSearch can display HTML and PDF files with hits highlighted, including embedded images.
Embedded objects and attachments are indexed as part of the document that contains them. For example, a spreadsheet object embedded in a PowerPoint presentation would be treated as part of the PowerPoint presentation.
For applications that require direct access to embedded objects and attachments, the ExtractionOptions API provides a way to extract embedded objects and attachments from a document into a folder tree. For information on this API, see ExtractionOptions (.NET and Java) or dtsExtractionOptions (C++). Extraction of embedded objects and attachments is supported in these formats: attachments in MIME emails (mbox/*.eml), Outlook messages (*.pst/*.msg), Outlook Express (*.dbx), TNEF (winmail.dat), PDF, Access (*.mdb/*.accdb), OneNote 2007 through OneNote 2016 (*.one); objects in Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), and RTF.
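As a rough illustration of what extracting attachments into a folder means, the sketch below saves the attachments of a single MIME (*.eml) message using only the Python standard library. It does not call the dtSearch ExtractionOptions API, whose members are documented separately. The function name is hypothetical, the output is a flat folder rather than a folder tree, and nested messages and embedded Office objects are deliberately not handled.

```python
from email import message_from_bytes, policy
from pathlib import Path


def extract_attachments(eml_bytes: bytes, out_dir: str):
    """Write each top-level attachment of a MIME message into out_dir."""
    message = message_from_bytes(eml_bytes, policy=policy.default)
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, part in enumerate(message.iter_attachments()):
        # Keep only the final path component so a crafted name cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment-{index}").name
        payload = part.get_payload(decode=True)
        if payload is None:
            continue  # multipart parts (for example, attached messages) are skipped
        target = target_dir / name
        target.write_bytes(payload)
        saved.append(target)
    return saved
```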