What file formats does dtSearch support?

Article: dts0103

dtSearch can automatically recognize, index, search and display documents, including graphic marking of hits and multiple hit and file navigation options, in the file formats listed below.  HTML and PDF documents appear with all formatting and embedded images and links intact, exactly as in the original document.  PDF files are displayed using Adobe Reader and a dtSearch plug-in to enable hit highlighting.  dtSearch developer products can display XML files with XSL formatting.  dtSearch converts other file types to HTML for display with highlighted hits.  dtSearch uses its own built-in document filters for document parsing and display, unless otherwise noted.  All file formats are supported through the current release versions, unless otherwise noted.

dtSearch file format support is included with all dtSearch products and can also be licensed separately -- for information, please contact dtSearch.

File type identification

dtSearch generally detects file formats by examining the actual file contents, not the extension or reported MIME type, so it is not affected by misleading filenames.  For example, a Word document named "sample.exe" would still be identified as a Word document.  In some ambiguous cases, such as distinguishing XML and HTML files, the extension is used as a clue.  
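
To illustrate content-based identification in general terms, here is a minimal Python sketch that labels a file by its leading bytes and ignores its name.  The signature table and the function names sniff_format and sniff_file are assumptions made for this example; the article does not describe dtSearch's actual detection logic, which covers far more formats.

```python
# Illustrative only: a few widely documented file signatures ("magic bytes").
# This is NOT dtSearch's detection logic, which this article does not describe.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),                       # also the outer wrapper of .docx, .xlsx, .pptx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2"),  # legacy .doc, .xls, .ppt, .msg
    (b"\x1f\x8b", "GZIP"),
    (b"Rar!\x1a\x07", "RAR"),
    (b"{\\rtf", "RTF"),
]


def sniff_format(data: bytes) -> str:
    """Return a format label from the leading bytes, or "unknown"."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first bytes of a file and classify them; the filename is never consulted."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```

Under this sketch, a legacy Word document saved as "sample.exe" would still be labeled OLE2, because only the contents are examined.  A real detector would then look inside the OLE2 or ZIP wrapper to tell Word from Excel or PowerPoint.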

File size limits

A single dtSearch index can hold up to 1 terabyte of text.  dtSearch does not limit the number of indexes you can create.

Container file formats such as ZIP, MBX, PST, and CSV have no specific size limit.  dtSearch can index files larger than 4 GB in these formats.

Individual documents can be up to 2 GB in size and will be indexed fully.  dtSearch uses efficient memory management to handle even very large files.  If a file is too large to be processed using available memory, the file will be skipped and the name recorded in the log of indexing errors.  

If a single file is larger than 2 GB and does not appear to be in a recognized container format, dtSearch will handle it as an unrecognized binary file and use the filtering algorithm to extract text from the file.  
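
The size rules above can be summarized as a small decision function.  This is only a sketch for planning purposes: the function name expected_handling is hypothetical, the container set lists just the examples named above, and 1 GB is assumed to mean 2^30 bytes because the article does not say which definition it uses.  It does not model the case where a file is skipped for lack of available memory.

```python
GB = 1024 ** 3            # assumption: binary gigabytes
DOCUMENT_LIMIT = 2 * GB   # per-document limit stated in this article

# Only the container formats named as examples in this article.
CONTAINER_FORMATS = {"ZIP", "MBX", "PST", "CSV"}


def expected_handling(size_bytes: int, detected_format: str) -> str:
    """Return "container", "document", or "binary" according to the stated rules."""
    if detected_format.upper() in CONTAINER_FORMATS:
        return "container"   # no specific size limit
    if size_bytes <= DOCUMENT_LIMIT:
        return "document"    # indexed fully
    return "binary"          # over 2 GB, not a recognized container


if __name__ == "__main__":
    for size, fmt in [(5 * GB, "pst"), (1 * GB, "DOC"), (3 * GB, "unknown")]:
        print(fmt, size, expected_handling(size, fmt))
```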

Document filters overview:  Document Filters and Supported Data

International language support:  dtSearch supports all languages through Unicode support. See Unicode Support and International Language Support.

SQL databases: How to index databases with the dtSearch Engine.

Dynamically generated content from ASP.NET, CMS, SharePoint and similar products (*.jsp, *.asp, *.aspx, *.php, etc.): How to use dtSearch Web with dynamically-generated web sites.

GroupWise, Lotus Notes, and other message archive formats: Email conversion tools.

To use IFilters to add support for unsupported formats: How to use dtSearch with IFilters.

For scanned document data that requires OCR: How to use OCR output files with dtSearch products

Supported file formats

Adobe FrameMaker MIF (*.mif)

Ami Pro (*.sam)

ANSI Text (*.txt)

Apple iWork Keynote 2009 (*.key)

Apple iWork Numbers 2009 (*.numbers)

Apple iWork Pages 2009 (*.pages)

ASCII Text

CSV (Comma-separated values) (*.csv)

DBF (*.dbf)

EBCDIC

EML (emails saved by Outlook Express) (*.eml)

Enhanced Metafile Format (*.emf)

EMF Spool (*.spl)

Eudora MBX message files (*.mbx)

Flash (*.swf)

GZIP (*.gz)

Hancom Hanword (*.hwp)

Hancom Hanword 97 (*.hwp)

Hancom Hanword (*.hwpx) (versions 2021.02 and later)

HTML (*.htm, *.html)

iCalendar (*.ics)

Ichitaro (versions 5 and later) (*.jtd, *.jbw)

Lotus 1-2-3 (*.123, *.wk?)

MBOX email archives such as Thunderbird, including attachments (see note 5) (*.mbx)

MHT archives (web pages saved by Internet Explorer in the "Web archive, single file" format) (*.mht)

MIME messages, including attachments (see note 5)

MSG (emails saved by Outlook), including attachments (see note 5) (*.msg)

Microsoft Access 95, 97, 2000, 2003, 2007, 2010, 2013, and 2016 MDB (see note 1) (*.mdb, *.accdb)

Microsoft Excel for Mac 2.2, 3, 4, 5, 98, 2001, X, 2004, 2008, 2011

Microsoft Excel for Windows 2, 3, 4, 5

Microsoft Excel 95, 97, 2000, XP, 2003, 2007, 2010, 2013, 2016  (*.xls)

Microsoft Excel 2003 XML (*.xml)

Microsoft Excel Office Open XML 2007, 2010, 2013, and 2016 (*.xlsx)

Microsoft OneNote 2007, 2010, 2013, and 2016 (*.one)

Microsoft Outlook 97, 2000, 2003, 2007, 2010, 2013, and 2016 data files, including attachments (see note 5) (*.PST, *.OST)

Microsoft Outlook/Exchange Messages, Notes, Contacts, Appointments, and Tasks (see note 2)

Microsoft Outlook Express 5 and 6 (*.dbx) message stores

Microsoft PowerPoint 3, 4, 95, 97, 98, 2000, 2001, 2002, 2003, 2004, 2007, 2008, 2010, 2011, 2013, 2016 (*.ppt)

Microsoft PowerPoint Office Open XML  2007, 2010, 2013, and 2016 (*.pptx)

Microsoft Rich Text Format (*.rtf)

Microsoft Word for DOS 1, 2, 3, 4, 5, 6 (*.doc)

Microsoft Word for Mac 1, 3, 4, 5, 6, 98, 2001, X, 2004, 2008, 2011

Microsoft Word for Windows 1, 2, 6 (*.doc)

Microsoft Word 95, 97, 98, 2000, 2002, 2003, 2007, 2010, 2013, 2016 (*.doc)

Microsoft Word 2003 XML (*.xml)

Microsoft Word Office Open XML 2007, 2010, 2013, 2016 (*.docx)

Microsoft Works WP (*.wks)

Multimate Advantage II (*.dox)

Multimate version 4 (*.doc)

OpenOffice/LibreOffice versions 1, 2, 3, 4, and 5 documents, spreadsheets, and presentations (*.sxc, *.sxd, *.sxi, *.sxw, *.sxg, *.stc, *.sti, *.stw, *.stm, *.odt, *.ott, *.odg, *.otg, *.odp, *.otp, *.ods, *.ots, *.odf) (includes OASIS Open Document Format for Office Applications)

PDF 1.x files (*.pdf) (see note 6)

PDF 2.x files (*.pdf)  (see note 7)

PDF Portfolio files (*.pdf), including embedded non-PDF documents.

Quattro Pro (*.wb1, *.wb2, *.wb3, *.qpw)

RAR (*.rar) (see note 4)

TAR (*.tar)

TNEF (winmail.dat)

Treepad HJT files (*.hjt)

Unicode (UCS16, Mac or Windows byte order, or UTF-8)

Visio XML files (*.vdx)

Windows Metafile Format (*.wmf)

WordPerfect 4.2 (*.wpd, *.wpf)

WordPerfect (5.0 and later) (*.wpd, *.wpf)

WordStar version 1, 2, 3 (*.ws)

WordStar versions 4, 5, 6 (*.ws)

WordStar 2000

Write (*.wri)

XBase (including FoxPro, dBase, and other XBase-compatible formats) (*.dbf)

XML (*.xml)

XML Paper Specification (*.xps)

XSL

XyWrite

ZIP (*.zip) (PKZIP 2.0-compatible)

Media formats - metadata only
Adobe Photoshop images (*.psd)
APE (*.ape) (versions 2023.02 and later)
Audio Interchange Format (*.aiff) (versions 2023.02 and later)
ASF media files (*.asf)
Free Lossless Audio Codec (*.flac) (versions 2023.02 and later)
GIF (*.gif) (versions 2023.02 and later)
HEIF (*.heif) (versions 2023.02 and later)
JPEG (*.jpg)
Microsoft Searchable TIFF (*.tiff)
Microsoft Document Imaging (*.mdi)
MP3 (*.mp3)
MPEG-4 (*.m4a)
OGG (*.ogg) (versions 2023.02 and later)
OPUS (*.opus) (versions 2023.02 and later)
QuickTime (*.mov, *.m4a, *.m4v)
TIFF (*.tif)
WEBP (*.webp) (versions 2023.02 and later)
WAV (*.wav) (versions 2023.02 and later)
WMA media files  (*.wma)
WMV video files (*.wmv)

Notes

[1] Databases. Beginning with version 7.54, dtSearch no longer uses ODBC or any Microsoft database drivers to index Microsoft Access files.  Earlier versions relied on ODBC to parse Access files.  Each record of a database is indexed as a separate document.  See also: How to index databases with the dtSearch Engine

[2] Outlook and Exchange.  dtSearch Desktop/Network can index Outlook and Exchange message stores using MAPI.  For more information, see How to index Outlook and Exchange messages with dtSearch. dtSearch versions 7.77 and later can also index Outlook PST and OST files directly, without using Outlook or MAPI.

[3] Web Sites. dtSearch products include a spider that can index and search dynamically generated or static content on web sites.

[4] RAR Support. RAR support currently applies to the Windows and Linux versions of dtSearch only.

[5] Attachments.  In all supported email formats, attachments, including nested attachments (for example, a .doc inside a ZIP attached to an email) are indexed as part of the main document by default.  For options to index attachments separately, see How to index attachments separately from email messages.

[6] PDF Support.  Encrypted PDF files cannot be indexed, unless the PDF file can be opened without a password and the PDF file permissions allow for text extraction.  For more information, see Security passwords on PDF files.  

[7] PDF 2.0 Support.  dtSearch 7.93 adds support for the new PDF 2.0 file format. PDF 2.0 is the first major change in the PDF file format since PDF 1.0 in 1993.  Because this new PDF version changes the header information, dtSearch versions before 7.93 will not recognize the PDF 2.0 file format and will miss all content in these files. Therefore, it is essential to use dtSearch 7.93 or later before attempting to index and search PDF 2.0 files.

[8] Office 365.  Supported Microsoft Office formats are also supported when saved from Office 365.
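
As note 7 explains, PDF 2.0 files are identified by their header.  Before indexing with an older release, it may help to find which files in a collection are affected.  The following Python sketch reads the version from the "%PDF-major.minor" header; the function names pdf_version and needs_dtsearch_793 are hypothetical, and the sketch assumes the header version is authoritative and appears within the first 1024 bytes.

```python
import re

# The PDF header has the form "%PDF-<major>.<minor>", e.g. "%PDF-1.7" or "%PDF-2.0".
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_dtsearch_793(data: bytes) -> bool:
    """True when the header declares PDF 2.x, which note 7 says requires dtSearch 7.93 or later."""
    version = pdf_version(data)
    return version is not None and version[0] >= 2


def check_file(path: str) -> bool:
    """Apply the header check to the first bytes of a file on disk."""
    with open(path, "rb") as handle:
        return needs_dtsearch_793(handle.read(1024))
```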

Automatically-detected fields

dtSearch automatically detects fields in the following file formats:

 

File format: Fields

Email files (Outlook Express, Eudora, MBOX, EML): To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Attachments

Outlook items and .MSG files: To, CC, BCC, From, Sent Via, Sender, Recipient, Subject, Date, Sent Date, Delivered Date, Attachments, contact fields (StreetAddress, CompanyName, etc.)

Microsoft Word, Excel, PowerPoint: Document summary information fields

OpenOffice/Open Document Format: Document properties fields

HTML: META tags; if enabled in Options.FieldFlags, <TITLE> is indexed as HtmlTitle field; <H1>, <H2>, <H3> are indexed as HtmlH1, HtmlH2, HtmlH3, etc.

XML: All fields

DBF: All fields

CSV: All fields (CSV, or comma-separated values, files must have a .csv extension, a list of field names in the first line, and must use tab, comma, or semicolon delimiters)

PDF: Document Properties

WordPerfect: Document summary information fields

AAC: MPEG4, XMP
AIFF: RIFF, XMP
APE: APEv2
FLAC: Vorbis
GIF: XMP
HEIF: EXIF, XMP
JPG: EXIF, IPTC, XMP
MP3: ID3
MOV: QuickTime, XMP
ASF (*.wmv, *.wma): ASF tags
OGG: Vorbis
OPUS: Vorbis
PNG: EXIF, IPTC, XMP
TIF: EXIF, IPTC, XMP
WAV: RIFF, XMP
WEBP: RIFF, XMP

XMP, RIFF, Vorbis, and APEv2 metadata support applies to dtSearch versions 2023.02 and later.
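
The CSV conditions listed above (a .csv extension, field names in the first line, and tab, comma, or semicolon delimiters) can be checked before indexing.  The Python sketch below is a rough pre-flight check, not part of dtSearch: the function name check_csv is hypothetical, the delimiter is guessed by counting characters in the first line, and a first line with no delimiter at all (for example, a single-column file) is reported as a problem even though it may be acceptable.

```python
import csv
import io
import os

ALLOWED_DELIMITERS = "\t,;"  # the delimiters named in this article


def check_csv(filename: str, text: str) -> list:
    """Return a list of problems; an empty list means the stated conditions appear to be met."""
    problems = []
    if os.path.splitext(filename)[1].lower() != ".csv":
        problems.append("extension is not .csv")

    lines = text.splitlines()
    header = lines[0] if lines else ""

    # Guess the delimiter as the allowed character that occurs most often in the first line.
    counts = {d: header.count(d) for d in ALLOWED_DELIMITERS}
    delimiter = max(counts, key=counts.get)
    if counts[delimiter] == 0:
        problems.append("no tab, comma, or semicolon delimiter in first line")
        return problems

    # Heuristic: every column should have a non-empty name.
    names = next(csv.reader(io.StringIO(header), delimiter=delimiter))
    if any(not name.strip() for name in names):
        problems.append("first line has an empty field name")
    return problems
```

For example, check_csv("people.csv", "name;email\nAnn;ann@example.com\n") returns an empty list, while the same text under the name "people.txt" reports the extension.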

Other File Formats

dtSearch will still index, search, and display other file formats, but they will be treated as binary file types.  In other words, binary codes will be displayed along with the text.  dtSearch can also use a proprietary binary file filtering algorithm to clean up these file formats.  For more information, see Indexing Options in the dtSearch help file.

For legacy file types in which multiple messages or log entries are stored in one very large text file, use the dtSearch File Segmentation Rules feature to tell dtSearch how to break up the file into multiple logical subdocuments. For more information, see File Segmentation Rules in the dtSearch help file.
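
To show the general idea of breaking one large text file into logical subdocuments, here is a Python sketch that splits text wherever a line matches a boundary pattern.  It is a conceptual illustration only: it does not use or reproduce the dtSearch File Segmentation Rules syntax, and the function name split_on_boundary and the sample pattern are assumptions made for this example.

```python
import re


def split_on_boundary(text: str, boundary_pattern: str) -> list:
    """Split text into segments, each starting where boundary_pattern matches at a line start."""
    boundary = re.compile(boundary_pattern, re.MULTILINE)
    starts = [match.start() for match in boundary.finditer(text)]

    # Keep any text that precedes the first boundary as its own segment.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)

    ends = starts[1:] + [len(text)]
    # Drop segments that contain only whitespace.
    return [text[s:e] for s, e in zip(starts, ends) if text[s:e].strip()]


if __name__ == "__main__":
    log = "From a@example.com\nhello\nFrom b@example.com\nbye\n"
    for number, segment in enumerate(split_on_boundary(log, r"^From "), start=1):
        print(number, repr(segment))
```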

Image Formats

dtSearch products can extract and display embedded images in these document formats:  Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), Access (*.mdb/*.accdb), RTF, email files including Thunderbird (mbox/*.eml) and Outlook (*.pst/*.msg) files, and OneNote 2007 through OneNote 2016 (*.one).  Images are displayed using the HTML <img> tag and are not converted, so only images in formats that a browser can display, such as *.jpg and *.png, will appear.

Additionally, dtSearch can display HTML and PDF files with hits highlighted, including embedded images.

Embedded Object and Attachment Extraction

Embedded objects and attachments are indexed as part of the document that contains them.  For example, a spreadsheet object embedded in a PowerPoint presentation would be treated as part of the PowerPoint presentation.  

For applications that require direct access to embedded objects and attachments, the ExtractionOptions API provides a way to extract embedded objects and attachments from a document into a folder tree.  For information on this API, see ExtractionOptions (.NET and Java) or dtsExtractionOptions (C++).  Extraction of embedded objects and attachments is supported in these formats:  attachments in MIME emails (mbox/*.eml), Outlook messages (*.pst/*.msg), Outlook Express (*.dbx), TNEF (winmail.dat), PDF, Access (*.mdb/*.accdb), OneNote 2007 through OneNote 2016 (*.one); objects in Word 97 and later (*.doc/*.docx), PowerPoint 97 and later (*.ppt/*.pptx), Excel 97 and later (*.xls/*.xlsx), and RTF.