Troubleshooting -- HTML files are indexed as text

Article: dts0191

Symptoms

- HTML tags are visible in dtSearch Desktop or dtSearch Web when opening a file

- Variable names in scripts, comments, or URLs are searchable after indexing HTML files

- META tags and other fields in an HTML file are not searchable

Troubleshooting Steps

(1) Check that dtSearch is interpreting the file as HTML

If HTML tags are visible in the file when you open it in dtSearch Desktop or dtSearch Web, then dtSearch is interpreting the file as plain text.  dtSearch detects HTML files by checking for standard HTML headers like <HTML> or <!DOCTYPE html> as well as common HTML tags.  If none of these are present, dtSearch may conclude that a file is plain text, in which case tags will be indexed like other text.
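To find files that are likely to be treated as plain text before indexing them, a quick pre-check along these lines may help. This is only a rough sketch: the marker list and the looks_like_html name are assumptions made for illustration, and dtSearch's actual detection rules may differ.

```python
import re

# Assumed markers for illustration; dtSearch's real detection rules may differ.
HTML_MARKERS = re.compile(
    r"<!DOCTYPE\s+html|<html[\s>]|<head[\s>]|<body[\s>]|<title[\s>]",
    re.IGNORECASE,
)


def looks_like_html(text: str, probe_chars: int = 4096) -> bool:
    """Return True if the start of the text contains a common HTML marker."""
    return HTML_MARKERS.search(text[:probe_chars]) is not None
```

Files for which a check like this returns False (for example, an .asp fragment that contains only server-side script and a few inline tags) are candidates for a file type rule as described in the steps that follow.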

To make dtSearch Desktop treat a group of files as HTML even though dtSearch detects them as plain text:

(1) Open dtSearch Desktop and click Options > Preferences > File Types

(2) Click New... to create a new rule and give the rule a name (the name does not matter)

(3) Under File type, select HTML

(4) Under Filename filters, enter filter expressions like *.asp to identify the files that are to be covered by the rule.

(5) Check the box labelled Override all other file type detection methods for these files (this will make dtSearch treat a file as HTML even if the HTML file parser does not recognize it as HTML based on the header).

(6) Click OK
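Before saving a rule with the override box checked, it can be worth previewing which files a filter expression such as *.asp would cover, since every matching file will be treated as HTML. The sketch below assumes the filters behave like ordinary case-insensitive shell wildcards, which may not match dtSearch's filter syntax exactly. The files_matching helper is hypothetical.

```python
from fnmatch import fnmatch


def files_matching(filenames, filters):
    """Return the names that match at least one wildcard filter, ignoring case."""
    return [
        name
        for name in filenames
        # Lowercase both sides so the comparison is case-insensitive on every platform.
        if any(fnmatch(name.lower(), pattern.lower()) for pattern in filters)
    ]
```

For example, files_matching(["default.asp", "Login.ASP", "notes.txt"], ["*.asp"]) selects the two .asp files and leaves notes.txt to normal file type detection.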

The File Types dialog box can also be used to make dtSearch index text files as XML, or to make dtSearch index HTML or XML files as plain text.  

To use these settings in the dtSearch Engine API, set the FileTypeTableFile property of the Options object to the filetype.xml file that dtSearch Desktop creates when you follow the procedure described above.

(2) Check the option settings used to index HTML files

Even if a file is indexed as HTML, content outside the visible page text, such as scripts, styles, links, comments, or the filename, may be searchable because of option settings.  To change these option settings, start dtSearch Desktop and click Options > Preferences > Indexing Options.

To prevent words in the filename or directory name from being searchable, un-check the Index filenames as text box.

To prevent words in scripts, styles, links, and comments from being searchable, un-check the Index HTML scripts, styles, links and comments box.

To prevent META tags from being searchable, un-check the Index document properties box.  (This will also affect Microsoft Office Document Summary Information fields.)
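When deciding which of these boxes to un-check, it can help to see which words in a sample file come from scripts, styles, comments, or META tags rather than from the visible text. The following is a minimal sketch that uses Python's standard html.parser module. It is not how dtSearch parses HTML, it does not cover link targets, and the ContentSorter and sort_content names are made up for illustration.

```python
from html.parser import HTMLParser


class ContentSorter(HTMLParser):
    """Sort an HTML document's content into visible text, script/style, comments, and META values."""

    def __init__(self):
        super().__init__()
        self.buckets = {"text": [], "script_style": [], "comment": [], "meta": []}
        self._inside = None  # name of the currently open <script> or <style> tag, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._inside = tag
        elif tag == "meta":
            attr = dict(attrs)
            # Keep only name/content pairs, such as <meta name="author" content="...">.
            if "name" in attr and attr.get("content"):
                self.buckets["meta"].append((attr["name"], attr["content"]))

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if data.strip():
            key = "script_style" if self._inside else "text"
            self.buckets[key].append(data.strip())

    def handle_comment(self, data):
        self.buckets["comment"].append(data.strip())


def sort_content(html):
    """Parse the HTML string and return the four content buckets."""
    sorter = ContentSorter()
    sorter.feed(html)
    sorter.close()
    return sorter.buckets
```

Words that appear only in the script_style, comment, or meta buckets are the ones affected by the Index HTML scripts, styles, links and comments box and the Index document properties box.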