How to use dtSearch Web with dynamically-generated content

Article: dts0180

Applies to: dtSearch Web

To use dtSearch Web to search a web site, first create an index of the web site with the dtSearch Indexer.  If the web site consists of static web pages (HTML files, PDF files, etc.) located on the same computer as dtSearch Web, click Add folder in the dtSearch Indexer to add the folders with the web pages to the index.  If the web site is dynamically generated or located on another server, there will be no local folder with web pages to index.  Instead, you can use the dtSearch Spider to crawl the web site.  

To index your web site with the dtSearch Spider, click Add Web in the dtSearch Indexer and provide the starting address for the crawl (usually your site's home page).  The dtSearch Indexer will traverse the web site by following the links connecting the pages. Because the Spider follows the same links that a web browser would use to navigate your site, it will be able to index the dynamically-generated content just as it is presented on your web site.

For programmers, the dtSearch Text Retrieval Engine includes a .NET API for the Spider.  For API documentation, see the dtSearchNetApi2.chm help file.

Highlighting hits

To ensure that you can highlight hits in documents retrieved from a dynamically-generated site, create the index with the "Cache documents" and "Cache document text" options enabled.  These options are set in the Index > Create (Advanced) dialog box.  When content is cached in the index, dtSearch and dtSearch Web can highlight hits from the cached data, without needing to download the pages again from the site, which makes hit highlighting faster and more reliable.  For more information on this option, see: "Caching Documents and Text in an Index" in the dtSearch help file.

The option that controls whether hits are highlighted in content indexed with the Spider is in dtSearch Web Setup's Form Builder dialog box, on the Search Results tab.  You can also change this setting after a search form is created by editing the dtSearch_options.html file, where it appears as this item:

<BR><HR><I>Highlight documents indexed via HTTP: </I>

<!-- $Begin HighlightHttpDocs -->

1

<!-- $End -->


If the option is set to 0 (off), dtSearch Web will return direct links to any pages indexed by the Spider, so each page is displayed just as it normally appears.  If the option is set to 1 (on), dtSearch Web will request the page itself, insert hit highlight markings, and then display the page with the hits highlighted.

Excluding sections of web pages

Pages generated by a content management system often contain sections of HTML that you would not want indexed, such as tables of contents and navigation menus.  To tell dtSearch not to index part of an HTML file, add HTML comments around the text to be excluded, like this:

<!--BeginNoIndex-->

... nothing here will be searchable...

<!--EndNoIndex-->

The BeginNoIndex and EndNoIndex markers must appear exactly as shown in this example.  dtSearch will skip everything between the two markers when it indexes web pages.
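On a dynamically generated site, the markers can be emitted by the page template rather than typed into static files.  A minimal sketch in Python (the helper name and sample markup are hypothetical; only the marker comments come from dtSearch):

```python
def no_index(html_fragment):
    """Wrap an HTML fragment in dtSearch's no-index markers so the
    text inside is skipped when the page is indexed."""
    # The marker comments must appear exactly like this.
    return "<!--BeginNoIndex-->" + html_fragment + "<!--EndNoIndex-->"

# Hypothetical navigation menu that should not be searchable:
nav = '<ul class="nav"><li><a href="/">Home</a></li></ul>'

# Assemble a page: the menu is excluded, the body text is indexed.
page = ("<html><body>"
        + no_index(nav)
        + "<p>Searchable body text.</p>"
        + "</body></html>")
print(page)
```

Everything outside the markers, including the paragraph text in this sketch, remains searchable.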

Excluding sections of web sites

In the dtSearch Indexer, you can use filename filters and exclude filters to limit indexing by filename or folder name.   For example, you could use a filter of */OnlyThisFolder/* to limit indexing to documents in a folder named OnlyThisFolder, or you could use an exclude filter of */NotThisFolder/* to prevent anything in the folder named NotThisFolder (or subfolders) from being indexed.  For more information on filename filters, see: How to exclude folders from an index.
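Filename filters use shell-style wildcards, where * matches any run of characters.  Python's fnmatch module follows similar wildcard rules, so it can serve as a rough way to sanity-check a pattern before building the index (an illustration only, not dtSearch's actual matching code; the sample paths are hypothetical):

```python
from fnmatch import fnmatchcase

# The filters described above.
include_filter = "*/OnlyThisFolder/*"
exclude_filter = "*/NotThisFolder/*"

# A document inside OnlyThisFolder matches the filename filter.
print(fnmatchcase("c:/site/OnlyThisFolder/page.html", include_filter))

# A document in a subfolder of NotThisFolder matches the exclude filter,
# because * also matches across the intervening folder name.
print(fnmatchcase("c:/site/NotThisFolder/sub/page.html", exclude_filter))

# A document elsewhere matches neither pattern.
print(fnmatchcase("c:/site/Other/page.html", include_filter))
```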

Additionally, the dtSearch Spider honors robots.txt files and robots META tags, so you can use either mechanism to specify whether pages should be indexed, and whether the Spider should follow their links when crawling the site.  For more information on the robots.txt and robots META tag standards, see:

https://www.robotstxt.org
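The robots.txt rules the Spider checks follow the standard robots exclusion format.  Python's urllib.robotparser implements the same standard, so it can illustrate how a rule set is interpreted (the host name and paths below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking all crawlers from /NotThisFolder/.
rules = """\
User-agent: *
Disallow: /NotThisFolder/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler may not fetch pages under the disallowed path...
blocked = rp.can_fetch("*", "https://example.com/NotThisFolder/page.html")

# ...but the rest of the site remains crawlable.
allowed = rp.can_fetch("*", "https://example.com/index.html")

print(blocked, allowed)
```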

Troubleshooting - Passwords

Click Options > Preferences > Spider Passwords to set up user names and passwords for a site that requires logging in.  Some sites, such as SharePoint, use types of authentication that will not work with the dtSearch Spider.  For example, a site may embed a unique code in each login form that must be returned with that form, which makes it impossible to log in automatically.  For more information, see Troubleshooting -- Spider forms authentication problems.

Related Topics

For more information on creating indexes, see "dtSearch Quick Start".  

For more information on setting up dtSearch Web, see "dtSearch Web Quick Start".

For more information on using the dtSearch Spider to index web sites, see "How to index a web site with the dtSearch Spider".

For other ways to index SharePoint content, see "How to index SharePoint sites".