How to index a web site with the dtSearch Spider

Article: dts0102

dtSearch Spider

dtSearch includes a built-in web spider for indexing and searching internal or publicly accessible Web sites. The dtSearch Spider automatically recognizes and supports HTML, PDF, and XML, as well as other online text documents such as word processor files and spreadsheets. dtSearch displays the Web pages and documents that the Spider finds with highlighted hits and, for HTML and PDF, with links and images intact. The Spider can index static as well as dynamically generated pages.

For developers, the dtSearch Text Retrieval Engine includes a .NET API for the Spider. For API documentation, see the dtSearchNetApi2.chm help file.

Indexing and Searching with the Spider

To index a web site in dtSearch, select "Add web" in the Update Index dialog box. Enter the name of the Web site, for example, www.example.com. Then select the crawl depth, which is the number of levels of links into the web site that dtSearch will follow when looking for pages. For example, a crawl depth of 1 on www.example.com reaches only pages on the site that are linked directly from the home page, while a crawl depth of 4 reaches four levels deep into the site.
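
The effect of crawl depth can be pictured as a breadth-first walk over a site's links. The following Python sketch is only an illustration, not the dtSearch Spider's implementation: the link graph, the page names, and the function pages_within_depth are hypothetical, and it assumes the start page counts as depth 0.

```python
from collections import deque

# Hypothetical link graph for illustration: page -> pages it links to.
LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/about"],
    "/products/a": ["/products/a/specs"],
    "/about": [],
    "/products/a/specs": [],
}


def pages_within_depth(start, max_depth, links=LINKS):
    """Return {page: depth} for every page reachable within max_depth link hops."""
    depth_of = {start: 0}  # assumption: the start page is depth 0
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depth_of:  # visit each page only once
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, sorted(pages_within_depth("/", depth)))
```

In this toy graph, a depth of 1 reaches only the home page and the two pages it links to, while a depth of 4 is enough to reach every page.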

The dtSearch Spider is a “polite” spider and will comply with exclusions specified in a web site's robots.txt file, if present.
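
As an illustration of what complying with robots.txt means, the Python sketch below uses the standard library's urllib.robotparser to decide whether a URL may be crawled. It is not the dtSearch Spider's own code. The robots.txt content, the user agent name "ExampleSpider", and the helper may_crawl are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

# Parse the rules from text instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="ExampleSpider"):
    """Return True if the parsed robots.txt rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/report.html"):
        print(url, may_crawl(url))
```

A polite spider would run a check like this before requesting each page and skip any URL for which the check returns False.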

For more information on web site indexing options, see "Using the Spider to Index Web Sites".

After a search, dtSearch will display retrieved HTML or PDF files with hit highlighting and with all links and images intact. The result looks and acts just like the original web page, but with highlighted hits and additional navigation options ("next hit," "previous document," "next document," etc.). dtSearch uses built-in HTML file converters to convert other text formats, such as word processor and spreadsheet files, to HTML for display with highlighted hits.

Troubleshooting -- Hit Highlighting is incorrect

By default, the dtSearch Spider does not "capture" an indexed Web site. To display a file indexed with the dtSearch Spider, dtSearch returns to the Web site to access the document. If the Web site has changed since it was indexed, hit highlighting may appear on the wrong words. To ensure that highlighting is correct, you can use the caching feature in dtSearch to have dtSearch store the web pages as they are indexed, so that hit highlighting is done using the stored data.
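
The Python sketch below illustrates why a changed page can shift highlighting. It assumes, purely for illustration, that hits are recorded as word positions at index time; the source does not describe how dtSearch stores hits internally, and the functions index_hits and highlight are hypothetical.

```python
def index_hits(text, term):
    """Return 0-based word positions where term occurs, as recorded at index time."""
    return [i for i, word in enumerate(text.split())
            if word.strip(".,:").lower() == term.lower()]


def highlight(text, positions):
    """Mark the words at the given positions with square brackets."""
    return " ".join(f"[{word}]" if i in positions else word
                    for i, word in enumerate(text.split()))


# The page as it was when indexed, and the hit positions stored in the index.
indexed_page = "The spider indexes every page on the site."
hits = index_hits(indexed_page, "spider")

# The live page gained a word at the front after indexing.
live_page = "Note: The spider indexes every page on the site."

if __name__ == "__main__":
    print(highlight(indexed_page, hits))  # stored copy: the hit lines up
    print(highlight(live_page, hits))     # live copy: the hit is off by one word
```

Applying the stored positions to the stored copy of the page marks "spider", while applying them to the changed live page marks the word before it. Highlighting from a cached copy avoids that mismatch.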

Troubleshooting -- Passwords

Click Options > Preferences > Spider Passwords to set up user names and passwords for a site that requires logging in. Some sites, such as SharePoint sites, use types of authentication that will not work with the dtSearch Spider. For example, the site may include a unique code on each login form that must be returned with that login form, which makes it impossible to log in automatically. For more information, see "Troubleshooting -- Spider forms authentication problems".
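
The toy Python model below shows the kind of login form described above, where a stored user name and password are not enough on their own. It is a hypothetical simulation, not code from dtSearch or from any real site; the class LoginServer, its methods, and the credentials are invented for the example.

```python
import secrets


class LoginServer:
    """Toy model of a site whose login form carries a one-time code."""

    def __init__(self, username, password):
        self._credentials = (username, password)
        self._issued = set()  # codes handed out with login forms, not yet used

    def get_login_form(self):
        """Serve a login form that embeds a fresh, unique code."""
        token = secrets.token_hex(8)
        self._issued.add(token)
        return {"token": token}

    def submit(self, username, password, token=None):
        """Accept the login only if a valid, unused code comes back with it."""
        if token not in self._issued:
            return False
        self._issued.discard(token)  # each code is valid only once
        return (username, password) == self._credentials


if __name__ == "__main__":
    server = LoginServer("alice", "s3cret")
    # Replaying a stored user name and password without the form's code fails.
    print(server.submit("alice", "s3cret"))
    # A browser that loads the form and returns its code is accepted.
    form = server.get_login_form()
    print(server.submit("alice", "s3cret", form["token"]))
```

In this model, correct credentials are rejected unless they arrive with the code from a freshly served form, which is why a fixed, stored user name and password cannot complete such a login.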

Related Topics

For more information on creating indexes, see "dtSearch Quick Start".  

For more information on setting up dtSearch Web, see "dtSearch Web Quick Start".

For more information on using the dtSearch Spider to index dynamically-generated content, see "How to use dtSearch Web with dynamically-generated content".