Using the Spider to Index Web Sites

To index a web site with dtSearch, click Add web in the Update Index dialog box.  You can do this multiple times to add any number of web sites to an index. To modify a web site in the Update Index dialog box, right-click the name in the What to index list and select Modify web site.

When indexing using the Spider, it is usually a good idea to enable caching of documents and text in the index, so dtSearch can highlight hits from the cached data.  This ensures that you can search and browse results even if you cannot access the site.

To index an entire web site using a sitemap, enter the URL of the sitemap as the start page.  Example:  https://www.example.com/sitemap.xml.  dtSearch supports XML and compressed (.gz) sitemaps.  For more information on sitemaps, see https://www.sitemaps.org.
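
A sitemap is an XML file listing the URLs to be crawled.  For example, a minimal sitemap covering two pages (the page URLs here are placeholders) might look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.com/</loc></url>
      <url><loc>https://www.example.com/products.html</loc></url>
    </urlset>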

If you have problems accessing the site using the Spider, try changing the "User agent identification" to Internet Explorer.  Some sites vary their appearance based on the user's browser, and if the site does not recognize the user agent name, it may return incorrect pages or fail to respond.

Limiting the Spider

To limit the Spider to particular areas of the site, use the Filename filters and Exclude filters in the Update Index dialog box.  A filter with a / will be matched against the complete URL, so a filter of */OnlyThisOne/* would limit the indexer to documents under the "OnlyThisOne" folder.  The Spider will also obey any instructions in a robots.txt file on the web site or in a robots meta tag.  For more information on robots.txt and robots meta tags, see https://www.robotstxt.org.
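
For example, a robots.txt file like the following would tell all crawlers, including the Spider, to skip everything under the site's /private/ folder:

    User-agent: *
    Disallow: /private/

To exclude a single page instead, that page can include a robots meta tag such as <meta name="robots" content="noindex">.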

Starting page for web site
This is the first page dtSearch will request from the site to start the crawl. Usually this will be the home page of the web site.

Crawl depth
The crawl depth is the number of levels into the web site dtSearch will reach when looking for pages.  When dtSearch indexes a web site, it starts from the page you specify, indexes that page, and then looks for links from that page to other pages on the site.  For each of those pages, it looks for links to still more pages.  With a crawl depth of 0, dtSearch would index only the starting page.  With a crawl depth of 1, dtSearch would index the starting page and any pages directly linked from it.
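
Example:  suppose the starting page links to page A, and page A links to page B.  With a crawl depth of 1, dtSearch would index the starting page and page A; a crawl depth of 2 or more would also reach page B.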

Authentication settings and Passwords...
If the site requires authentication, click Passwords... to set up a username and password.

Allow the Spider to access web servers other than the starting server
By default, the Spider will not follow links to servers other than the starting server. For example, if the start page for the crawl is www.dtsearch.com, the Spider will not follow links to support.dtsearch.com. To enable the Spider to follow links to other servers, check this box and list the other servers to include. You can use wildcards to specify the server names to match. For example, *.dtsearch.com would match www.dtsearch.com, support.dtsearch.com, and download.dtsearch.com.  

Stop crawl after __ files
Use this setting to limit the number of pages the Spider should index on a web site.

Stop crawl after __ minutes
Use this setting to limit the amount of time the Spider will spend crawling pages on a web site.

Skip files larger than __ kilobytes
Use this setting to limit the maximum size of files that the Spider will attempt to access.

Time to pause between page downloads
Requiring the Spider to pause between page downloads can reduce the effect of indexing on the web server.

User agent identification
Some web sites behave differently depending on the web browser being used to access them.  For these sites, you can use the User agent identification setting to specify a user agent name (for example, Internet Explorer) for the Spider to use, so the Spider will index the same view of the web site that users see with a web browser.
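
When identifying as a browser, the Spider sends a browser-style user agent string with each request.  For example, Internet Explorer 11 identifies itself with a string along these lines:

    Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko

The exact string the Spider sends depends on this setting; what matters is that the site sees a familiar browser name rather than an unrecognized crawler.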