To index a web site with dtSearch, click Add Web in the Update Index dialog box. You can do this multiple times to add any number of web sites to an index. To modify a web site in the Update Index dialog box, right-click the name in the What to index list and select Modify web site.
Tips
1. When indexing using the Spider, it is usually a good idea to enable caching of documents and text in the index, so dtSearch can highlight hits from the cached data. This ensures that you can search and browse results even if you cannot access the site.
2. If you have problems accessing the site using the Spider, try changing the "User agent identification" to Internet Explorer 6. Some sites vary their appearance based on the user's browser, and if the site does not recognize the user agent name, it may return incorrect pages or fail to respond.
Limiting the Spider
To limit the Spider to particular areas of the site, use the Filename Filters and Exclude Filters in the Update Index dialog box. A filter with a / will be matched against the complete URL, so a filter of */OnlyThisOne/* would limit the indexer to documents under the "OnlyThisOne" folder. The Spider will also obey any instructions in a robots.txt file on the web site or in a robots meta tag. For more information on robots.txt and robots meta tags, see http://www.robotstxt.org.
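The filter rule above can be sketched in Python. This is only an illustration, not dtSearch's own matching code. The function names are hypothetical, the matching is case-sensitive for simplicity, and the handling of filters without a / (compared with the file name only) is an assumption inferred from the description.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
import posixpath


def filter_matches(url, pattern):
    """Return True if a wildcard filter matches the URL.

    Assumption: a pattern containing "/" is compared with the complete URL;
    any other pattern is compared with the last path segment only.
    """
    if "/" in pattern:
        return fnmatchcase(url, pattern)
    filename = posixpath.basename(urlsplit(url).path)
    return fnmatchcase(filename, pattern)


def should_index(url, filename_filters, exclude_filters):
    """Apply include filters first, then exclude filters (hypothetical helper)."""
    # An empty include list is treated here as "include everything".
    included = not filename_filters or any(
        filter_matches(url, p) for p in filename_filters
    )
    excluded = any(filter_matches(url, p) for p in exclude_filters)
    return included and not excluded


if __name__ == "__main__":
    url = "http://www.example.com/OnlyThisOne/page.html"
    print(should_index(url, ["*/OnlyThisOne/*"], []))
```

In this sketch, a URL outside the "OnlyThisOne" folder fails the include filter, and a URL inside it can still be dropped by an exclude filter such as *.pdf.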

Starting page for web site
This is the first page dtSearch will request from the site to start the
crawl. Usually this will be the home page of the web site.
Crawl depth
The crawl depth is the number of levels into the web site dtSearch will reach
when looking for pages. When dtSearch indexes a web site, it starts
from the page you specify, indexes that page, and then looks for links
from that page to other pages on the site. For each of those pages,
it looks for links to still more pages. With a crawl depth of zero,
dtSearch would index only the starting page. With a crawl
depth of 1, dtSearch would index the starting page and the pages that
are directly linked from it.
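The effect of the crawl depth can be sketched as a breadth-first walk over the links of a site. The sketch below is not the Spider's implementation. It uses a small made-up link table instead of real pages, and it assumes the starting page itself is always indexed.

```python
from collections import deque


def pages_within_depth(links, start, max_depth):
    """Return pages reachable from start in at most max_depth link hops.

    links maps each page to the pages it links to (hypothetical sample data).
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the crawl depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Made-up site used only for illustration.
SITE = {
    "/index.html": ["/products.html", "/support.html"],
    "/products.html": ["/products/spider.html"],
    "/support.html": ["/index.html"],
}

if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, pages_within_depth(SITE, "/index.html", depth))
```

With this sample data, depth 0 yields only /index.html, depth 1 adds the two pages it links to, and depth 2 adds /products/spider.html.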
Authentication settings and Passwords...
If the site requires authentication, click Passwords... to set up a
username and password.
Allow Spider to access servers other than the starting server
By default, the Spider will not follow links to servers other than the
starting server. For example, if the start page for the crawl is
www.dtsearch.com, the Spider will not follow links to support.dtsearch.com.
To enable the Spider to follow links to other servers, check this box and
list the other servers to include. You can use wildcards to specify the
server names to match. For example, *.dtsearch.com would match
www.dtsearch.com, support.dtsearch.com, and download.dtsearch.com.
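As a rough illustration of the server check, the following Python sketch compares the host name of each link with the starting server and with a list of wildcard patterns. The function name and the use of Python's fnmatch wildcards are assumptions; dtSearch's own wildcard rules may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def server_allowed(url, start_url, extra_servers=()):
    """Return True if the link's server is the starting server or matches
    one of the additional wildcard server names (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    start_host = (urlsplit(start_url).hostname or "").lower()
    if host == start_host:
        return True
    return any(fnmatchcase(host, pattern.lower()) for pattern in extra_servers)


if __name__ == "__main__":
    start = "http://www.dtsearch.com/"
    link = "http://support.dtsearch.com/faq.html"
    print(server_allowed(link, start))                      # default: other servers are skipped
    print(server_allowed(link, start, ["*.dtsearch.com"]))  # allowed by the wildcard
```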
Stop crawl after __ files
Use this setting to limit the number of pages the Spider should index on
this web site.
Stop crawl after __ minutes
Use this setting to limit the amount of time the Spider will spend crawling
pages on this web site.
Skip files larger than __ kilobytes
Use this setting to limit the maximum size of files that the Spider will
attempt to access.
Time to pause between page downloads
Requiring the Spider to pause between page downloads can reduce the load
that indexing places on the web server.
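The four limits above (number of files, minutes, file size, and pause time) can be pictured together in one crawl loop. This is a simplified sketch with hypothetical names, not the Spider's logic. It takes a prepared list of pages instead of downloading anything, and it assumes that skipped files do not count toward the file limit.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlLimits:
    max_files: Optional[int] = None      # "Stop crawl after __ files"
    max_minutes: Optional[float] = None  # "Stop crawl after __ minutes"
    max_kilobytes: Optional[int] = None  # "Skip files larger than __ kilobytes"
    pause_seconds: float = 0.0           # pause between page downloads


def crawl_with_limits(pages, limits, clock=time.monotonic, sleep=time.sleep):
    """pages is an iterable of (url, size_in_bytes) pairs (sample data).

    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    started = clock()
    indexed = []
    for url, size_bytes in pages:
        if limits.max_files is not None and len(indexed) >= limits.max_files:
            break  # file limit reached
        if limits.max_minutes is not None and clock() - started >= limits.max_minutes * 60:
            break  # time limit reached
        if limits.max_kilobytes is not None and size_bytes > limits.max_kilobytes * 1024:
            continue  # too large: skip this file, keep crawling
        if indexed and limits.pause_seconds > 0:
            sleep(limits.pause_seconds)  # wait before the next download
        indexed.append(url)
    return indexed


if __name__ == "__main__":
    sample = [("a.html", 1000), ("big.zip", 5_000_000), ("c.html", 2000), ("d.html", 3000)]
    print(crawl_with_limits(sample, CrawlLimits(max_files=2, max_kilobytes=1024)))
```

In the sample run, big.zip is skipped because of its size and the crawl stops once two files have been indexed.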
User agent identification
Some web sites behave differently depending on the web browser being used
to access them. For these sites, you can use the User agent identification
setting to specify a user agent name (for example, Internet Explorer 6) for
the Spider to use, so the Spider will index the same view of the web site
that users see with a web browser.
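To show what a user agent name means at the HTTP level, the following Python sketch builds a request that identifies itself as Internet Explorer 6. The header value is a commonly seen Internet Explorer 6 string and is only an assumption here; the exact text dtSearch sends for this setting is not given in this topic. The request is built but not sent.

```python
from urllib.request import Request

# A typical Internet Explorer 6 user agent string (assumed for illustration).
IE6_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def build_request(url, user_agent=IE6_USER_AGENT):
    """Create an HTTP request carrying the chosen User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("http://www.example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

A server that varies its pages by browser would see this request as coming from Internet Explorer 6 and respond accordingly.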