Troubleshooting -- Spider unable to index web site

Article: dts0215

Applies to: dtSearch Spider 

Symptom: Spider only sees start page on web site

1.  Check whether the site has a robots.txt file.  To check, enter the address of the web server followed by /robots.txt in a web browser, like this:

https://www.microsoft.com/robots.txt

A robots.txt file instructs web site indexing spiders not to visit certain portions of the site.   The dtSearch Spider will obey any instructions found in robots.txt.  

2.  Check whether the start page for the site has a robots META tag, which can instruct indexing spiders not to follow the links on a page (for example, a robots META tag whose content includes nofollow).

For more information on robots.txt and robots META tags, see: https://www.robotstxt.org

3.  Check whether the links on the web site go to different web servers.  For example, a link on the home page for www.example.com might point to support.example.com.  If the web site links to other servers that you want to index, open the "Add web site to index" dialog box, check the box labeled "Allow the spider to access web servers other than the starting server", and list the other servers to include.  You can use wildcards in the server names.  For example, *.dtsearch.com would match www.dtsearch.com, support.dtsearch.com, and download.dtsearch.com.
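
The three checks above can also be approximated outside the Spider with a short script.  The following is a minimal Python sketch, not a description of how the dtSearch Spider works internally.  It uses the Python standard library's robots.txt parser, a simple scan for a robots META tag, and shell-style wildcard matching, which is only assumed to behave like the wildcard matching in the dialog box.  The function names and the sample inputs in the usage note are made up for illustration.

```python
from fnmatch import fnmatch
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# User agent name taken from this article; how the Spider matches it
# against robots.txt records is an assumption of this sketch.
SPIDER_AGENT = "dtSearchSpider"


def allowed_by_robots(robots_text, url, agent=SPIDER_AGENT):
    """Return True if the robots.txt content permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


class _RobotsMetaScanner(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def page_blocks_link_following(html):
    """Return True if a robots META tag contains nofollow or none."""
    scanner = _RobotsMetaScanner()
    scanner.feed(html)
    return bool(scanner.directives & {"nofollow", "none"})


def host_matches(url, patterns):
    """Return True if the host name in url matches any wildcard pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatch(host, pattern.lower()) for pattern in patterns)
```

For example, host_matches("https://support.dtsearch.com/faq", ["*.dtsearch.com"]) returns True, while the same pattern does not match www.example.com.  To test a live site, download its robots.txt file and start page first and pass their contents to the first two functions.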

Symptom: Web site returns a "500 Server Error" page

By default, the dtSearch Spider identifies itself in the user agent string as "dtSearchSpider".  Some web sites return an error when the software behind the site cannot handle user agent strings other than those of common web browsers.  To index a web site that has this limitation, change the user agent string the dtSearch Spider uses to a standard browser value such as Internet Explorer 6.  To do this:

1.  In dtSearch Desktop, click Index > Update Index, and select the index you are trying to update.

2.  Right-click the web site start page in the "What to index" list and select "Modify Web Site".

3.  Change the "User agent identification" setting at the bottom of the dialog box to Internet Explorer 6.
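
Before changing the setting, it can help to confirm that the user agent string really is the cause of the error.  The following is a minimal Python sketch that requests the same URL with two different user agent strings and reports the HTTP status code for each.  "dtSearchSpider" comes from this article; the browser-style string is a typical Internet Explorer 6 value and is only an assumption about what the Spider sends when that option is selected.  The function names are hypothetical.

```python
import urllib.error
import urllib.request

# "dtSearchSpider" is the default named in this article. The second value
# is a typical Internet Explorer 6 string, assumed here for illustration.
AGENTS = {
    "spider": "dtSearchSpider",
    "ie6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
}


def build_request(url, user_agent):
    """Create a GET request that carries the given user agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this user agent."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; report their code.
        return error.code


def compare_agents(url):
    """Map each label in AGENTS to the status code returned for it."""
    return {label: status_for(url, agent) for label, agent in AGENTS.items()}
```

If compare_agents returns 500 for "spider" and 200 for "ie6" on the start page of the site, the user agent string is a likely cause and the steps above should help.  If both values return the same error, the problem probably lies elsewhere.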

Symptom: Web site returns "Access Denied" errors (401, 403)

To set up the dtSearch Spider to use a password to access secure web sites, see Spider Passwords.

See also: Troubleshooting -- Spider forms authentication problems

Symptom: Spider skips pages that should be indexed

1.  Check for a robots.txt file on the web site that is excluding these pages.

2.  Enable the option in Options > Preferences > Spider Options to "Log the links found on each page in spiderlog.txt", and review spiderlog.txt in a text editor after indexing the site.  For every page indexed, spiderlog.txt lists the links found on that page and how the Spider interpreted each link.  To determine why a page was not indexed, search spiderlog.txt for the name of a page that links to the missing page, and then check that page's list of links to see why the Spider did not follow the link.

You can also use spiderlog.txt to determine how the Spider reached a page that you did not expect to be indexed.
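
For a large site, searching spiderlog.txt by hand can be slow.  The following is a minimal Python sketch that prints every line mentioning a page name, with a few lines of surrounding context.  It makes no assumption about the layout of spiderlog.txt beyond it being a text file, so it is a plain substring search rather than a parser for the log.  The function names are hypothetical, and the text encoding of the log is assumed to be UTF-8 compatible.

```python
def find_mentions(log_text, page_name, context=2):
    """Return (line_number, snippet) pairs for lines mentioning page_name.

    Matching is a case-insensitive substring test. Each snippet holds the
    matching line plus up to `context` lines before and after it.
    """
    lines = log_text.splitlines()
    needle = page_name.lower()
    hits = []
    for index, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits


def search_log_file(path, page_name, context=2):
    """Print each mention of page_name found in the log file at path."""
    # Undecodable bytes are replaced so an unexpected encoding does not
    # stop the search.
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    for number, snippet in find_mentions(text, page_name, context):
        print(f"--- line {number} ---")
        print(snippet)
```

Calling search_log_file("spiderlog.txt", "products.html") would print each place the hypothetical page products.html appears, which shows both the pages that link to it and how those links were handled.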