Troubleshooting -- Spider forms authentication problems

Article: dts0214

Applies to: dtSearch Spider

The dtSearch Spider can index web sites that use forms authentication by capturing the form variables submitted during a login and then re-sending those captured variables to log in to the web site.

While this mechanism works for many web sites, a web site can be set up so that a captured login request cannot be re-sent. This is a useful security feature because it prevents a malicious user who intercepts your login request from saving it and using it later for their own purposes. However, it also completely blocks the dtSearch Spider from logging in.
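One common way a site blocks re-sent logins is to embed a one-time token in the login form. The sketch below is only an illustration of that idea, not a description of any particular web site or of how the Spider works internally; the class name LoginForm, the token field, and the placeholder credentials are all assumptions made for the example.

```python
import secrets


class LoginForm:
    """Toy login endpoint that accepts each form token only once (hypothetical)."""

    def __init__(self):
        self._issued = set()

    def render(self):
        # Each time the form is served, embed a fresh one-time token.
        token = secrets.token_hex(16)
        self._issued.add(token)
        return {"token": token}

    def submit(self, user, password, token):
        # Reject a token that was never issued or has already been used.
        if token not in self._issued:
            return False
        self._issued.discard(token)
        return (user, password) == ("abc", "def")  # placeholder credentials


form = LoginForm()

# Form variables as they might be captured during one interactive login.
captured = {"user": "abc", "password": "def", **form.render()}

first = form.submit(**captured)   # the original login consumes the token
replay = form.submit(**captured)  # re-sending the same variables is rejected
```

In this sketch the second submission fails even though the user name and password are correct, which is the situation a crawler that re-sends captured form variables would run into.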

If you are trying to index your own web site, the following are some ways to configure the web site so the dtSearch Spider can access it:

(1) Change the authentication form to allow authentication via the URL instead, like this:

https://www.example.com/login.aspx?user=abc&password=def

This way the start page for your crawl could embed the authentication information.

Users logging in using the web form would still have the full benefit of the secure form, but the Spider would be able to authenticate directly through the URL.
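As a rough sketch of option (1), the start URL can be assembled with the credentials percent-encoded, so that characters such as spaces or "&" in a password do not break the query string. The parameter names user and password come from the example URL above; the helper name build_start_url is hypothetical, and your login page may expect different parameter names.

```python
from urllib.parse import urlencode


def build_start_url(login_url, user, password):
    """Return a login URL with the credentials embedded as query parameters."""
    # urlencode percent-encodes reserved characters in the values.
    query = urlencode({"user": user, "password": password})
    return f"{login_url}?{query}"


start_url = build_start_url("https://www.example.com/login.aspx", "abc", "def")
print(start_url)
```

The resulting URL would then be used as the starting point for the crawl.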

(2) Index the content directly from the folders where it is stored (through the file system) rather than by crawling the site. This will work if the content consists of documents such as PDF files or static HTML pages, but not if it is dynamically generated.

(3) Change your authentication process to allow the Spider to bypass your login form in a way that does not compromise security (for example, you could allow this only if the IP address of the user matches the machine the Spider runs from).
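The IP address check mentioned in option (3) could look something like the following minimal sketch. It assumes the web application can see the address of the connecting client; the address 192.0.2.10 is a documentation placeholder standing in for the machine the Spider runs from, and the names SPIDER_ADDRESSES and may_bypass_login are hypothetical.

```python
import ipaddress

# Hypothetical allowlist: the machine(s) the Spider runs from.
SPIDER_ADDRESSES = [ipaddress.ip_network("192.0.2.10/32")]


def may_bypass_login(remote_addr):
    """Return True only if the request comes from an allowlisted address."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is never allowed through.
        return False
    return any(addr in network for network in SPIDER_ADDRESSES)


print(may_bypass_login("192.0.2.10"))
print(may_bypass_login("192.0.2.11"))
```

Requests from any other address would continue to go through the normal login form.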