Last Reviewed: February 23, 2002
Article: DTS0114
Applies to: dtSearch
Symptoms
- A search for a phrase like "first class mail" finds the search words but not as part of the phrase
- A search for a boolean expression like "apple and pear" finds "apple" even when "pear" is not in the document
- A search for a phrase like "this is a house" finds all instances of "house"
Possible Causes
(1) Natural language searching is selected instead of boolean searching
(2) The search request contains noise words
(3) Some of the words found are in hidden document properties
Natural Language Searching
In a natural language search, all "boolean" connectors like AND, OR, NOT, W/7, etc. are ignored in your search request. Also, all phrases like New York are broken up into their constituent words, so it is as if you are searching for two separate words: New and York.
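The effect on a search request can be illustrated with a minimal sketch. This is not dtSearch's parser: the connector list, the `W/n` pattern, and the function name `natural_language_terms` are assumptions made only for this illustration.

```python
import re

# Assumed connector set, for illustration only.
CONNECTORS = {"and", "or", "not"}
# Proximity connectors of the form W/7, W/25, and so on.
PROXIMITY = re.compile(r"^w/\d+$", re.IGNORECASE)


def natural_language_terms(request):
    """Reduce a request to individual words, dropping connectors and quotes."""
    tokens = re.findall(r"[A-Za-z0-9/']+", request)
    terms = []
    for token in tokens:
        lowered = token.lower()
        if lowered in CONNECTORS or PROXIMITY.match(lowered):
            continue  # connectors carry no meaning in a natural language search
        terms.append(lowered)
    return terms


# The quoted phrase is split into separate words and the connectors vanish.
print(natural_language_terms('"New York" and apple w/7 pear'))
```

Under these assumptions, the phrase and the boolean structure are both lost, which is why a document containing only some of the words can still be retrieved.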
dtSearch does this because the purpose of natural language searching is to find the collection of documents, ranked by relevance, that best matches your search request. For example, suppose you search for: Get me Tom's memo on the 1998 takeover of CorpX by Megahugecorp. If 1998 appears in 30,000 documents and Megahugecorp appears in only three, the documents containing Megahugecorp are ranked much higher, bringing you that much closer to your memo.
Technically, weighting of retrieved documents takes into account: the number of documents each word in your search request appears in (the more documents a word appears in, the less useful it is in distinguishing relevant from irrelevant documents); the number of times each word in the request appears in the documents; and the density of hits in each document. For purposes of this relevancy ranking, and hence for purposes of your search request, noise words, as well as search connectors like NOT and OR, are ignored. All words are treated as "individuals," ignoring phrases.
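The three factors above can be sketched roughly as follows. This is not dtSearch's actual formula, which this article does not give. The logarithmic rarity weight, the way density is combined, and the name `relevance_score` are all assumptions chosen only to show how such factors can interact.

```python
import math


def relevance_score(doc_words, query_terms, doc_freq, total_docs):
    """Illustrative score: rarer words and denser hits rank a document higher.

    doc_words   -- list of lowercase words in the document
    query_terms -- list of lowercase words from the request
    doc_freq    -- dict mapping each word to the number of documents containing it
    total_docs  -- number of documents in the (hypothetical) index
    """
    if not doc_words:
        return 0.0
    weighted_hits = 0.0
    hit_count = 0
    for term in set(query_terms):
        occurrences = doc_words.count(term)
        containing = doc_freq.get(term, 0)
        if occurrences == 0 or containing == 0:
            continue
        # The more documents a word appears in, the less it contributes.
        rarity = math.log(total_docs / containing)
        weighted_hits += occurrences * rarity
        hit_count += occurrences
    # Density of hits: the fraction of the document's words that are hits.
    density = hit_count / len(doc_words)
    return weighted_hits * (1.0 + density)


# Hypothetical counts echoing the example: a common word and a rare one.
doc_freq = {"1998": 30000, "megahugecorp": 3}
common = relevance_score(["memo", "1998", "takeover"], ["1998", "megahugecorp"], doc_freq, 100000)
rare = relevance_score(["memo", "megahugecorp", "takeover"], ["1998", "megahugecorp"], doc_freq, 100000)
print(rare > common)
```

With these made-up counts, the document containing the rare word scores higher than the one containing only the common word.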
Noise Words
dtSearch ignores "noise" words like "if" or "the" when searching. Therefore, if you search for a phrase like "This is a house", dtSearch will ignore "this", "is", and "a", and just search for "house". If you need to search for noise words in your documents, you can build your indexes without a noise word list. To do this, click Options > Preferences > Indexing Options in dtSearch Desktop, edit the noise word list, and either delete all of the words or delete only the ones that you want to be able to search for.
After you change the noise word list, you will need to delete and rebuild your indexes, because each index keeps its own copy of the noise word list when the index is created.
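As a minimal sketch of why the phrase collapses to a single word, the snippet below filters a phrase against a noise word list. The list shown is a tiny illustrative subset, not dtSearch's default list, and `searchable_words` is a hypothetical helper.

```python
# Illustrative subset only; the real noise word list is longer and editable.
NOISE_WORDS = {"this", "is", "a", "if", "the"}


def searchable_words(phrase, noise_words=NOISE_WORDS):
    """Return the words of a phrase that are not on the noise word list."""
    return [word for word in phrase.lower().split() if word not in noise_words]


# With the noise word list in place, only "house" is left to search for.
print(searchable_words("This is a house"))

# With an empty noise word list, every word of the phrase remains searchable.
print(searchable_words("This is a house", noise_words=set()))
```

The second call corresponds to an index built without a noise word list, where the whole phrase can be matched.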
Hidden Document Properties
Most documents contain a "Summary Information" or "Properties" area that is not normally displayed but is still searchable. These fields may contain some of the words matched by your search. To find the document properties, open the document in the application that created it, and look in the File menu for a Document Properties or Document Summary Information item. For example, to find the properties of an Adobe Acrobat (PDF) file, click File > Document Properties > Summary in Adobe Reader.
You can usually see the document properties in dtSearch Desktop as well. This chart explains how to find document properties in commonly used file types:
| File Type | How to find Document Properties |
|---|---|
| PDF | Open the PDF file in dtSearch or in Adobe Reader, and press Ctrl+D |
| DOC, XLS, PPT, WPD | dtSearch will list the document properties at the end of the document text. In Word, Excel, PowerPoint, or WordPerfect, click File > Document Summary Information. |
| HTML | Open the HTML file in dtSearch or in your web browser, right-click in the document window and select "View Source". Document properties will appear in "META" tags, like this: <META name="Keywords" content="Apple Pear Grape Banana"> |
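
If you have many HTML files to check, the "META" tags can also be listed with a short script rather than by viewing the source of each file. The sketch below uses Python's standard `html.parser` module. It is independent of dtSearch, and the names `MetaCollector` and `meta_properties` are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from META tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.properties[name] = content


def meta_properties(html_text):
    """Return a dict of META name -> content found in html_text."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.properties


sample = '<html><head><META name="Keywords" content="Apple Pear Grape Banana"></head><body>Fruit</body></html>'
print(meta_properties(sample))
```

Any search word that shows up in this output but not in the visible page text was matched in a hidden document property.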