Last Reviewed: July 22, 2005

Article: DTS0184

Applies to: dtSearch Engine 6.1 and later

Contents

How to sort search results using the dtSearch Engine

How to sort search results using dtSearch Web

Summary of Search Flags

Relevance

Sorting by a Combination of Values

Sorting vs. Selection

How to sort search results using the dtSearch Engine

After a search, results are returned sorted by relevance. You can use the flags below to re-sort search results by other criteria. In Visual Basic, use the Sort() method in SearchResults. In C++, use the sort() function in dtsSearchResults or DSearchResults.

The sort function has two arguments: flags and field. The field argument is a string that is used only when flags includes dtsSortByField. A field used for sorting must be designated as a stored field during indexing so that its value will be available for sorting.

Another way to sort by a custom sort key is to use the SearchResults.SetSortKey() method to assign a sort key to each item in the search results list.  After each key is assigned, sort using the dtsSortBySortKey flag.

How to sort search results using dtSearch Web

In dtSearch Web, the sort criterion is passed through the "sort" form variable.  The pageSize variable determines how many items are returned in each page, and the maxFiles variable determines how many items are retrieved in total.   

If sort is not "size", "name", "date", or "hits", then dtSearch Web will assume that the sort key is a stored field and will use the dtsSortByField search flag.  

The sort form variable can be followed by a colon and a numerical value that will be combined with the sort type. Example: "subject:0x210002". The numerical value is a combination of any of the sort flags described below, expressed as a hexadecimal integer.
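To make the hexadecimal value concrete, here is a minimal sketch in plain Python (not dtSearch code) that decodes a sort value such as "subject:0x210002" into flag names. The flag values are copied from the flag table in this article; the helper name parse_sort_variable is hypothetical and only models the "name:number" format described above.

```python
# Sort flag values copied from the flag table in this article.
SORT_FLAGS = {
    "dtsSortAscending": 0x2,
    "dtsSortByName": 0x4,
    "dtsSortByDate": 0x8,
    "dtsSortByHits": 0x10,
    "dtsSortBySize": 0x20,
    "dtsSortByField": 0x40,
    "dtsSortByIndex": 0x80,
    "dtsSortByType": 0x100,
    "dtsSortByTitle": 0x200,
    "dtsSortByLocation": 0x400,
    "dtsSortByTime": 0x800,
    "dtsSortBySortKey": 0x1000,
    "dtsSortCaseInsensitive": 0x010000,
    "dtsSortNumeric": 0x20000,
    "dtsSortPdfUseTitleAsName": 0x040000,
    "dtsSortHtmlUseTitleAsName": 0x080000,
    "dtsSortFloatNumeric": 0x100000,
    "dtsSortCleanText": 0x200000,
    "dtsSortByHitCount": 0x400000,
    "dtsSortByRelevanceScore": 0x800000,
}


def parse_sort_variable(value):
    """Hypothetical helper: split 'name:0xNNNN' into (name, flag names)."""
    name, sep, number = value.partition(":")
    flags = int(number, 16) if sep else 0
    # Every flag is a single bit, so a bitwise AND tests whether it is set.
    names = [flag for flag, bit in SORT_FLAGS.items() if flags & bit]
    return name, names


name, names = parse_sort_variable("subject:0x210002")
print(name, names)
```

In this example the value 0x210002 breaks down into dtsSortCleanText (0x200000), dtsSortCaseInsensitive (0x010000), and dtsSortAscending (0x2).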

hits

In a search that is sorted by hits, dtSearch Web will return up to maxFiles of the most relevant documents, organized into pages each with pageSize documents.  If pageSize is not specified in the search form, the maxFiles value will be used as the page size.

date

Sorting by date works like sorting by hits, except that the most recent documents are returned instead of the most relevant.

size, name, and custom fields

When sorting by criteria other than hits or date, dtSearch will return up to maxFiles of the most relevant files, organized into pages each with pageSize documents, with the entire results list sorted by the specified criteria.  For example, if the sort criterion is "size", pageSize is 10, and maxFiles is 100, dtSearch will find the 100 most relevant files (not the 100 largest), and will display them in pages of 10 documents, sorted by size.
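The following sketch models this behavior in plain Python, using a small made-up document list rather than real search results. The function name select_then_sort and the "score" and "size" keys are assumptions for illustration; the point is only that selection by relevance happens before sorting by size.

```python
def select_then_sort(docs, max_files, page_size, sort_key):
    """Model: pick the max_files most relevant docs, then sort and page them."""
    # Selection: keep only the most relevant documents.
    selected = sorted(docs, key=lambda d: d["score"], reverse=True)[:max_files]
    # Sorting: order the selected documents by the requested criterion
    # (descending, which is the default when dtsSortAscending is not set).
    ordered = sorted(selected, key=sort_key, reverse=True)
    # Paging: cut the sorted list into pages of page_size items.
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


docs = [
    {"name": "a.doc", "score": 90, "size": 10},
    {"name": "b.doc", "score": 80, "size": 500},
    {"name": "c.doc", "score": 70, "size": 50},
    {"name": "d.doc", "score": 60, "size": 5},
    {"name": "e.doc", "score": 10, "size": 9000},  # largest, but least relevant
    {"name": "f.doc", "score": 5, "size": 8000},
]

pages = select_then_sort(docs, max_files=4, page_size=2,
                         sort_key=lambda d: d["size"])
for number, page in enumerate(pages, start=1):
    print(number, [d["name"] for d in page])
```

The two largest files, e.doc and f.doc, never appear because they are not among the four most relevant documents.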

Summary of Search Flags

 

Flag                                    Meaning
dtsSortByName (0x4)                     Sort by filename (without path)
dtsSortByLocation (0x400)               Sort by the path of the file
dtsSortByDate (0x8)                     Sort by modification date, including the time
dtsSortByTime (0x800)                   Sort by modification time, ignoring the date
dtsSortByHitCount (0x400000)            Sort by number of hits
dtsSortByRelevanceScore (0x800000)      Sort by relevance score
dtsSortByHits (0x10)                    Sort by hit count or score, depending on whether the automatic term weighting was used in the search
dtsSortBySize (0x20)                    Sort by file size
dtsSortByField (0x40)                   Sort by one of the elements in the userFields set.  The second argument userField identifies the field to use for sorting
dtsSortByIndex (0x80)                   Sort by the index retrieved from
dtsSortByType (0x100)                   Sort by file type
dtsSortByTitle (0x200)                  Sort by the title string
dtsSortBySortKey (0x1000)               Sort by caller-specified sort key (use setSortKey for each item to specify the key)
dtsSortAscending (0x2)                  Add dtsSortAscending to the flag to request an ascending-order sort. Otherwise the sort will be in descending order.
dtsSortNumeric (0x20000)                Sort by the numeric value of a field instead of its string value. This would cause "20" to be considered greater than "9". The sort key will be a signed, 32-bit integer.
dtsSortFloatNumeric (0x100000)          Sort by the floating point numeric value of a field instead of its string value. The sort key will be a double-precision floating point number.
dtsSortCleanText (0x200000)             Remove leading punctuation or white space from sort value before sorting.  Also removes "re:", "fw:", and "fwd:".
dtsSortCaseInsensitive (0x010000)       Make string comparisons in the sort case-insensitive
dtsSortPdfUseTitleAsName (0x040000)     When sorting by filename, use the PDF Title as the filename for PDF files.
dtsSortHtmlUseTitleAsName (0x080000)    When sorting by filename, use the HTML Title as the filename for HTML files.

Add dtsSortAscending to specify an ascending order (the default is descending order). Add dtsSortCaseInsensitive or dtsSortNumeric to modify a sort based on a string criterion.
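As a small illustration of combining flags, the sketch below (plain Python, not dtSearch code) ORs three flag values from this article into a single flags argument and shows why dtsSortNumeric matters for values such as "9" and "20". The sample values are made up.

```python
# Flag values copied from the flag summary in this article.
dtsSortAscending = 0x2
dtsSortByField = 0x40
dtsSortNumeric = 0x20000

# Flags are single bits, so they are combined with bitwise OR.
flags = dtsSortByField | dtsSortNumeric | dtsSortAscending
print(hex(flags))  # 0x20042

values = ["9", "20", "100"]

# String comparison (no dtsSortNumeric): "100" < "20" < "9".
as_strings = sorted(values)

# Numeric comparison (with dtsSortNumeric): 9 < 20 < 100.
as_numbers = sorted(values, key=int)

print(as_strings)  # ['100', '20', '9']
print(as_numbers)  # ['9', '20', '100']
```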

Relevance

In relevancy-ranked searches, dtSearch uses a "vector-space" algorithm to calculate a score for each document that takes into account the relative frequency of the search terms and their density in the retrieved file.  Infrequent terms count more heavily than common terms, and N hits in a short document count more heavily than N hits in a long document.
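The two properties described above can be illustrated with a toy score in Python. This is not dtSearch's actual formula, which this article does not give; the function toy_score, its parameters, and the log-based rarity weight are all assumptions chosen only to show that rarer terms and denser hits raise a score.

```python
import math


def toy_score(hits_per_term, doc_length, docs_with_term, total_docs):
    """Toy relevance score (NOT the dtSearch algorithm).

    hits_per_term:  {term: number of hits in this document}
    doc_length:     number of words in this document
    docs_with_term: {term: number of documents in the index containing it}
    total_docs:     number of documents in the index
    """
    score = 0.0
    for term, hits in hits_per_term.items():
        # Infrequent terms get a larger weight than common terms.
        rarity = math.log(total_docs / docs_with_term[term])
        # The same number of hits counts for more in a shorter document.
        density = hits / doc_length
        score += rarity * density
    return score


docs_with_term = {"apple": 10, "the": 900}

rare = toy_score({"apple": 3}, 100, docs_with_term, 1000)
common = toy_score({"the": 3}, 100, docs_with_term, 1000)
short_doc = toy_score({"apple": 3}, 100, docs_with_term, 1000)
long_doc = toy_score({"apple": 3}, 1000, docs_with_term, 1000)

print(rare > common)         # rare term outweighs common term
print(short_doc > long_doc)  # same hits, shorter document scores higher
```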

An additional positional scoring option weights hits more highly when they occur close to each other or close to the top of the file. For example, if you search for apple pie recipe, a document with those three words together near the top of the file will rank higher than a document that has the words scattered randomly throughout.

In the dtSearch Engine API, use dtsSearchAutoTermWeight to enable the vector-space scoring, and dtsSearchPositionalScoring to enable the positional scoring.   Automatic term weighting and positional scoring can be combined, and using both is recommended for best results.  The default in dtSearch Web and dtSearch Desktop is to use both.

dtSearch Desktop offers the option to "Sort by relevance" or "Sort by hit count".   "Sort by relevance" uses both positional scoring and automatic term weighting.  Sorting by hit count sorts very simply by the total number of words that matched as hits (each word in a phrase counts separately, so "first class mail" would be three hits).

Variable term weighting

You can change the term weights for each term in your search request, like this:

apple:5 and pear:1

This request would retrieve the same documents as apple and pear, but dtSearch would count apple five times as heavily as pear when sorting the results.

Weights can also be applied to a field in a boolean search, like this:

(description:5 contains (apple and pear)) or (author:2 contains ("John Smith"))

Sorting by a Combination of Values

To sort search results by a combination of values, use SearchResults.SetSortKey() to set the sort key for each item in search results, and then call SearchResults.Sort(dtsSortBySortKey, "") to sort by the sort key.

For example, suppose you want to sort by file date and then by filename. For each item in search results, you would generate a string combining these two values (example: "2005-07-22 Filename.doc"), and call SearchResults.SetSortKey to assign the generated sort key to that item. Example (VBScript):

' Assign a sort key to each item in search results.
' MakeSortKey() is a function that generates the sort
' key from the properties of the search results item
' and returns a string.
dim res, i
set res = sj.Results
for i = 0 to res.Count-1
    ' Select item i in the results list
    res.GetNthDoc(i)
    res.SetSortKey MakeSortKey(res, i)
next

Once every item in search results has been assigned a key, you can call Sort(dtsSortBySortKey, ""). For a complete example, see this sample included with the dtSearch Engine:

    C:\Program Files\dtSearch Developer\examples\vbs\sort.vbs
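The key-building step can also be sketched in plain Python, independent of the dtSearch API. The function make_sort_key and the sample items are hypothetical; the sketch only shows why a "date then filename" string such as "2005-07-22 Filename.doc" sorts correctly as a single key.

```python
from datetime import date


def make_sort_key(modified, filename):
    """Hypothetical key builder: date first, then filename.

    ISO dates (YYYY-MM-DD) have fixed width and most-significant parts
    first, so comparing the strings orders them chronologically.
    """
    return f"{modified.isoformat()} {filename}"


items = [
    (date(2005, 7, 22), "Zeta.doc"),
    (date(2005, 7, 22), "Alpha.doc"),
    (date(2004, 1, 5), "Report.doc"),
]

# One key per item; ties on the date are broken by the filename.
keys = sorted(make_sort_key(modified, name) for modified, name in items)
for key in keys:
    print(key)
```

This sketch sorts in ascending order. With the dtSearch Engine, the sort is descending unless dtsSortAscending is added to the flags.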

Sorting vs. Selection

When sorting by something other than hits or relevance, it is important to keep in mind the difference between sorting and selection.  In a search, dtSearch first selects the most relevant documents, using hit count or score to compare documents.   After this is done, the results can be sorted by filename, file size, etc.  

The difference between sorting and selection becomes significant when you are displaying search results in pages.   

For example, suppose you have a web searching application that displays search results in pages of 10 items. To implement this, each page has a "Next page" link that points back to the searching script and repeats the search, passing a variable that indicates which page of search results should be displayed. For the first page, the script sets MaxFilesToRetrieve = 10 and displays the first 10 hits. For the second page, the script sets MaxFilesToRetrieve = 20 and displays the next 10 hits (items 11-20). As long as results are being sorted by hit count or relevance, this will work, because the criterion used to select items in the search is the same as the one used to sort items after the search.

Now suppose you try the same approach in a search that is sorted by filename.   When you click "Next Page" to get the second page of hits, you will see results that may overlap with the first page.   This is because the first page contains the 10 most relevant files, sorted by filename, while the second page contains the 20 most relevant files, sorted by filename.  Items 11-20 in the second search results set may contain items that were reported in the first results set.   For example, suppose the top-ranked document in search results is named "zzz.doc".   It will appear in both sets of search results, and when sorted by filename, it will appear at the end of the list.   This means it will appear as item #10 when you display the top 10 documents, sorted by filename, and it will appear as item #20 when you display items 11-20 from the top 20 documents, sorted by filename.

To avoid this problem, MaxFilesToRetrieve has to be much larger than the page size, so each page of search results will display a different range of items from the same sorted list.   If instead of setting MaxFilesToRetrieve to 10 for the first page, and 20 for the next page, you set it to 200 for every page, and just report a different range of items, then each page will be consistent.
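The overlap and the fix can be reproduced with a small model in plain Python. It uses made-up documents and a page size of 2 instead of 10 to keep the output short; the function names are hypothetical and the code does not call the dtSearch Engine.

```python
def top_n_sorted_by_name(docs, max_files):
    """Select the max_files most relevant docs, then sort them by filename."""
    selected = sorted(docs, key=lambda d: d["score"], reverse=True)[:max_files]
    return sorted(d["name"] for d in selected)


def page_growing_limit(docs, page, page_size):
    """Problem case: the retrieval limit grows with the page number."""
    names = top_n_sorted_by_name(docs, max_files=page * page_size)
    return names[(page - 1) * page_size:page * page_size]


def page_fixed_limit(docs, page, page_size, max_files):
    """Fix: every page is a slice of the same sorted list."""
    names = top_n_sorted_by_name(docs, max_files)
    return names[(page - 1) * page_size:page * page_size]


docs = [
    {"name": "zzz.doc", "score": 100},  # top-ranked, but last by filename
    {"name": "b.doc", "score": 90},
    {"name": "a.doc", "score": 80},
    {"name": "c.doc", "score": 70},
    {"name": "d.doc", "score": 60},
    {"name": "e.doc", "score": 50},
]

print(page_growing_limit(docs, 1, 2))   # ['b.doc', 'zzz.doc']
print(page_growing_limit(docs, 2, 2))   # ['c.doc', 'zzz.doc']  (zzz.doc repeats)
print(page_fixed_limit(docs, 1, 2, 6))  # ['a.doc', 'b.doc']
print(page_fixed_limit(docs, 2, 2, 6))  # ['c.doc', 'd.doc']
```

With the growing limit, zzz.doc appears on both pages and a.doc is never shown. With a fixed limit, the pages do not overlap.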