Indexing COM Data Sources

How to index data from COM object interfaces such as ADO, CDO, etc.

Remarks

Overview

An IndexJob (IIndexJob) provides two ways to specify the text you want to index: by files (the toAdd* properties) and by data source (the DataSourceToIndex property). Most commonly, the text exists in disk files, in which case you would specify the files to be indexed using folder names and include and exclude filters. In some situations, however, the text to be indexed may not be readily available as disk files. For example, the text may exist as rows in a remote SQL database or in Microsoft Exchange message stores. You could copy the text from the database to local disk files and index the local disk files, but the dtSearch Engine provides a more direct and efficient solution. To supply this text to the dtSearch indexing engine, you create an object that accesses the text and then attach the object to an IndexJob as the DataSourceToIndex property.

Interface

The DataSourceToIndex property must be set to a Visual Basic object (or other IDispatch object) that implements the following methods and properties:

Method	Purpose
Rewind	Initialize the data source so that the next GetNextDoc call will return the first document. Rewind returns 0 if it succeeds, non-zero if the data source is empty.
GetNextDoc	Get the next document from the data source. The document information is stored in the properties. GetNextDoc returns 0 if it succeeds, non-zero if there are no more documents.

Property	Purpose
DocName	The DocName is the name of the document, as you want it to appear in search results. This can be any legal Win32 filename.
DocIsFile	If True, DocName will be interpreted as the name of a file to be indexed, and dtSearch will index the contents of the file along with any data provided in DocText and DocFields. The DocModifiedDate will still be used as the modification date of the document.
DocDisplayName	The DocDisplayName is a user-friendly version of the filename, which the dtSearch end-user product displays in search results. If blank, the DocName will be used.
DocModifiedDate	The date that the document was last modified.
DocText	In DocText, supply the text you want the dtSearch Engine to index. This must be plain text that dtSearch can index directly. If you need to index text that is in a document, you can use a FileConverter to convert the document data to plain text.
DocFields	In DocFields, supply any fielded data you want the dtSearch Engine to index. DocFields consists of a series of pairs of field names and values, with tab characters (chr$(9)) between them.
DocId, DocWordCount, DocTypeId	Properties that indicate how the previously-returned document was indexed (see "Returned Properties" below).

Highlighting Hits

After a search that retrieves a document that was returned from the DataSourceToIndex object, you can generate a hit-highlighted version of the document using FileConverter's InputText, InputFields, and InputFile properties.

To do this, set up the FileConverter with the same data supplied by your data source. Set the FileConverter's InputFields property to the value of DocFields, the InputText property to the value of DocText, and, if DocIsFile was set to true, set the InputFile property to the value of DocName. For sample code, see the dsdemo Visual Basic sample.
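
The property mapping described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the dsdemo code: LoadConverterInput is a hypothetical helper name, fc is assumed to be an already-created FileConverter, and ds is assumed to be your data source object, exposing DocIsFile and holding the same Doc* values it supplied when the retrieved document was indexed. Both objects are late-bound (As Object) so that no type library is assumed. Hit information, output settings, and the call that runs the conversion are not shown.

```vb
' Hypothetical helper: copies the data originally supplied by the data source
' into the FileConverter input properties named in this topic.
Public Sub LoadConverterInput(ByVal fc As Object, ByVal ds As Object)
    ' The same fielded data and plain text that were given to the indexer
    fc.InputFields = ds.DocFields
    fc.InputText = ds.DocText

    ' Only documents indexed with DocIsFile = True also have file contents
    If ds.DocIsFile Then
        fc.InputFile = ds.DocName
    End If
End Sub
```

The important point is that the converter must see exactly what the indexer saw, so ds has to be repositioned on the same record before this helper is called.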

Field Names

By default, field names are searchable along with field text. For example, if DocFields contains SampleField<TAB>Some Text, then you can find the document in a search either for "SampleField contains Text" or just "SampleField". To prevent a field name from being searchable, add * (asterisk) in front, like this:

*SampleField<TAB>Some Text

When a field name begins with *, only the text of the field is searchable, but not the name. Therefore, you can find the document in a search for "SampleField contains Text" but not by searching for just "SampleField". The * is not considered part of the field name for purposes of searching or designating stored fields. For sample code demonstrating non-searchable field names, see the dsdemo Visual Basic sample.

When a field name begins with **, the field is considered a "hidden stored" field. The contents of a hidden stored field are not searchable at all, and are automatically stored in the index as document properties when the document is indexed. To retrieve the value of a hidden stored field after a search, use SearchResults.DocDetailItem("FieldName").

Field names can include nesting. Example: Meta/Subject<TAB> This is the subject<TAB>Meta/Author<TAB> This is the author

In this example, you could search across both fields by searching for "Meta contains (something)", or you could search for "Author contains (something)", or you could search for "Meta/Author contains (something)" to distinguish this Author field from any other Author fields that might be present in the document. For more information on searching for nested fields, see: Field Searching
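
The field name conventions above can be illustrated with a small string-building helper. This is only a sketch: AppendField and BuildSampleFields are hypothetical names, and the trailing tab after each value follows the ADO example at the end of this topic rather than any stated requirement.

```vb
' Hypothetical helper: appends one name/value pair to a DocFields string.
' Tabs inside the value are converted to spaces so they cannot be mistaken
' for pair delimiters.
Public Function AppendField(ByVal docFields As String, ByVal fieldName As String, _
                            ByVal fieldValue As String) As String
    AppendField = docFields & fieldName & vbTab & _
                  Replace(fieldValue, vbTab, " ") & vbTab
End Function

' Builds a DocFields string that uses each naming convention once.
Public Function BuildSampleFields() As String
    Dim s As String
    s = AppendField(s, "SampleField", "Some Text")           ' name and text searchable
    s = AppendField(s, "*Category", "Some" & vbTab & "Text") ' text searchable, name not
    s = AppendField(s, "**RowId", "42")                      ' hidden stored field
    s = AppendField(s, "Meta/Author", "This is the author")  ' nested field
    BuildSampleFields = s
End Function
```

Keeping the prefix as part of the name passed in means the caller decides, field by field, whether a name is searchable, non-searchable, or hidden stored.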

Returned Properties

Each time GetNextDoc is called, the following properties will provide status information about how the previously-returned document was indexed:

Property	Purpose
DocId	An integer that identifies this document in the index. DocId values can be used with the SearchFilter object.
DocWordCount	The number of words that were indexed in this document.
DocTypeId	An integer identifying the file format of the document.

For sample code demonstrating the use of returned properties, see the dsdemo sample application.
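
Because the returned properties describe the previously returned document, a data source can read them at the start of each GetNextDoc call. The class module below is a minimal sketch of that pattern, not the dsdemo code. The class name LoggingDataSource, the in-memory sample documents, and the Long type chosen for the returned properties are all assumptions made for illustration.

```vb
' Class module sketch (hypothetical name: LoggingDataSource)
Option Explicit

Public DocName As String
Public DocDisplayName As String
Public DocModifiedDate As Variant
Public DocText As String
Public DocFields As String
Public DocIsFile As Boolean

' Filled in by the indexer for the previously returned document (types assumed)
Public DocId As Long
Public DocWordCount As Long
Public DocTypeId As Long

Private docs As Variant     ' in-memory sample documents
Private iNext As Long       ' index of the next document to return
Private lastName As String  ' name of the document returned by the previous call

Private Sub Class_Initialize()
    docs = Array("first sample text", "second sample text")
End Sub

Public Function Rewind() As Long
    iNext = LBound(docs)
    lastName = ""
    Rewind = 0
End Function

Public Function GetNextDoc() As Long
    ' Report how the previous document was indexed, if there was one
    If lastName <> "" Then
        Debug.Print lastName & ": DocId=" & DocId & _
                    ", words=" & DocWordCount & ", type=" & DocTypeId
    End If

    If iNext > UBound(docs) Then
        lastName = ""
        GetNextDoc = -1 ' no more documents
        Exit Function
    End If

    DocName = "sample" & iNext & ".txt"
    DocDisplayName = ""
    DocModifiedDate = Now
    DocText = docs(iNext)
    DocFields = ""
    DocIsFile = False
    lastName = DocName
    iNext = iNext + 1
    GetNextDoc = 0
End Function
```

The status for the last document is only available on the final GetNextDoc call, the one that returns non-zero, which is why the logging happens before the end-of-data check.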

Module

COM Interface

Example

This example demonstrates indexing of a database table using Microsoft ActiveX Data Objects (ADO). The example assumes that the RecordSet will be created using a SELECT statement, which is not shown.

' These are the public properties that the dtSearch Engine will use to index each document.
' Rewind() and GetNextDoc() will set them up.
Public DocName As String
Public DocDisplayName As String
Public DocModifiedDate As Variant
Public DocCreatedDate As Variant
Public DocText As String
Public DocFields As String
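' TableName and iRowIdField are used by GetRowInfo below; they are assumed
' to be set by the caller before indexing starts.
Public TableName As String
Public iRowIdField As Integer
Dim iRow As Integer
Dim rs As ADODB.Recordset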

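' Rewind() initializes the data source so that the next GetNextDoc() call will return
' the first document in the data source. This assumes that rs is an open ADO RecordSet.
' It returns 0 on success, or non-zero if the RecordSet is empty.
Public Function Rewind() As Long
    If rs.BOF And rs.EOF Then
        Rewind = -1 ' data source is empty
    Else
        rs.MoveFirst
        iRow = 0
        Rewind = 0
    End If
End Function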

' GetNextDoc() returns -1 on EOF (the end of the RecordSet). If there is another
' record to index, it calls GetRowInfo to transfer the row data to the public properties
' and returns 0, indicating that there is a new document to index.

Public Function GetNextDoc() As Long
    If (rs.EOF) Then
        GetNextDoc = -1 ' no more documents
    Else
        GetRowInfo
        rs.MoveNext
        iRow = iRow + 1
        GetNextDoc = 0
    End If
End Function

Private Sub GetRowInfo()
    Dim fields As ADODB.fields
    Set fields = rs.fields

    ' In this example, the DocName is constructed from the TableName and
    ' a field that will be used as the Row Id, iRowIdField. The format
    ' of each DocName will be: TableName#FieldName=Value

    DocName = TableName & "#" & fields(iRowIdField).Name & "=" & _
            fields(iRowIdField).Value
    DocModifiedDate = Now
    DocCreatedDate = Now
    ' Store the field name and value in DocText in the format FieldName = Value,
    ' and also store the field names and values in the DocFields string.
    ' The dtSearch Engine will index DocText as non-fielded, plain text.
    ' It will index the contents of DocFields as fielded data, so that
    ' it will be possible to search for "FieldName contains FieldValue".
    ' (The example is redundant since there is no need to supply anything
    ' in DocText if all of the data is present in DocFields.)
    DocText = ""
    DocFields = ""
    Dim f As ADODB.Field
    Dim i As Integer
    Dim fieldValue As String
    For i = 1 To fields.Count
        Set f = fields(i - 1)
        DocText = DocText & f.Name & " = " & f.Value & Chr$(13) & Chr$(10)
        ' DocFields is a series of {fieldname, fieldValue} pairs delimited with the Chr$(9) (tab)
        ' character. To construct this, we must make sure that the field text does not contain
        ' any tab characters (here we convert them to spaces).
        ' Appending "" converts a Null field value to an empty string.
        fieldValue = f.Value & ""
        fieldValue = Replace(fieldValue, Chr$(9), " ")
        DocFields = DocFields & f.Name & Chr$(9) & fieldValue & Chr$(9)
    Next i
End Sub