Interface for the DataSourceToIndex member of IndexJob, for indexing non-file data sources such as databases.
An IndexJob provides two ways to specify the text to index: by files (the FoldersToIndex, IncludeFilters, and ExcludeFilters properties) and by data source (the DataSourceToIndex property). Most commonly, the text exists in disk files, in which case you would specify the files to be indexed using folder names and include and exclude filters.
In some situations, however, the text to be indexed may not be readily available as disk files. For example, the text may exist as rows in a remote SQL database or in Microsoft Exchange message stores. To supply this text to the dtSearch indexing engine, you can create an object that accesses the text and then attach the object to an IndexJob as the DataSourceToIndex property.
The dtSearch Engine will call the GetNextDoc method of your DataSource implementation to obtain documents to index. On each call, dtSearch will use the properties supplied (DocName, DocModifiedDate, DocFields, DocBytes, etc.) to set up a document object to index.
On each call to GetNextDoc, the DocTypeId, DocId, and DocWordCount properties will be filled in with the results of the previous document indexed. This enables the calling application to know the file type and document id assigned to each document after it has been indexed. (The document id is a unique integer identifying each document in an index, and can be used in SearchFilter objects to limit searches to a subset of the documents in the index.)
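The call sequence described above can be modeled with a small, self-contained sketch. It does not use the dtSearch API: `RowDataSource`, `toy_indexer`, and the snake_case attribute names are hypothetical stand-ins that mirror the properties named in the text, and the Boolean return value of `get_next_doc` is an assumption of this sketch. The point is only to show when the previous document's results become visible to the data source.

```python
from datetime import datetime


class RowDataSource:
    """Hypothetical stand-in for a DataSource implementation over in-memory rows."""

    def __init__(self, rows):
        self.rows = rows              # list of (name, modified date, text) tuples
        self.pos = 0
        # Set by the data source for the indexer to read.
        self.doc_name = ""
        self.doc_modified_date = None
        self.doc_text = ""
        # Set by the indexer; describe the PREVIOUS document on each call.
        self.doc_id = 0
        self.doc_type_id = 0
        self.doc_word_count = 0
        self.assigned = {}            # doc name -> (doc id, word count)

    def rewind(self):
        """Reset so the next get_next_doc call returns the first document."""
        self.pos = 0
        return True

    def get_next_doc(self):
        # On entry, doc_id and doc_word_count still refer to the document
        # returned by the previous call, so this is the place to record them.
        if self.pos > 0:
            self.assigned[self.doc_name] = (self.doc_id, self.doc_word_count)
        if self.pos >= len(self.rows):
            return False              # assumption: False means no more documents
        self.doc_name, self.doc_modified_date, self.doc_text = self.rows[self.pos]
        self.pos += 1
        return True


def toy_indexer(source):
    """Stand-in for the engine's loop; real ids and type ids come from dtSearch."""
    next_id = 1
    source.rewind()
    while source.get_next_doc():
        source.doc_id = next_id
        source.doc_type_id = 0        # placeholder value, not a real type id
        source.doc_word_count = len(source.doc_text.split())
        next_id += 1


rows = [
    ("orders/1", datetime(2023, 1, 1), "alpha beta gamma"),
    ("orders/2", datetime(2023, 1, 2), "delta"),
]
source = RowDataSource(rows)
toy_indexer(source)
```

Note that the results for the last document only arrive on the final call, the one that reports that no more documents are available, so a data source that records ids should do so before checking whether it has run out of rows.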
If the IndexingFlags.dtsAlwaysAdd flag is not set in the IndexJob, the DocName and DocModifiedDate will be used to determine whether the document is already in the index with the same date, and, if so, the document will not be reindexed. In this case, the DocTypeId, DocId, and DocWordCount properties will be set to the values assigned when the document was originally indexed.
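As a rough illustration of that rule, the sketch below models the decision with a plain dictionary standing in for the index. The function `index_document`, its parameters, and the id numbering are hypothetical; the real comparison and id assignment happen inside the dtSearch Engine. Only the doc id and word count are tracked here; the type id would be handled the same way.

```python
from datetime import datetime


def index_document(index, doc_name, doc_modified, word_count, always_add=False):
    """Return (doc_id, word_count, was_indexed) for one document in a toy index.

    `index` maps a document name to the values stored when it was indexed.
    `always_add` plays the role of the dtsAlwaysAdd flag in this model.
    """
    entry = index.get(doc_name)
    if entry is not None and not always_add and entry["modified"] == doc_modified:
        # Same name and same date: skip, and report the original values.
        return entry["doc_id"], entry["word_count"], False
    # New or changed document: assign a fresh id (numbering is a placeholder).
    doc_id = max((e["doc_id"] for e in index.values()), default=0) + 1
    index[doc_name] = {
        "modified": doc_modified,
        "doc_id": doc_id,
        "word_count": word_count,
    }
    return doc_id, word_count, True


index = {}
stamp = datetime(2023, 5, 1)
first = index_document(index, "orders/42", stamp, 120)
again = index_document(index, "orders/42", stamp, 120)
```

In this model the second call leaves the index untouched and returns the values stored by the first call, which is the behavior the paragraph above describes for an unchanged document.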
When using the multithreaded DataSource API, the indexer indexes every document returned from GetNextDoc, even if it has not changed since it was last indexed. To prevent redundant indexing, the indexing application should return only new or modified documents from the DataSource.
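One way an application might do this filtering is to remember when the previous indexing run started and pass along only rows modified after that time. The sketch below assumes each row carries its own modification timestamp; `changed_since` and the row layout are hypothetical and are not part of the dtSearch API.

```python
from datetime import datetime


def changed_since(rows, last_indexed):
    """Yield only rows that are new or modified after the previous run.

    `rows` is an iterable of (name, modified, text) tuples.
    `last_indexed` is the start time of the previous run, or None on the first run.
    """
    for name, modified, text in rows:
        if last_indexed is None or modified > last_indexed:
            yield name, modified, text


rows = [
    ("orders/1", datetime(2023, 1, 1), "unchanged row"),
    ("orders/2", datetime(2023, 3, 1), "edited row"),
]
pending = list(changed_since(rows, datetime(2023, 2, 1)))
```

A data source built this way would return only the rows in `pending` from its GetNextDoc implementation.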
The IncludeFilters and ExcludeFilters in IndexJob do not apply to content returned from a data source.
The DocFields property lets you add metadata to the document text. Fields can be searchable or non-searchable, and can be designated as "stored" so they will be returned as document properties in search results (for example, to store a row id for easy access after a search). Field names can also include nesting, so instead of just "Author" or "Subject" you could use "Meta/Author" and "Meta/Subject".
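The sketch below builds a field string from a dictionary, using nested names such as "Meta/Author". It assumes that DocFields takes tab-delimited name and value pairs; this section does not state the format, so check the DocFields reference before relying on it. The helper `build_doc_fields` is hypothetical, and the markers for stored or non-searchable fields are not covered.

```python
def build_doc_fields(fields):
    """Join name/value pairs into one tab-delimited string.

    Assumption: the field string alternates names and values separated by
    tabs, so whitespace inside a value is collapsed to single spaces.
    """
    parts = []
    for name, value in fields.items():
        clean_value = " ".join(str(value).split())
        parts.extend([name, clean_value])
    return "\t".join(parts)


doc_fields = build_doc_fields({
    "Meta/Author": "Jane\tDoe",
    "Meta/Subject": "Quarterly report",
    "RowId": 42,
})
```

Collapsing whitespace inside values keeps a stray tab in the source data from being read as a field separator.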
The DocText property can be used to add plain-text content to the document. DocText is assumed to be text only, so if it contains text-like data such as RTF, HTML, or MIME-encoded email, the tags will be indexed as plain text rather than interpreted as RTF, HTML, or MIME.
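If a row holds HTML and only its visible text should be indexed through DocText, one option is to strip the markup before assigning the value. The sketch below does this with Python's standard `html.parser` module; `html_to_plain_text` is a hypothetical helper, and it makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_plain_text(markup):
    """Return the text content of `markup` with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


plain = html_to_plain_text("<p>Hello <b>world</b></p>")
```

The alternative is to leave the markup intact and supply the document as binary content through DocBytes or DocStream instead of DocText.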
See also: Overview - Indexing Databases in dtSearchApiRef.chm
The following tables list the members exposed by DataSource.
DataSource Methods | Description
GetNextDoc | Get the next document from the data source.
Rewind | Initialize the data source so the next GetNextDoc call will return the first document.
DataSource Properties | Description
DocBytes | Use DocBytes to provide an array of bytes for dtSearch to use as the binary contents of this document.
DocCreatedDate | The date that the document was originally created.
DocDisplayName | The DocDisplayName is a user-friendly version of the filename, which the dtSearch end-user product displays in search results.
DocError | If WasDocError is true, DocError will contain a string providing details on the nature of the error.
DocFields | In DocFields, supply any fielded data you want the dtSearch Engine to index.
DocId | Each time GetNextDoc() is called, DocId will contain the doc id of the previous document.
DocModifiedDate | The date that the document was last modified.
DocName | The DocName is the name of the document, as you want it to appear in search results.
DocStream | Use DocStream to provide access to binary document data for this document in the data source.
DocText | In DocText, supply the text you want the dtSearch Engine to index.
DocTypeId | Each time GetNextDoc() is called, DocTypeId will return an integer identifying the file type of the previous document.
DocWordCount | Each time GetNextDoc() is called, DocWordCount will contain the number of words in the previous document.
WasDocError | Each time GetNextDoc() is called, WasDocError will be true if there was an error processing the previous document (such as a file parsing error).
Copyright (c) 1998-2023 dtSearch Corp. All rights reserved.