Caching documents

Storing documents and document text in an index.

Remarks

dtSearch indexes can cache documents in two ways, which can be used separately or together: (1) the entire original file can be cached, or (2) just the text of the file can be cached. Cached documents are stored using ZIP compression.

The benefit of caching documents is faster and easier highlighting of hits. This is especially true when the index was created using the dtSearch Spider or the "Data Source" indexing API. In these types of indexes, the document names in the index do not correspond to local disk files, so access to the original document may be slow or even impossible. With cached document text in the index, dtSearch can generate hit-highlighted document displays and search reports with no need to access the original data. Because of these benefits, both types of caching (text and original document) are recommended for indexes created using the dtSearch Spider.

How to enable caching

To enable caching, create an index with the dtsIndexCacheText and/or the dtsIndexCacheOriginalFile flag set in IndexJob.IndexingFlags. These flags must be set when an index is created and will have no effect when an index is updated, because caching has to be built into the index structure when the index is first created.
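The create-time rule can be sketched in Python, purely as an illustration. Only the two flag names and the rule itself come from the text above. The IndexingFlags class here is a stand-in, its numeric values are placeholders, and effective_cache_flags is a hypothetical helper that is not part of the dtSearch API.

```python
from enum import IntFlag


class IndexingFlags(IntFlag):
    # Placeholder bit values for illustration only. The real constants
    # are defined by the dtSearch API and will differ.
    dtsIndexCacheText = 0x1
    dtsIndexCacheOriginalFile = 0x2


CACHE_FLAGS = (
    IndexingFlags.dtsIndexCacheText | IndexingFlags.dtsIndexCacheOriginalFile
)


def effective_cache_flags(requested, index_exists, existing=IndexingFlags(0)):
    """Model the rule that caching is fixed when the index is first created.

    requested:    flags set in IndexJob.IndexingFlags for this job
    index_exists: False when creating an index, True when updating one
    existing:     caching flags the index was originally created with
    """
    if index_exists:
        # Update: caching flags passed with the job have no effect.
        return existing & CACHE_FLAGS
    # Create: the requested caching flags become part of the index.
    return requested & CACHE_FLAGS
```

In this model, requesting both flags on an update of an index that was created without caching yields no caching at all, which is why the flags have to be chosen before the index is first built.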

You can also create indexes with caching enabled using dtSearch Desktop. To do this, run dtSearch Index Manager, click "Create Index (Advanced)", and check the boxes for the types of caching to enable.

Using cached text - highlighting hits

To display the original file with hits highlighted, caching the original file is the better choice, because the hit-highlighted display can then preserve the document's formatting. (When HTML files are cached, only the HTML itself is stored.)

dtSearch Desktop and dtSearch Web will automatically use cached original documents as input for hit-highlighting if an index contains cached original documents.

To use a cached document as input for FileConverter:

(1) Set FileConverter.IndexRetrievedFrom to the index path.

(2) Set FileConverter.InputDocId to the document id of the file (which can be obtained from search results as DocDetailItem("_docId")).

(3) Set dtsConvertGetFromCache in FileConverter.Flags.
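The three settings above can be collected as follows. This is a sketch in Python that does not call the dtSearch library: cached_conversion_settings is a hypothetical helper, the flag value is a placeholder, and the doc_details dictionary stands in for the detail fields of one search result. Treating the document id as an integer is also an assumption.

```python
# Placeholder value for illustration only. The real constant is defined
# by the dtSearch API.
dtsConvertGetFromCache = 0x1


def cached_conversion_settings(index_path, doc_details, flags=0):
    """Collect the FileConverter settings for one search result.

    doc_details stands in for the detail fields of a search result item;
    "_docId" is the field named in the steps above.
    """
    return {
        # Step 1: the index that holds the cached document.
        "IndexRetrievedFrom": index_path,
        # Step 2: the document id taken from the search results.
        "InputDocId": int(doc_details["_docId"]),
        # Step 3: add the cache flag without dropping flags already set.
        "Flags": flags | dtsConvertGetFromCache,
    }
```

The flag is combined with a bitwise OR so that any other conversion flags already chosen by the caller are kept.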

Using cached text - synopsis/search reports

Caching documents in text form can make generation of search reports faster, especially generation of the synopsis in search results. The text is cached in small chunks as compressed UTF-8, so dtSearch can quickly locate the context around hits, even in long documents.
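Why chunked storage helps can be shown with a small Python sketch. It is not dtSearch's actual storage format: the chunk size, the function names, and the use of zlib (DEFLATE, the algorithm commonly used in ZIP files) are all assumptions made for illustration. The point is only that the context around a hit can be read by decompressing one or two chunks rather than the whole document.

```python
import zlib

CHUNK_CHARS = 4096  # hypothetical chunk size, chosen only for illustration


def cache_text(text, chunk_chars=CHUNK_CHARS):
    """Split text into fixed-size chunks and compress each one as UTF-8."""
    return [
        zlib.compress(text[i:i + chunk_chars].encode("utf-8"))
        for i in range(0, len(text), chunk_chars)
    ]


def context_around(chunks, char_offset, width=40, chunk_chars=CHUNK_CHARS):
    """Return the text near char_offset, decompressing only the chunks needed."""
    start = max(0, char_offset - width)
    end = char_offset + width
    first = start // chunk_chars
    last = min(end // chunk_chars, len(chunks) - 1)
    # Only chunks first..last are decompressed, however long the document is.
    window = "".join(
        zlib.decompress(c).decode("utf-8") for c in chunks[first:last + 1]
    )
    base = first * chunk_chars
    return window[start - base:end - base]
```

For a hit at character offset pos, context_around(cache_text(text), pos) returns the same characters as text[max(0, pos - 40):pos + 40], but without decompressing the rest of the document.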

dtSearch Desktop and dtSearch Web will automatically use cached text as input for generation of search reports and the synopsis in search results.

To use cached documents as input for SearchReportJob, set the dtsReportGetFromCache flag in SearchReportJob.Flags.

Excluding DocFields from cached text

Fields added to documents using the DocFields property in the DataSource API will be included with the cached text, unless you set the dtsIndexCacheTextWithoutFields flag in IndexJob.IndexingFlags. Setting this flag will prevent DocFields values from appearing in search reports or the synopsis.

Performance implications of caching text

Caching text has no effect on search speed.

Caching text will make indexing slower due to the need to compress and store the text in the index. It will make the index larger due to the stored document data. Compression reduces the size of stored documents by about 70-80%.
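The effect on index size can be estimated from the 70-80% figure above. The Python helper below is a back-of-the-envelope sketch, not a dtSearch function, and its name and parameters are hypothetical. Actual results depend on the documents being cached.

```python
def estimated_cache_size(original_bytes, reduction_low=0.70, reduction_high=0.80):
    """Return (smallest, largest) estimated stored size in bytes.

    reduction_low and reduction_high are the fractions by which
    compression is expected to shrink the stored data.
    """
    smallest = round(original_bytes * (1 - reduction_high))
    largest = round(original_bytes * (1 - reduction_low))
    return smallest, largest
```

For example, estimated_cache_size(1_000_000) returns (200000, 300000), so caching 1,000,000 bytes of document data would add roughly 200,000 to 300,000 bytes to the index under this assumption.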

Security implications of caching text

When a document is retrieved from the cache in an index, any security settings on the original file are not checked. Instead, only access to the index itself is checked. Therefore, a user who is able to search an index will also be able to access any cached documents stored in that index.

Group

Building and Maintaining Indexes