dtSearch Text Retrieval Engine Programmer's Reference
Language Analyzer API

Integrate an external language analyzer with the dtSearch Engine, to implement morphological analysis or word breaking.

The Language Analyzer API provides a way to add customized word breaking and morphological analysis to the dtSearch Engine. A typical use of the Language Analyzer API would be to add dictionary-based word breaking for Japanese or Chinese text to an application that uses the dtSearch Engine API. 

A language analyzer, once registered with the dtSearch Engine, will be called during indexing or text extraction to analyze blocks of text. For each block of text, the analyzer returns (a) a table indicating where the word breaks occur in the text, and (b) a modified version of the text, with any morphological analysis applied to the words. 

The language analyzer will also be called during a search to process the search request. If the language analyzer returns multiple words in a word position in the output, only the first word in that word position will be used in the search request. (In other words, the language analyzer output does not currently function like a thesaurus at search time.) 
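
The search-time rule above can be illustrated with a small, self-contained sketch. It does not use the dtSearch structures: WordPositions and firstWordPerPosition are hypothetical names invented for this example, and the nested vector is only an assumed model of "several words in one word position".

```cpp
#include <string>
#include <vector>

namespace sketch {

// Hypothetical model of analyzer output (not a dtSearch type): one inner
// vector per word position, holding every word returned for that position.
using WordPositions = std::vector<std::vector<std::string>>;

// Returns the words a search request would effectively use under the rule
// described above: the first word of each position; alternatives are ignored.
std::vector<std::string> firstWordPerPosition(const WordPositions& positions) {
    std::vector<std::string> terms;
    for (const auto& alternatives : positions) {
        if (!alternatives.empty()) {
            terms.push_back(alternatives.front());
        }
    }
    return terms;
}

} // namespace sketch
```

For example, if an analyzer returned both "ran" and "run" in the first position and "home" in the second, only "ran" and "home" would take part in the search under this model. An analyzer that wants a particular form to be searched should therefore return that form first.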

To determine whether the input data is a search request or document data, the language analyzer can check the LanguageAnalyzerJobFlags in dtsLaJob.flags.

Implementing a Language Analyzer

An implementation of a Language Analyzer requires two classes: 

(1) The first class encapsulates data that must be initialized once per process, such as large dictionaries, and that can be shared among multiple threads. This class, based on CLanguageAnalyzerBase, must implement a virtual function, makeAnalyzerInstance(), which allocates and returns an instance of the second class. 

(2) The second class, based on CLanguageAnalyzerJob, implements a virtual function, analyze(), which analyzes a block of data. 
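
The division of work between the two classes can be sketched as follows. This is a minimal illustration with stand-in types, not the declarations from dts_la.h: AnalyzerBase and JobBase merely play the roles of CLanguageAnalyzerBase and CLanguageAnalyzerJob, and the signatures shown for makeAnalyzerInstance() and analyze(), the lemma dictionary, and all other names are assumptions made for this example.

```cpp
#include <cctype>
#include <map>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for CLanguageAnalyzerJob: one instance per analysis job.
class JobBase {
public:
    virtual ~JobBase() = default;
    // Assumed signature; the real analyze() works on the dts_la.h structures.
    virtual std::vector<std::string> analyze(const std::string& text) = 0;
};

// Stand-in for CLanguageAnalyzerBase: built once per process, shared by threads.
class AnalyzerBase {
public:
    virtual ~AnalyzerBase() = default;
    virtual std::unique_ptr<JobBase> makeAnalyzerInstance() const = 0;
};

// Second class: breaks text into words and maps each word to a base form.
class SampleJob : public JobBase {
public:
    explicit SampleJob(const std::map<std::string, std::string>& lemmas)
        : lemmas_(lemmas) {}

    std::vector<std::string> analyze(const std::string& text) override {
        std::vector<std::string> words;
        std::string current;
        auto flush = [&] {
            if (current.empty()) return;
            auto it = lemmas_.find(current);
            words.push_back(it != lemmas_.end() ? it->second : current);
            current.clear();
        };
        for (unsigned char c : text) {
            if (std::isalnum(c)) current += static_cast<char>(std::tolower(c));
            else flush();  // any non-alphanumeric character ends a word
        }
        flush();
        return words;
    }

private:
    const std::map<std::string, std::string>& lemmas_;  // read-only shared data
};

// First class: owns the (potentially large) dictionary and hands out jobs.
class SampleAnalyzer : public AnalyzerBase {
public:
    SampleAnalyzer() : lemmas_{{"ran", "run"}, {"mice", "mouse"}} {}

    std::unique_ptr<JobBase> makeAnalyzerInstance() const override {
        return std::unique_ptr<JobBase>(new SampleJob(lemmas_));
    }

private:
    std::map<std::string, std::string> lemmas_;
};

} // namespace sketch
```

In this sketch the dictionary is loaded once in SampleAnalyzer and each SampleJob only reads it, so several jobs can run on different threads without copying the data. A real implementation would derive from the classes in dts_la.h and return word-break and text data in the format those headers define.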

These classes and related structures are provided in examples\cpp\include\dts_la.h and examples\cpp\common\dts_la.cpp. 

For an example demonstrating use of the Language Analyzer API, see the examples\cpp\LanguageAnalyzer folder. A CSampleLanguageAnalyzer class that can be used as a starting point for development is in la_sample.cpp in this folder. 

Additional sample code demonstrating use of the Language Analyzer API to integrate with a morphological analyzer from Basis Technologies (www.basistech.com) can be obtained by contacting dtSearch Corp. (Because the Basis Technologies product's API is covered by a nondisclosure agreement, we can only provide this sample code to Basis Technologies customers.)

Registering a Language Analyzer

The dtsLanguageAnalyzerInterface structure is used to register a language analyzer with the dtSearch Engine. To register a language analyzer: 

(1) Initialize a dtsLanguageAnalyzerInterface:

(a) Call CLanguageAnalyzerBase::makeInterface to set up most of the function pointers in the interface. 

(b) Set the pCreateAnalyzer function pointer to a static function that allocates and returns an instance of your CLanguageAnalyzerBase-derived class. 

(2) Set the pAnalyzer member of dtsOptions to point to the dtsLanguageAnalyzerInterface. 

(3) Call dtssSetOptions.
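
The order of the registration steps can be sketched with stand-in types. Nothing below is the real dtSearch API: AnalyzerInterface, Options, and MyAnalyzer only play the roles of dtsLanguageAnalyzerInterface, dtsOptions, and a CLanguageAnalyzerBase-derived class. The pCreateAnalyzer and pAnalyzer member names come from the steps above, but their types, the pDestroyAnalyzer member, and the makeInterface signature are assumptions; consult dts_la.h for the actual declarations.

```cpp
namespace sketch {

// Stand-in for a CLanguageAnalyzerBase-derived class.
struct MyAnalyzer {};

// Stand-in for dtsLanguageAnalyzerInterface: a table of function pointers.
struct AnalyzerInterface {
    void* (*pCreateAnalyzer)() = nullptr;        // set by the application, step (1)(b)
    void (*pDestroyAnalyzer)(void*) = nullptr;   // hypothetical member set in step (1)(a)
};

// Stand-in for dtsOptions.
struct Options {
    AnalyzerInterface* pAnalyzer = nullptr;
};

// Stand-in for CLanguageAnalyzerBase::makeInterface: fills in the callbacks
// that are the same for every analyzer.
void makeInterface(AnalyzerInterface& la) {
    la.pDestroyAnalyzer = [](void* p) { delete static_cast<MyAnalyzer*>(p); };
}

// Static factory for step (1)(b): allocates and returns the analyzer object.
void* createMyAnalyzer() {
    return new MyAnalyzer();
}

// Performs steps (1) and (2). Step (3) would pass the options structure to
// dtssSetOptions, which is omitted because this sketch has no engine.
void prepareRegistration(AnalyzerInterface& la, Options& opts) {
    makeInterface(la);                       // (1)(a)
    la.pCreateAnalyzer = &createMyAnalyzer;  // (1)(b)
    opts.pAnalyzer = &la;                    // (2)
}

} // namespace sketch
```

The point of the sketch is the sequence: the shared callbacks are filled in first, the application then supplies its own factory function, and only a fully initialized interface is attached to the options.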

A language analyzer can also be made into a DLL that the dtSearch Engine will register automatically. This makes it possible to add a language analyzer to dtSearch Desktop. The DLL must go in the "viewers" folder under the dtSearch Engine "Home" directory, and must export this function: 

extern "C" {
    __declspec(dllexport) BOOL GetLanguageAnalyzer(dtsLanguageAnalyzerInterface& la);
}

GetLanguageAnalyzer fills in the dtsLanguageAnalyzerInterface and returns TRUE to register a language analyzer. 

Highlighting Hits 

To ensure consistent hit highlighting, a language analyzer must be invoked when highlighting hits after a search and must behave the same way it behaves during indexing. 

PDF hit highlighting inside Adobe Reader does not currently work with the language analyzer API. The only kind of hit highlighting that is supported in combination with the language analyzer API is conversion of files using FileConverter.