dtSearch Text Retrieval Engine Programmer's Reference
dtsLaJob Structure

Passed to the language analyzer with a series of blocks of text to analyze

File: dts_la.h

Syntax
C++
struct dtsLaJob {
    const wchar_t * inputBuffer;
    long inputTextLength;
    long flags;
    const wchar_t * outputBuffer;
    long outputTextLength;
    dtsLaWordInfo * wordTable;
    long wordCount;
    void * pData;
    dtsFileInfo * pFileInfo;
    const char * searchRequestPunct;
    const char * alphabetLocation;
    const char * indexRetrievedFrom;
};

When the pAnalyze function in a language analyzer is called with a dtsLaJob, the language analyzer first decides whether to process the text. If pAnalyze returns false, dtSearch will apply its internal word breaker to the text.

If pAnalyze returns true, then dtSearch will use information in the outputBuffer and wordTable to determine where word breaks occur in the input text, and what text should be indexed for each word. 
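The return-value contract described above can be sketched as follows. This is only an illustration under stated assumptions: the real pAnalyze signature is not shown in this section, so JobSketch and analyzeSketch are hypothetical stand-ins, and the "contains a non-ASCII character" policy is an arbitrary example of how an analyzer might decide whether to handle a block.

```cpp
#include <string>

// Hypothetical, simplified stand-in for dtsLaJob (not the real structure).
struct JobSketch {
    std::wstring input;   // stands in for inputBuffer / inputTextLength
    std::wstring output;  // stands in for outputBuffer / outputTextLength
};

// Hypothetical analyzer entry point. Returning false declines the block,
// in which case dtSearch would fall back to its internal word breaker.
// Returning true signals that the analyzer produced its own output.
bool analyzeSketch(JobSketch& job) {
    // Example policy (an assumption): only handle non-ASCII text.
    bool hasNonAscii = false;
    for (wchar_t c : job.input) {
        if (c > 0x7F) {
            hasNonAscii = true;
            break;
        }
    }
    if (!hasNonAscii) {
        return false;  // decline; leave the output untouched
    }
    // A real analyzer would transform the text and fill the word table here.
    job.output = job.input;
    return true;
}
```

The point of the sketch is only that the decision is made per block, and that nothing needs to be written to the output when the analyzer declines.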

In each wordTable entry, the offsetInInputBuffer and lengthInInputBuffer specify the range of text in the input buffer that is associated with a word. The outputBuffer for the dtsLaJob contains the text as modified by the language analyzer, and the offsetInOutputBuffer and lengthInOutputBuffer for each wordTable entry specify the word to index at each word position. 
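The mapping between the two buffers can be illustrated with a small sketch. WordInfoSketch is a hypothetical stand-in that models only the members named above (the real dtsLaWordInfo is defined in dts_la.h and may differ), and the sketch assumes offsets and lengths are counted in wchar_t characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for dtsLaWordInfo; only members named in the text.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Range of the input text that this word table entry is associated with.
std::wstring inputText(const std::wstring& inputBuffer, const WordInfoSketch& w) {
    return inputBuffer.substr(static_cast<std::size_t>(w.offsetInInputBuffer),
                              static_cast<std::size_t>(w.lengthInInputBuffer));
}

// Text that would be indexed at this word position.
std::wstring indexedText(const std::wstring& outputBuffer, const WordInfoSketch& w) {
    return outputBuffer.substr(static_cast<std::size_t>(w.offsetInOutputBuffer),
                               static_cast<std::size_t>(w.lengthInOutputBuffer));
}
```

For instance, with an input of "Running Dogs" and an analyzer output of "run dog", an entry {0, 7, 0, 3, 0} would associate the input range "Running" with the indexed word "run".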

Multiple words can be generated in the output for a single word in the input. To do this, set the flag dtsLaUsePreviousWordOffset in the flags member of the wordTable entry for each additional word.

Words in the word table cannot overlap and must appear in the same order as in the input. dtSearch assumes that the values of offsetInInputBuffer will be in increasing order, and that the text ranges specified by offsetInInputBuffer and lengthInInputBuffer will not overlap.
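A language analyzer could check these constraints before returning. The following is a minimal sketch, not part of the dtSearch API: WordInfoSketch is a hypothetical stand-in for dtsLaWordInfo, the value of kUsePreviousWordOffsetSketch is a placeholder (the real dtsLaUsePreviousWordOffset value comes from dts_la.h), and the sketch assumes that entries carrying that flag reuse the previous word's input range and are therefore exempt from the overlap check.

```cpp
#include <vector>

// Hypothetical stand-in for dtsLaWordInfo; only members named in the text.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Placeholder value; the real flag is defined in dts_la.h.
const long kUsePreviousWordOffsetSketch = 1;

// Returns true if the input ranges are in increasing order and do not overlap.
bool inputRangesAreValid(const std::vector<WordInfoSketch>& table) {
    long previousEnd = 0;       // end of the last unflagged input range
    bool havePrevious = false;  // whether any unflagged entry has been seen
    for (const WordInfoSketch& w : table) {
        if (w.flags & kUsePreviousWordOffsetSketch) {
            // An extra word at the previous position needs a previous word.
            if (!havePrevious) {
                return false;
            }
            continue;
        }
        if (w.lengthInInputBuffer < 0 || w.offsetInInputBuffer < previousEnd) {
            return false;  // negative length, out of order, or overlapping
        }
        previousEnd = w.offsetInInputBuffer + w.lengthInInputBuffer;
        havePrevious = true;
    }
    return true;
}
```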

For example, suppose the input consists of the following:

abc123def456

The language analyzer could partition the text as follows:

abc, 123, def, 456

Additionally, the language analyzer could return multiple words for each word position. For example, the "def" text in the input could be associated with three different words in the output, "def", "def2", and "def3", all at the third word position (input offset 6). The resulting word table would be as follows:

offsetInInputBuffer  lengthInInputBuffer  Word in input  Word in output  flags
0                    3                    abc            abc             0
3                    3                    123            123             0
6                    3                    def            def             0
6                    3                    def            def2            dtsLaUsePreviousWordOffset
6                    3                    def            def3            dtsLaUsePreviousWordOffset
9                    3                    456            456             0
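The word table for the "abc123def456" example could be assembled along these lines. This is a sketch under several assumptions: WordInfoSketch and AnalysisSketch are hypothetical stand-ins rather than the real dtsLaWordInfo and dtsLaJob, the flag value is a placeholder for the constant in dts_la.h, and output words are simply concatenated in the output buffer with no separators (this section does not say whether separators are expected).

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for dtsLaWordInfo; only members named in the text.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Placeholder value; the real flag is defined in dts_la.h.
const long kUsePreviousWordOffsetSketch = 1;

// Hypothetical holder for what an analyzer would write into the job.
struct AnalysisSketch {
    std::wstring outputBuffer;
    std::vector<WordInfoSketch> wordTable;

    // Appends one output word and records where it sits in both buffers.
    void addWord(long inOffset, long inLength, const std::wstring& text, long flags) {
        WordInfoSketch w;
        w.offsetInInputBuffer = inOffset;
        w.lengthInInputBuffer = inLength;
        w.offsetInOutputBuffer = static_cast<long>(outputBuffer.size());
        w.lengthInOutputBuffer = static_cast<long>(text.size());
        w.flags = flags;
        outputBuffer += text;
        wordTable.push_back(w);
    }
};

// Builds the six entries from the example table for "abc123def456".
AnalysisSketch analyzeExample() {
    AnalysisSketch a;
    a.addWord(0, 3, L"abc", 0);
    a.addWord(3, 3, L"123", 0);
    a.addWord(6, 3, L"def", 0);
    a.addWord(6, 3, L"def2", kUsePreviousWordOffsetSketch);  // same position as "def"
    a.addWord(6, 3, L"def3", kUsePreviousWordOffsetSketch);  // same position as "def"
    a.addWord(9, 3, L"456", 0);
    return a;
}
```

Recording the output offset inside addWord keeps offsetInOutputBuffer and lengthInOutputBuffer consistent with the buffer contents, so the offsets never need to be computed by hand.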