Passed to the language analyzer with a series of blocks of text to analyze.
struct dtsLaJob {
    const wchar_t * inputBuffer;
    long inputTextLength;
    long flags;
    const wchar_t * outputBuffer;
    long outputTextLength;
    dtsLaWordInfo * wordTable;
    long wordCount;
    void * pData;
    dtsFileInfo * pFileInfo;
    const char * searchRequestPunct;
    const char * alphabetLocation;
    const char * indexRetrievedFrom;
};
dts_la.h
When the pAnalyze function in a language analyzer is called with a dtsLaJob to analyze, the language analyzer first decides whether it will process the text or not. If pAnalyze returns false, dtSearch will apply its internal word breaker to the text.
If pAnalyze returns true, then dtSearch will use information in the outputBuffer and wordTable to determine where word breaks occur in the input text, and what text should be indexed for each word.
In each wordTable entry, the offsetInInputBuffer and lengthInInputBuffer specify the range of text in the input buffer that is associated with a word. The outputBuffer for the dtsLaJob contains the text as modified by the language analyzer, and the offsetInOutputBuffer and lengthInOutputBuffer for each wordTable entry specify the word to index at each word position.
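The relationship between a wordTable entry and the two buffers can be sketched as follows. This is a minimal illustration, not code from dts_la.h: WordInfo and Job are hypothetical stand-in types that carry only the members named above, the member types of WordInfo are assumed, and inputTextOf and indexedWordOf are helper names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for dtsLaWordInfo and dtsLaJob so that the sketch
// compiles on its own. Real code would include dts_la.h instead; the member
// types of WordInfo are assumptions.
struct WordInfo {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

struct Job {
    const wchar_t* inputBuffer;
    long inputTextLength;
    const wchar_t* outputBuffer;
    long outputTextLength;
    const WordInfo* wordTable;
    long wordCount;
};

// Returns the range of input text that word table entry i is associated with.
std::wstring inputTextOf(const Job& job, long i) {
    assert(i >= 0 && i < job.wordCount);
    const WordInfo& w = job.wordTable[i];
    assert(w.offsetInInputBuffer >= 0 &&
           w.offsetInInputBuffer + w.lengthInInputBuffer <= job.inputTextLength);
    return std::wstring(job.inputBuffer + w.offsetInInputBuffer,
                        static_cast<std::size_t>(w.lengthInInputBuffer));
}

// Returns the word that would be indexed at the position of entry i.
std::wstring indexedWordOf(const Job& job, long i) {
    assert(i >= 0 && i < job.wordCount);
    const WordInfo& w = job.wordTable[i];
    assert(w.offsetInOutputBuffer >= 0 &&
           w.offsetInOutputBuffer + w.lengthInOutputBuffer <= job.outputTextLength);
    return std::wstring(job.outputBuffer + w.offsetInOutputBuffer,
                        static_cast<std::size_t>(w.lengthInOutputBuffer));
}
```

For instance, if a hypothetical stemming analyzer turned the input "Running fast" into the output "run fast", the first entry would map input range (0, 7) to output range (0, 3), so inputTextOf would return "Running" and indexedWordOf would return "run".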
Multiple words can be generated in the output for a single word in the input. To do this, add a wordTable entry for each additional word and set the flag dtsLaUsePreviousWordOffset in the flags member of that entry.
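Words in the input cannot overlap, and cannot appear in a different order from the input text. dtSearch assumes that the values of offsetInInputBuffer are in increasing order, and that the text ranges specified by offsetInInputBuffer and lengthInInputBuffer do not overlap. Entries flagged with dtsLaUsePreviousWordOffset repeat the input range of the preceding entry, as the example table in this topic shows.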
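A language analyzer could check these ordering rules before returning its word table. The following is a sketch under stated assumptions: WordInfo is a hypothetical stand-in for dtsLaWordInfo, kUsePreviousWordOffset is a placeholder for the real dtsLaUsePreviousWordOffset value defined in dts_la.h, and the requirement that a flagged entry repeat the previous entry's input range is inferred from the example table in this topic rather than stated as a rule.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for dtsLaWordInfo; member types are assumptions.
struct WordInfo {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Placeholder value only. The real dtsLaUsePreviousWordOffset constant is
// defined in dts_la.h and may differ.
const long kUsePreviousWordOffset = 1;

// Returns true if the input ranges are in order, do not overlap, and stay
// inside the input text.
bool inputRangesAreOrdered(const std::vector<WordInfo>& table, long inputTextLength) {
    assert(inputTextLength >= 0);
    long prevEnd = 0;       // end of the most recent unflagged range
    long prevOffset = -1;   // range of the most recent unflagged entry
    long prevLength = -1;
    for (std::size_t i = 0; i < table.size(); ++i) {
        const WordInfo& w = table[i];
        if (w.offsetInInputBuffer < 0 || w.lengthInInputBuffer < 0)
            return false;
        if (w.offsetInInputBuffer + w.lengthInInputBuffer > inputTextLength)
            return false;
        if (w.flags & kUsePreviousWordOffset) {
            // Additional word at the previous position: assumed to repeat
            // the previous entry's input range.
            if (i == 0)
                return false;
            if (w.offsetInInputBuffer != prevOffset || w.lengthInInputBuffer != prevLength)
                return false;
            continue;
        }
        if (w.offsetInInputBuffer < prevEnd)
            return false;   // overlaps or precedes the previous range
        prevOffset = w.offsetInInputBuffer;
        prevLength = w.lengthInInputBuffer;
        prevEnd = prevOffset + prevLength;
    }
    return true;
}
```

A check like this is cheap to run in debug builds and catches the two mistakes the rules rule out: ranges emitted out of order and ranges that share characters.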
For example, suppose the input consists of the following:
abc123def456
The language analyzer could partition the text as follows:
abc, 123, def, 456
Additionally, the language analyzer could return multiple words for each word position. For example, the "def" text in the input could be associated with three different words in the output, "def", "def2", and "def3", all at the same word position (the third word). The resulting word table would be as follows:
| offsetInInputBuffer | lengthInInputBuffer | Word in input | Word in output | flags |
| --- | --- | --- | --- | --- |
| 0 | 3 | abc | abc | 0 |
| 3 | 3 | 123 | 123 | 0 |
| 6 | 3 | def | def | 0 |
| 6 | 3 | def | def2 | dtsLaUsePreviousWordOffset |
| 6 | 3 | def | def3 | dtsLaUsePreviousWordOffset |
| 9 | 3 | 456 | 456 | 0 |
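The word table above could be produced by code along the following lines. This is a toy sketch, not a real language analyzer: WordInfo is a hypothetical stand-in for dtsLaWordInfo, kUsePreviousWordOffset is a placeholder for the constant defined in dts_la.h, the choice to separate words in the output text with a single space is an assumption made for readability, and the extra "def2" and "def3" words are hard-coded to mirror the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical stand-in for dtsLaWordInfo; member types are assumptions.
struct WordInfo {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Placeholder value only; the real constant is defined in dts_la.h.
const long kUsePreviousWordOffset = 1;

struct Analysis {
    std::wstring output;          // text an analyzer would expose as outputBuffer
    std::vector<WordInfo> table;  // entries an analyzer would expose as wordTable
};

// Appends one word to the output text and records its word table entry.
static void addWord(Analysis& a, long inOffset, long inLength,
                    const std::wstring& word, long flags) {
    assert(inLength > 0);
    if (!a.output.empty())
        a.output += L' ';  // separator between output words (an assumption)
    WordInfo w = { inOffset, inLength,
                   static_cast<long>(a.output.size()),
                   static_cast<long>(word.size()), flags };
    a.output += word;
    a.table.push_back(w);
}

// Splits the input into runs of letters and runs of digits, and attaches two
// extra output words to "def" at the same word position.
Analysis analyzeExample(const std::wstring& input) {
    Analysis a;
    std::size_t i = 0;
    while (i < input.size()) {
        if (!std::iswalnum(input[i])) { ++i; continue; }
        const bool digits = std::iswdigit(input[i]) != 0;
        const std::size_t start = i;
        while (i < input.size() && std::iswalnum(input[i]) &&
               (std::iswdigit(input[i]) != 0) == digits)
            ++i;
        const std::wstring word = input.substr(start, i - start);
        const long off = static_cast<long>(start);
        const long len = static_cast<long>(i - start);
        addWord(a, off, len, word, 0);
        if (word == L"def") {
            addWord(a, off, len, L"def2", kUsePreviousWordOffset);
            addWord(a, off, len, L"def3", kUsePreviousWordOffset);
        }
    }
    return a;
}
```

Calling analyzeExample(L"abc123def456") yields six entries whose input offsets, input lengths, and flags match the rows of the table, with the output text "abc 123 def def2 def3 456".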
| Data Member | Description |
| --- | --- |
| alphabetLocation | Location of the alphabet file to use for word breaking. |
| flags | LanguageAnalyzerJobFlags values. |
| indexRetrievedFrom | Path to the index that this document was found in, if the document was retrieved in a search. |
| inputBuffer | Text to be processed. |
| inputTextLength | Length of the text to be processed, in characters. |
| outputBuffer | Pointer to text after morphological analysis. |
| outputTextLength | Length of the output text, in characters. |
| pData | Pointer to instance data allocated by the pInitializeJob function, and to be released by the pDestroyJob function. |
| pFileInfo | Information about the document that is the source of the text, if the input is from a document. |
| searchRequestPunct | Search request punctuation characters to preserve in the output, when dtsLaInputIsSearchRequest is set in flags. |
| wordCount | Number of items in wordTable. |
| wordTable | Table of structures providing, for each word in the inputBuffer, the offset and length of the word in the inputBuffer and in the outputBuffer. |
Copyright (c) 1995-2012 dtSearch Corp. All rights reserved.