A dtsLaJob is passed to the language analyzer with a series of blocks of text to analyze.
File: dts_la.h
| Data Member | Description |
|---|---|
| | Location of the alphabet file to use for word breaking. |
| | LanguageAnalyzerJobFlags values. |
| | Path to the index that this document was found in, if the document was retrieved in a search. |
| | Text to be processed. |
| | Length of the text to be processed, in characters. |
| | Pointer to text after morphological analysis. This pointer must remain valid until the next block of data is analyzed, or until the dtsLaJob is destroyed. |
| | Length of the output text, in characters. |
| | Pointer to instance data allocated by the pInitializeJob() function, and to be released by the pDestroyJob() function. |
| | Information about the document that is the source of the text, if the input is from a document. |
| | Search request punctuation characters to preserve in the output, when dtsLaInputIsSearchRequest is set in flags. |
| | Number of items in wordTable. |
| | Table of structures providing, for each word in the inputBuffer, the offset and length of the word in the inputBuffer and its offset and length in the outputBuffer. |
| Method | Description |
|---|---|
| | Constructor |
When the pAnalyze function in a language analyzer is called with a dtsLaJob to analyze, the language analyzer first decides whether it will process the text or not. If pAnalyze returns false, dtSearch will apply its internal word breaker to the text.
If pAnalyze returns true, then dtSearch will use information in the outputBuffer and wordTable to determine where word breaks occur in the input text, and what text should be indexed for each word.
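The accept-or-decline decision described above can be sketched as follows. This is a minimal illustration, not the dtSearch API: `JobSketch`, `WordEntrySketch`, and `analyzeSketch` are hypothetical stand-ins (the real types and the real pAnalyze signature are defined in dts_la.h), and `std::string` buffers are assumed for simplicity. The toy analyzer declines any input containing non-ASCII bytes and otherwise lowercases each alphanumeric run.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a wordTable entry (field names taken from the text).
struct WordEntrySketch {
    std::size_t offsetInInputBuffer;
    std::size_t lengthInInputBuffer;
    std::size_t offsetInOutputBuffer;
    std::size_t lengthInOutputBuffer;
    unsigned flags;
};

// Hypothetical stand-in for the parts of a job that this sketch uses.
struct JobSketch {
    std::string inputBuffer;
    std::string outputBuffer;
    std::vector<WordEntrySketch> wordTable;
};

// Returns false to decline the text (the caller would then fall back to its
// own word breaker); returns true after filling outputBuffer and wordTable.
bool analyzeSketch(JobSketch& job) {
    for (unsigned char c : job.inputBuffer) {
        if (c > 127) return false;  // this toy analyzer only handles ASCII
    }
    job.outputBuffer.clear();
    job.wordTable.clear();
    const std::size_t n = job.inputBuffer.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip separators, then collect one alphanumeric run.
        while (i < n && !std::isalnum(static_cast<unsigned char>(job.inputBuffer[i]))) ++i;
        const std::size_t start = i;
        while (i < n && std::isalnum(static_cast<unsigned char>(job.inputBuffer[i]))) ++i;
        if (i == start) break;  // only separators remained
        if (!job.outputBuffer.empty()) job.outputBuffer.push_back(' ');
        const std::size_t outStart = job.outputBuffer.size();
        for (std::size_t k = start; k < i; ++k) {
            const unsigned char c = static_cast<unsigned char>(job.inputBuffer[k]);
            job.outputBuffer.push_back(static_cast<char>(std::tolower(c)));
        }
        job.wordTable.push_back({start, i - start, outStart, i - start, 0u});
    }
    return true;
}
```

In this sketch the return value is the only signal to the caller: a `false` result leaves the job's output untouched, while a `true` result means the output buffer and word table describe every word in the block.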
In each wordTable entry, the offsetInInputBuffer and lengthInInputBuffer specify the range of text in the input buffer that is associated with a word. The outputBuffer for the dtsLaJob contains the text as modified by the language analyzer, and the offsetInOutputBuffer and lengthInOutputBuffer for each wordTable entry specify the word to index at each word position.
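To make the two pairs of offsets concrete, the sketch below reads the source range and the indexed word for each entry. `WordEntrySketch`, `sourceText`, and `wordsToIndex` are hypothetical names used only for illustration; the real entry structure is defined in dts_la.h, and the use of `std::string` buffers is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a wordTable entry (field names taken from the text).
struct WordEntrySketch {
    std::size_t offsetInInputBuffer;
    std::size_t lengthInInputBuffer;
    std::size_t offsetInOutputBuffer;
    std::size_t lengthInOutputBuffer;
    unsigned flags;
};

// Returns the range of the input text that an entry is associated with.
std::string sourceText(const std::string& inputBuffer, const WordEntrySketch& e) {
    assert(e.offsetInInputBuffer + e.lengthInInputBuffer <= inputBuffer.size());
    return inputBuffer.substr(e.offsetInInputBuffer, e.lengthInInputBuffer);
}

// Returns, for each entry in order, the word that would be indexed,
// read from the output buffer.
std::vector<std::string> wordsToIndex(const std::string& outputBuffer,
                                      const std::vector<WordEntrySketch>& table) {
    std::vector<std::string> words;
    for (const WordEntrySketch& e : table) {
        assert(e.offsetInOutputBuffer + e.lengthInOutputBuffer <= outputBuffer.size());
        words.push_back(outputBuffer.substr(e.offsetInOutputBuffer, e.lengthInOutputBuffer));
    }
    return words;
}
```

For instance, with a made-up input of "Running dogs" and a made-up analyzer output of "run dog", the entries {0, 7, 0, 3, 0} and {8, 4, 4, 3, 0} associate "Running" with the indexed word "run" and "dogs" with "dog".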
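Multiple words can be generated in the output for a single word in the input. To do this, add one wordTable entry for each output word and set the dtsLaUsePreviousWordOffset flag in the flags member of every entry after the first, so that the additional words share the word offset of the preceding entry.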
Words in the input cannot overlap and cannot appear in a different order from the input text. dtSearch assumes that the values of offsetInInputBuffer are in increasing order and that the text ranges specified by offsetInInputBuffer and lengthInInputBuffer do not overlap.
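A language analyzer could check these constraints before returning its word table. The sketch below is one possible check, not part of dtSearch: `WordEntrySketch` and `isValidWordTable` are hypothetical, the numeric value of `kUsePreviousWordOffsetSketch` is a placeholder for the real dtsLaUsePreviousWordOffset constant in dts_la.h, and it is an assumption (based on the example table in this section) that flagged entries may repeat the preceding entry's input range.

```cpp
#include <cstddef>
#include <vector>

// Placeholder value; the real dtsLaUsePreviousWordOffset constant is in dts_la.h.
constexpr unsigned kUsePreviousWordOffsetSketch = 1u;

// Hypothetical stand-in for a wordTable entry (field names taken from the text).
struct WordEntrySketch {
    std::size_t offsetInInputBuffer;
    std::size_t lengthInInputBuffer;
    std::size_t offsetInOutputBuffer;
    std::size_t lengthInOutputBuffer;
    unsigned flags;
};

// Returns true if every input range lies inside the input, and the ranges of
// unflagged entries appear in increasing order without overlapping.
bool isValidWordTable(const std::vector<WordEntrySketch>& table,
                      std::size_t inputLength) {
    std::size_t previousEnd = 0;
    bool havePrevious = false;
    for (const WordEntrySketch& e : table) {
        const std::size_t end = e.offsetInInputBuffer + e.lengthInInputBuffer;
        if (end > inputLength) return false;   // range runs past the input
        if (e.flags & kUsePreviousWordOffsetSketch) {
            if (!havePrevious) return false;   // nothing to attach to
            continue;                          // shares the previous word's slot
        }
        if (havePrevious && e.offsetInInputBuffer < previousEnd) {
            return false;                      // out of order or overlapping
        }
        previousEnd = end;
        havePrevious = true;
    }
    return true;
}
```

Skipping the overlap test for flagged entries is a design choice of this sketch; a stricter check could also require that a flagged entry's input range equals that of the entry before it.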
For example, suppose the input consists of the following:
abc123def456
The language analyzer could partition the text as follows:
abc 123 def 456
Additionally, the language analyzer could return multiple words for each word position. For example, the "def" text in the input could be associated with three different words in the output, "def", "def2", and "def3", all at word offset 3. The resulting word table would be as follows:
| offsetInInputBuffer | lengthInInputBuffer | Word in input | Word in output | flags |
|---|---|---|---|---|
| 0 | 3 | abc | abc | 0 |
| 3 | 3 | 123 | 123 | 0 |
| 6 | 3 | def | def | 0 |
| 6 | 3 | def | def2 | dtsLaUsePreviousWordOffset |
| 6 | 3 | def | def3 | dtsLaUsePreviousWordOffset |
| 9 | 3 | 456 | 456 | 0 |
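The example word table can be written out in code to show how the flag affects word positions. This is a sketch under stated assumptions: `WordEntrySketch`, `exampleWordTable`, and `wordsWithPositions` are hypothetical, the value of `kUsePreviousWordOffsetSketch` is a placeholder for the real dtsLaUsePreviousWordOffset constant, and the output buffer layout "abc 123 def def2 def3 456" is invented for illustration because the example does not show the output offsets. Positions are counted from 1, and a flagged entry is assumed to reuse the position of the entry before it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Placeholder value; the real dtsLaUsePreviousWordOffset constant is in dts_la.h.
constexpr unsigned kUsePreviousWordOffsetSketch = 1u;

// Hypothetical stand-in for a wordTable entry (field names taken from the text).
struct WordEntrySketch {
    std::size_t offsetInInputBuffer;
    std::size_t lengthInInputBuffer;
    std::size_t offsetInOutputBuffer;
    std::size_t lengthInOutputBuffer;
    unsigned flags;
};

// Word table for the input "abc123def456", with output offsets chosen for the
// assumed output buffer "abc 123 def def2 def3 456".
std::vector<WordEntrySketch> exampleWordTable() {
    const unsigned prev = kUsePreviousWordOffsetSketch;
    return {
        {0, 3, 0, 3, 0u},     // abc -> abc
        {3, 3, 4, 3, 0u},     // 123 -> 123
        {6, 3, 8, 3, 0u},     // def -> def
        {6, 3, 12, 4, prev},  // def -> def2, same position as def
        {6, 3, 17, 4, prev},  // def -> def3, same position as def
        {9, 3, 22, 3, 0u},    // 456 -> 456
    };
}

// Pairs each indexed word with its 1-based word position. The position only
// advances for entries that do not carry the "use previous offset" flag.
std::vector<std::pair<std::size_t, std::string>>
wordsWithPositions(const std::string& outputBuffer,
                   const std::vector<WordEntrySketch>& table) {
    std::vector<std::pair<std::size_t, std::string>> result;
    std::size_t position = 0;
    for (const WordEntrySketch& e : table) {
        if (!(e.flags & kUsePreviousWordOffsetSketch)) ++position;
        assert(e.offsetInOutputBuffer + e.lengthInOutputBuffer <= outputBuffer.size());
        result.emplace_back(position,
                            outputBuffer.substr(e.offsetInOutputBuffer,
                                                e.lengthInOutputBuffer));
    }
    return result;
}
```

Calling `wordsWithPositions("abc 123 def def2 def3 456", exampleWordTable())` yields six words, with "def", "def2", and "def3" all at position 3 and "456" at position 4.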