dtsLaJob Structure

Passed to the language analyzer with a series of blocks of text to analyze

File

File: dts_la.h

Syntax

C++

struct dtsLaJob {
    long structSize;
    const wchar_t * inputBuffer;
    long inputTextLength;
    long flags;
    const wchar_t * outputBuffer;
    long outputTextLength;
    dtsLaWordInfo * wordTable;
    long wordCount;
    void * pData;
    dtsFileInfo * pFileInfo;
    const char * searchRequestPunct;
    const char * alphabetLocation;
    const char * indexRetrievedFrom;
};

Data Members

Data Member	Description
alphabetLocation	Location of the alphabet file to use for word breaking.
flags	LanguageAnalyzerJobFlags values.
indexRetrievedFrom	Path to the index that this document was found in, if the document was retrieved in a search.
inputBuffer	Text to be processed.
inputTextLength	Length of the text to be processed, in characters.
outputBuffer	Pointer to the text after morphological analysis. This pointer must remain valid until the next block of data is analyzed, or until the dtsLaJob is destroyed.
outputTextLength	Length of the output text, in characters.
pData	Pointer to instance data allocated by the pInitializeJob function and released by the pDestroyJob function.
pFileInfo	Information about the document that is the source of the text, if the input is from a document.
searchRequestPunct	Search request punctuation characters to preserve in the output when dtsLaInputIsSearchRequest is set in flags.
structSize	Initialize to sizeof(dtsLaJob).
wordCount	Number of items in wordTable.
wordTable	Table of structures providing, for each word in the inputBuffer, the word's offset and length in the inputBuffer and its offset and length in the outputBuffer.

Group

Language Analyzer API

Members

Methods

Method	Description
dtsLaJob	Constructor

Remarks

When the pAnalyze function in a language analyzer is called with a dtsLaJob to analyze, the language analyzer first decides whether it will process the text or not. If pAnalyze returns false, dtSearch will apply its internal word breaker to the text.

If pAnalyze returns true, then dtSearch will use information in the outputBuffer and wordTable to determine where word breaks occur in the input text, and what text should be indexed for each word.

In each wordTable entry, the offsetInInputBuffer and lengthInInputBuffer specify the range of text in the input buffer that is associated with a word. The outputBuffer for the dtsLaJob contains the text as modified by the language analyzer, and the offsetInOutputBuffer and lengthInOutputBuffer for each wordTable entry specify the word to index at each word position.
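The relationship between the two buffers and a wordTable entry can be sketched as follows. This is an illustration only: WordInfoSketch is a hypothetical stand-in that mirrors the field names described above, not the real dtsLaWordInfo declaration in dts_la.h, and resolveWords is a helper invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirror of the fields named in the text; the real
// dtsLaWordInfo in dts_la.h may differ in types and layout.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// For each table entry, returns the pair (text in input, text to index).
std::vector<std::pair<std::wstring, std::wstring>> resolveWords(
    const std::wstring& input,
    const std::wstring& output,
    const std::vector<WordInfoSketch>& table)
{
    std::vector<std::pair<std::wstring, std::wstring>> result;
    for (const WordInfoSketch& w : table) {
        // Offsets and lengths are expected to be non-negative.
        assert(w.offsetInInputBuffer >= 0 && w.lengthInInputBuffer >= 0);
        assert(w.offsetInOutputBuffer >= 0 && w.lengthInOutputBuffer >= 0);
        result.emplace_back(
            input.substr(static_cast<std::size_t>(w.offsetInInputBuffer),
                         static_cast<std::size_t>(w.lengthInInputBuffer)),
            output.substr(static_cast<std::size_t>(w.offsetInOutputBuffer),
                          static_cast<std::size_t>(w.lengthInOutputBuffer)));
    }
    return result;
}
```

For instance, with the input "Mice ran" and an analyzer output of "mouse run", two entries would map "Mice" to "mouse" and "ran" to "run".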

Multiple words can be generated in the output for a single word in the input. To do this, add one wordTable entry for each additional output word and set the flag dtsLaUsePreviousWordOffset in the flags member of each of those additional entries.

Words cannot overlap in the input, and they cannot appear in the word table in a different order from the input. dtSearch assumes that the values of offsetInInputBuffer are in increasing order and that the text ranges specified by offsetInInputBuffer and lengthInInputBuffer do not overlap.
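A language analyzer may want to check these constraints before returning a word table. The following is a minimal sketch, not part of the dtSearch API. WordInfoSketch and kUsePreviousWordOffsetSketch are hypothetical stand-ins (the real flag value is defined in dts_la.h), and the check assumes that entries flagged with dtsLaUsePreviousWordOffset repeat the input range of the previous word, as in the example table in this section, and are therefore skipped.

```cpp
#include <cassert>
#include <vector>

// Placeholder value for illustration; the real dtsLaUsePreviousWordOffset
// constant comes from dts_la.h.
const long kUsePreviousWordOffsetSketch = 0x1;

// Hypothetical mirror of the word table entry fields named in the text.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Returns true if the input ranges are in increasing order and do not overlap.
bool inputRangesAreOrdered(const std::vector<WordInfoSketch>& table)
{
    long previousEnd = 0;
    for (const WordInfoSketch& w : table) {
        // A negative length is treated as a programming error.
        assert(w.lengthInInputBuffer >= 0);
        if (w.flags & kUsePreviousWordOffsetSketch) {
            continue; // additional word at the previous word position
        }
        if (w.offsetInInputBuffer < previousEnd) {
            return false; // out of order, or overlaps the previous word
        }
        previousEnd = w.offsetInInputBuffer + w.lengthInInputBuffer;
    }
    return true;
}
```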

For example, suppose the input consists of the following:

abc123def456

The language analyzer could partition the text as follows:

abc, 123, def, 456

Additionally, the language analyzer could return multiple words for each word position. For example, the "def" text in the input could be associated with three different words in the output, "def", "def2", and "def3", all at word offset 3. The resulting word table would be as follows:

offsetInInputBuffer	lengthInInputBuffer	Word in input	Word in output	flags
0	3	abc	abc	0
3	3	123	123	0
6	3	def	def	0
6	3	def	def2	dtsLaUsePreviousWordOffset
6	3	def	def3	dtsLaUsePreviousWordOffset
9	3	456	456	0
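
The table above could be produced by code along the following lines. This is a sketch under stated assumptions rather than the dtSearch implementation: WordInfoSketch, AnalysisResultSketch, addWord, and analyzeExample are hypothetical names, kUsePreviousWordOffsetSketch is a placeholder for the real dtsLaUsePreviousWordOffset value in dts_la.h, and separating output words with a space is a readability choice for this example, not a documented requirement.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Placeholder value; the real flag constant is defined in dts_la.h.
const long kUsePreviousWordOffsetSketch = 0x1;

// Hypothetical mirror of the word table entry fields named in the text.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

// Holds the output text and the word table built for one block of input.
struct AnalysisResultSketch {
    std::wstring output;
    std::vector<WordInfoSketch> words;
};

// Appends one output word and records where it sits in both buffers.
void addWord(AnalysisResultSketch& r, long inOffset, long inLength,
             const std::wstring& outWord, long flags)
{
    assert(inOffset >= 0 && inLength > 0);
    WordInfoSketch w;
    w.offsetInInputBuffer = inOffset;
    w.lengthInInputBuffer = inLength;
    w.offsetInOutputBuffer = static_cast<long>(r.output.size());
    w.lengthInOutputBuffer = static_cast<long>(outWord.size());
    w.flags = flags;
    r.output += outWord;
    r.output += L' '; // separator chosen for this sketch only
    r.words.push_back(w);
}

// Builds the word table for the input "abc123def456".
AnalysisResultSketch analyzeExample()
{
    AnalysisResultSketch r;
    addWord(r, 0, 3, L"abc", 0);
    addWord(r, 3, 3, L"123", 0);
    addWord(r, 6, 3, L"def", 0);
    addWord(r, 6, 3, L"def2", kUsePreviousWordOffsetSketch);
    addWord(r, 6, 3, L"def3", kUsePreviousWordOffsetSketch);
    addWord(r, 9, 3, L"456", 0);
    return r;
}
```

In a real analyzer, the resulting output text and table would be exposed through the outputBuffer, outputTextLength, wordTable, and wordCount members of the dtsLaJob, with the storage kept alive for as long as the outputBuffer description requires.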