dtSearch Text Retrieval Engine Programmer's Reference 7.70
dtsLaJob Structure

Passed to the language analyzer with a series of blocks of text to analyze.

struct dtsLaJob {
  const wchar_t * inputBuffer;
  long inputTextLength;
  long flags;
  const wchar_t * outputBuffer;
  long outputTextLength;
  dtsLaWordInfo * wordTable;
  long wordCount;
  void * pData;
  dtsFileInfo * pFileInfo;
  const char * searchRequestPunct;
  const char * alphabetLocation;
  const char * indexRetrievedFrom;
};
File

dts_la.h

Remarks

When the pAnalyze function in a language analyzer is called with a dtsLaJob to analyze, the language analyzer first decides whether it will process the text or not. If pAnalyze returns false, dtSearch will apply its internal word breaker to the text. 

If pAnalyze returns true, then dtSearch will use information in the outputBuffer and wordTable to determine where word breaks occur in the input text, and what text should be indexed for each word. 
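The decline-or-handle decision can be sketched as follows. This is only an illustration: LaJobSketch is a hypothetical stand-in for dtsLaJob that carries just the fields used here, analyzeSketch is a hypothetical name (the real pAnalyze signature is declared in dts_la.h and is not reproduced here), and the U+3040 cut-off is an arbitrary policy chosen for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for dtsLaJob; only the fields this sketch needs.
struct LaJobSketch {
    const wchar_t* inputBuffer;
    long inputTextLength;
    long wordCount;
};

// Assumed policy for illustration: handle a block only if it contains at
// least one character at or above U+3040; otherwise decline it.
bool analyzeSketch(LaJobSketch& job) {
    bool wanted = false;
    for (long i = 0; i < job.inputTextLength; ++i) {
        if (job.inputBuffer[i] >= 0x3040) {
            wanted = true;
            break;
        }
    }
    if (!wanted) {
        job.wordCount = 0;
        return false;  // decline: dtSearch applies its internal word breaker
    }
    // A real analyzer would fill outputBuffer and wordTable here before
    // returning true; that step is omitted from this sketch.
    return true;
}
```

Returning false for a block leaves that block to the internal word breaker, so an analyzer of this shape can be limited to the text it is designed for.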

In each wordTable entry, the offsetInInputBuffer and lengthInInputBuffer specify the range of text in the input buffer that is associated with a word. The outputBuffer for the dtsLaJob contains the text as modified by the language analyzer, and the offsetInOutputBuffer and lengthInOutputBuffer for each wordTable entry specify the word to index at each word position. 

Multiple words can be generated in the output for a single word in the input. To do this, set the dtsLaUsePreviousWordOffset flag in the flags member of the wordTable entry for each additional word, so that it is indexed at the same word position as the preceding entry. 

Words in the input cannot overlap, and they must be listed in the word table in the same order in which they occur in the input. dtSearch assumes that the values of offsetInInputBuffer are in increasing order and that the text ranges specified by offsetInInputBuffer and lengthInInputBuffer do not overlap. 
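A language analyzer could check these constraints on its own word table before returning it. The sketch below is one possible check, not part of the dtSearch API: WordInfoSketch is a hypothetical stand-in for dtsLaWordInfo, kSketchUsePreviousWordOffset is a placeholder value rather than the real constant from dts_la.h, and skipping flagged entries and rejecting non-positive lengths are both assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

// Placeholder flag value for illustration; the real value is in dts_la.h.
const long kSketchUsePreviousWordOffset = 0x1;

// Hypothetical stand-in for the input-side fields of dtsLaWordInfo.
struct WordInfoSketch {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long flags;
};

// Returns true when input ranges are in increasing order, do not overlap,
// and stay inside the input text.
bool inputRangesAreOrdered(const std::vector<WordInfoSketch>& table,
                           long inputTextLength) {
    long previousEnd = 0;
    for (const WordInfoSketch& w : table) {
        // Assumption: entries that reuse the previous word position are
        // not subject to the ordering check.
        if (w.flags & kSketchUsePreviousWordOffset) continue;
        if (w.lengthInInputBuffer <= 0) return false;
        if (w.offsetInInputBuffer < previousEnd) return false;  // overlap or out of order
        long end = w.offsetInInputBuffer + w.lengthInInputBuffer;
        if (end > inputTextLength) return false;  // runs past the input
        previousEnd = end;
    }
    return true;
}
```

Tracking only the end of the previous range is enough, because a range that starts at or after that end is both later than and disjoint from everything before it.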

For example, suppose the input consists of the following:

abc123def456

The language analyzer could partition the text as follows:

abc, 123, def, 456

Additionally, the language analyzer could return multiple words for each word position. For example, the "def" text in the input could be associated with three different words in the output, "def", "def2", and "def3", all at word offset 3. The resulting word table would be as follows:

offsetInInputBuffer  lengthInInputBuffer  Word in input  Word in output  Flags
0                    3                    abc            abc
3                    3                    123            123
6                    3                    def            def
6                    3                    def            def2            dtsLaUsePreviousWordOffset
6                    3                    def            def3            dtsLaUsePreviousWordOffset
9                    3                    456            456
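The same word table can be built in code. The following is a sketch under stated assumptions: ExampleWordInfo and ExampleResult are hypothetical stand-ins for dtsLaWordInfo and the output side of dtsLaJob, kExampleUsePreviousWordOffset is a placeholder value rather than the real constant, the space separator in the output text is a choice of this sketch, and repeating the input range of "def" on the flagged entries is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Placeholder flag value for illustration; the real value is in dts_la.h.
const long kExampleUsePreviousWordOffset = 0x1;

// Hypothetical stand-in for dtsLaWordInfo, using the field names from the text.
struct ExampleWordInfo {
    long offsetInInputBuffer;
    long lengthInInputBuffer;
    long offsetInOutputBuffer;
    long lengthInOutputBuffer;
    long flags;
};

struct ExampleResult {
    std::wstring outputBuffer;               // text as modified by the analyzer
    std::vector<ExampleWordInfo> wordTable;  // one entry per output word
};

// Appends one word to the output text and records both of its ranges.
void addWord(ExampleResult& r, long inOffset, long inLength,
             const std::wstring& outWord, long flags) {
    ExampleWordInfo w;
    w.offsetInInputBuffer = inOffset;
    w.lengthInInputBuffer = inLength;
    w.offsetInOutputBuffer = static_cast<long>(r.outputBuffer.size());
    w.lengthInOutputBuffer = static_cast<long>(outWord.size());
    w.flags = flags;
    r.outputBuffer += outWord;
    r.outputBuffer += L' ';  // separator chosen by this sketch
    r.wordTable.push_back(w);
}

// Builds the table for the input "abc123def456".
ExampleResult buildExample() {
    ExampleResult r;
    addWord(r, 0, 3, L"abc", 0);
    addWord(r, 3, 3, L"123", 0);
    addWord(r, 6, 3, L"def", 0);
    addWord(r, 6, 3, L"def2", kExampleUsePreviousWordOffset);
    addWord(r, 6, 3, L"def3", kExampleUsePreviousWordOffset);
    addWord(r, 9, 3, L"456", 0);
    return r;
}
```

Because each output word is appended as it is recorded, the output offsets never need to be computed by hand, and only the input ranges have to be supplied by the caller.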
Data Members
Data Member         Description
alphabetLocation    Location of the alphabet file to use for word breaking.
indexRetrievedFrom  Path to the index that this document was found in, if the document was retrieved in a search.
inputBuffer         Text to be processed.
inputTextLength     Length of the text to be processed, in characters.
outputBuffer        Pointer to text after morphological analysis.
outputTextLength    Length of the output text, in characters.
pData               Pointer to instance data allocated by the pInitializeJob function, and to be released by the pDestroyJob function.
pFileInfo           Information about the document that is the source of the text, if the input is from a document.
searchRequestPunct  Search request punctuation characters to preserve in the output, when dtsLaInputIsSearchRequest is set in flags.
wordCount           Number of items in wordTable.
wordTable           Table of structures providing, for each word in the inputBuffer, the offset and length in the inputBuffer, and the offset and length in the outputBuffer, of the word.
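The pData lifecycle described above might look like the following. This is a sketch only: PDataJobSketch stands in for dtsLaJob, AnalyzerStateSketch is a made-up state type, and the three function names and signatures are hypothetical, since the real pInitializeJob and pDestroyJob declarations are in dts_la.h and are not reproduced here.

```cpp
#include <cassert>

// Hypothetical stand-in for dtsLaJob; only pData is modeled.
struct PDataJobSketch {
    void* pData;
};

// Made-up per-job state kept across the series of text blocks.
struct AnalyzerStateSketch {
    long blocksSeen;
};

// Plays the role of pInitializeJob: allocate the instance data.
void initializeJobSketch(PDataJobSketch& job) {
    job.pData = new AnalyzerStateSketch{0};
}

// Called once per block in this sketch; reads the state back from pData.
void noteBlockSketch(PDataJobSketch& job) {
    AnalyzerStateSketch* state = static_cast<AnalyzerStateSketch*>(job.pData);
    ++state->blocksSeen;
}

// Plays the role of pDestroyJob: release what the initializer allocated.
void destroyJobSketch(PDataJobSketch& job) {
    delete static_cast<AnalyzerStateSketch*>(job.pData);
    job.pData = nullptr;
}
```

Keeping the state behind pData rather than in globals lets several jobs run independently, provided each allocation is paired with exactly one release.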
Methods
Method 
Description 
Initialize a dtsLaJob 
Constructor 
Copyright (c) 1995-2012 dtSearch Corp. All rights reserved.