dtSearch .NET Standard API 2023.02
FileConverter Class

Convert files to HTML, RTF, XML, or text, optionally marking hits with caller-supplied tags.

dtSearch.Engine.FileConverter
public class FileConverter : OutputBase

Highlighting Hits

Most commonly, FileConverter is used after a search to highlight hits in a retrieved document. To highlight hits in a document, FileConverter needs:

  1. The input document.
  2. The word offsets of the hits returned from search results.
  3. The location of the alphabet file to use for word breaking.
  4. The location of the index the document was found in.
  5. The document id of the document in the index.
  6. The output format (HTML, RTF, XML, or text).
  7. Tags to insert around each hit.
  8. The location of the output to create.

The first five items all come from the SearchResults object that holds the results of the search, so you can set them in a single step by calling FileConverter.SetInputItem() with the SearchResults object and the ordinal of the document to select.

SetInputItem will set InputFile, InputTypeId, InputDocId, Hits, AlphabetLocation, and IndexRetrievedFrom. If the index was built with caching of documents, SetInputItem will also set up FileConverter to retrieve the cached version of the document from the index. 
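The highlighting steps listed above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper class and method names are hypothetical, and the OutputFormat, OutputFormats.itHTML, OutputToString, OutputString, and Execute members are assumed here rather than described in this topic, so check them against the OutputBase and FileConverter member lists.

```csharp
using dtSearch.Engine;

public static class HitHighlighter   // hypothetical helper, not part of dtSearch
{
    // Converts one search result to HTML with each hit wrapped in <b> tags.
    // 'results' is assumed to come from a completed search; 'ordinal' selects the document.
    public static string ToHighlightedHtml(SearchResults results, int ordinal)
    {
        using (FileConverter converter = new FileConverter())
        {
            // Items 1-5: input document, hit offsets, alphabet, index, and document id.
            converter.SetInputItem(results, ordinal);

            // Item 6: output format (enum member name is an assumption).
            converter.OutputFormat = OutputFormats.itHTML;

            // Item 7: tags to insert around each hit.
            converter.BeforeHit = "<b>";
            converter.AfterHit = "</b>";

            // Item 8: keep the output in memory instead of writing a file
            // (property names are assumptions).
            converter.OutputToString = true;

            converter.Execute();
            return converter.OutputString;
        }
    }
}
```

Returning the output as a string suits web applications that send the converted document straight to a browser; writing to a file is the alternative when the output is large.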

Conversion Input

The document data to convert can consist of one binary document file, such as a Word document, and any number of field-value pairs in InputFields. InputText can be used to provide additional text to include in the converted output. 

You can pass the binary document to FileConverter in several ways:

  • To get the document from a disk file, set InputFile to the name of the file.
  • To pass the document as a stream of bytes, set InputBytes to an array of bytes containing the document data.
  • To pass the document as a .NET Stream object (such as a FileStream), set InputStream to the Stream object to use.

InputText and InputFields may only contain plain text. If HTML, RTF, or other text-like document data is passed in InputText, the HTML or RTF tags will be interpreted as text and included in the conversion output. 

InputFile must be an accessible disk file. UNC paths will work, provided that the network resource can be accessed, but HTTP paths will not. To convert data accessed by HTTP, download the data to a memory buffer and supply it in InputBytes or InputStream.

Even when InputBytes or InputStream is used, a filename should be provided in InputFile if possible to tell dtSearch the original filename extension, which can provide useful information about the document format. 
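As an illustration of the in-memory input path, the sketch below downloads a document over HTTP, passes it in InputBytes, and sets InputFile only as a filename hint. The class and method names are hypothetical, and the output members (OutputFormat, OutputFormats.itHTML, OutputToString, OutputString, Execute) are assumptions not described in this topic.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using dtSearch.Engine;

public static class RemoteDocumentConverter   // hypothetical helper
{
    private static readonly HttpClient Http = new HttpClient();

    // Downloads the document at 'documentUrl' and converts it to HTML.
    public static async Task<string> ConvertFromUrlAsync(Uri documentUrl)
    {
        // FileConverter cannot read HTTP paths, so fetch the data into memory first.
        byte[] data = await Http.GetByteArrayAsync(documentUrl);

        using (FileConverter converter = new FileConverter())
        {
            converter.InputBytes = data;

            // Filename hint only: gives dtSearch the original extension.
            // This is empty if the URL path has no file name component.
            converter.InputFile = Path.GetFileName(documentUrl.LocalPath);

            // Output settings below use assumed member names.
            converter.OutputFormat = OutputFormats.itHTML;
            converter.OutputToString = true;

            converter.Execute();
            return converter.OutputString;
        }
    }
}
```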

Cached documents 

When you build an index, you can request that the documents be cached in the index, in which case dtSearch will zip-compress each document and store it in the index folder. This can be done with any type of indexed data, including dynamically generated data returned through the DataSource API. To have FileConverter use the cached document as input, use SetInputItem to set up FileConverter as described above, and set the flag dtsConvertGetFromCache in FileConverter.Flags.
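A minimal sketch of selecting the cached copy is shown below. It assumes that Flags is of an enum type named ConvertFlags that exposes dtsConvertGetFromCache as a member; the helper class and method names are hypothetical.

```csharp
using dtSearch.Engine;

public static class CachedInput   // hypothetical helper
{
    // Points 'converter' at the cached copy of the selected search result.
    // The caller creates the converter and remains responsible for disposing it.
    public static void UseCachedDocument(FileConverter converter, SearchResults results, int ordinal)
    {
        converter.SetInputItem(results, ordinal);

        // Add the cache flag without clearing flags that are already set.
        // ConvertFlags is an assumed enum type name.
        converter.Flags |= ConvertFlags.dtsConvertGetFromCache;
    }
}
```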

DataSource input 

If the original data was indexed using the DataSource indexing API, then to highlight hits set InputBytes, InputFields, and InputText to the same values that were returned from the data source as DocBytes, DocFields, and DocText when the document was indexed. Alternatively, you can build the index with caching of documents enabled, and then use the cached document to highlight hits (see above).

Conversion Output

The BeforeHit and AfterHit markers are inserted before and after each hit word. The BeforeHit and AfterHit markers can contain hypertext links or other HTML tags. To facilitate creation of hit navigation markers, the strings "%%ThisHit%%", "%%NextHit%%", and "%%PrevHit%%" will be replaced with ordinals representing the current hit, the next hit, and the previous hit in the document. 
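For example, hit navigation links could be built from these placeholders along the following lines. This is only a sketch: the "hit" anchor prefix, the link text, and the helper names are hypothetical choices, and how the previous and next ordinals behave at the first and last hit is not covered here.

```csharp
using dtSearch.Engine;

public static class HitNavigation   // hypothetical helper
{
    // Wraps each hit in <b> tags, gives it an anchor, and adds links
    // to the previous and next hits in the document.
    public static void AddHitNavigation(FileConverter converter)
    {
        converter.BeforeHit =
            "<a id=\"hit%%ThisHit%%\"></a>" +            // anchor for this hit
            "<a href=\"#hit%%PrevHit%%\">&lt;</a><b>";   // link back to the previous hit

        converter.AfterHit =
            "</b><a href=\"#hit%%NextHit%%\">&gt;</a>";  // link forward to the next hit
    }
}
```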

For more information on conversion output options, see: 

Highlighting hits - overview 

Conversion output formatting 

Recommended Flags for HTML Output

Set dtsConvertAutoUpdateSearch to have dtSearch automatically correct out-of-date hit highlighting information. 

Set dtsConvertRemoveScripts to disable JavaScript in HTML input documents. 

Set dtsConvertUseStyles to have CSS styles included in output, and add a style sheet based on the dtSearch DocStyles.css file to specify the appearance of each style.
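The three flags can be combined in a single assignment, as in the sketch below. It assumes that Flags is of an enum type named ConvertFlags with these values as members; the helper class and method names are hypothetical.

```csharp
using dtSearch.Engine;

public static class HtmlOutputFlags   // hypothetical helper
{
    // Adds the flags recommended for HTML output, keeping any flags already set.
    public static void ApplyRecommendedHtmlFlags(FileConverter converter)
    {
        converter.Flags |= ConvertFlags.dtsConvertAutoUpdateSearch   // correct out-of-date hit information
                         | ConvertFlags.dtsConvertRemoveScripts      // disable JavaScript in HTML input
                         | ConvertFlags.dtsConvertUseStyles;         // include CSS styles in the output
    }
}
```

With dtsConvertUseStyles set, the page that displays the output still needs a style sheet based on DocStyles.css for the styles to take effect.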

IDisposable

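FileConverter requires the IDisposable pattern: call Dispose() when you have finished with the object, or create it in a using statement, so that the resources it holds are released promptly.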

See also

Highlighting hits - overview 

Caching documents