dtSearch Text Retrieval Engine Programmer's Reference 7.86
Recognition of Dates, Email Addresses, and Credit Card Numbers

dtSearch includes an option to automatically recognize dates, email addresses, and credit card numbers in text during indexing.

Remarks

Dates 

Date recognition looks for anything that appears to be a date, using English-language months (including common abbreviations) and numerical formats. Examples of date formats that are recognized include:

January 15, 2006
15 Jan 06
15 Jan 2006
1 January 2006
01 January 2006
2006/01/15
1/15/06
1-15-06
The fifteenth of January, two thousand six

 

To search for a date, put "date()" around the date expression or range. For example, to find any of the expressions above near the word "apple", search for:

    date(jan 15 2006) w/10 apple

 

To search for a range of dates near the word "apple", search for:

    date(jan 10 2006 to jan 20 2006) w/10 apple

 

A field search for a date expression is written like a field search for a word:

    DateField contains date(jan 10 2006 to jan 20 2006)

 

Unterminated ranges are not supported. To search for any date after or before a particular date, enter a bounded range that uses the maximum or minimum year as the other bound. The maximum value for a year is 2900, and the minimum value is 1000. Example:

    DateField contains date(jan 10 2006 to jan 1 2900)
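
As a minimal sketch of this workaround, the Python helpers below build bounded date() expressions that stand in for open-ended ranges. They only assemble the search request string and do not call the dtSearch API. The names date_from and date_until are hypothetical, and the choice of "jan 1" as the day for the outer bound simply follows the example above.

```python
from datetime import date

# Year limits stated in the text above.
MIN_YEAR = 1000
MAX_YEAR = 2900

_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]


def _fmt(d: date) -> str:
    """Format a date the way the examples above do, e.g. 'jan 10 2006'."""
    return f"{_MONTHS[d.month - 1]} {d.day} {d.year}"


def date_from(start: date) -> str:
    """Range from `start` up to the maximum supported year (hypothetical helper)."""
    return f"date({_fmt(start)} to jan 1 {MAX_YEAR})"


def date_until(end: date) -> str:
    """Range from the minimum supported year up to `end` (hypothetical helper)."""
    return f"date(jan 1 {MIN_YEAR} to {_fmt(end)})"


def field_request(field: str, expression: str) -> str:
    """Wrap an expression in a field search, as in the examples above."""
    return f"{field} contains {expression}"


request = field_request("DateField", date_from(date(2006, 1, 10)))
print(request)
```

The resulting string can be passed wherever a search request is accepted. Whether the range endpoints are inclusive is not stated here, so the sketch makes no assumption about that.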

 

Ambiguous date expressions like 01/02/03 are presumed to be MM/DD/YY. To change this presumption, use the TextFlags values dtsoTfRecognizeDatesPresumeDMY or dtsoTfRecognizeDatesPresumeYMD to specify an alternative presumption. 
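
To illustrate how much the presumption matters, the following Python sketch interprets the same three-part expression under each of the three orderings. It is an illustration only and not dtSearch's parser. The function name interpret is hypothetical, and mapping two-digit years to 20YY is an assumption made to keep the example short.

```python
from datetime import date


def interpret(text: str, order: str = "MDY") -> date:
    """Interpret 'a/b/c' under the given field order (illustrative only)."""
    a, b, c = (int(part) for part in text.split("/"))
    if order == "MDY":
        month, day, year = a, b, c
    elif order == "DMY":
        day, month, year = a, b, c
    elif order == "YMD":
        year, month, day = a, b, c
    else:
        raise ValueError(f"unknown order: {order}")
    # Assumption for this sketch: two-digit years mean 20YY.
    if year < 100:
        year += 2000
    return date(year, month, day)


for order in ("MDY", "DMY", "YMD"):
    print(order, interpret("01/02/03", order).isoformat())
```

Under these assumptions the one expression yields three different dates (2 January 2003, 1 February 2003, and 3 February 2001), so the presumption should match the convention used in the documents being indexed.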

 

Email Addresses 

Email address recognition looks for text that follows the syntax for a valid email address (example: sales@dtsearch.com). This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address. Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a document collection. 

To search for an email address, put "mail()" around the address. The * and ? wildcard expressions are supported inside the () marks. Examples:

    mail(sales@dtsearch.com)
    mail(s*@dtsearch.com)
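
The effect of the wildcards can be sketched outside dtSearch by translating the pattern inside mail() into a regular expression. The Python below is an illustration only: the function names are hypothetical, and case-insensitive matching is an assumption of the sketch, not something this section states.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate * (any run of characters) and ? (any single character)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Escape everything else so @ and . are matched literally.
            parts.append(re.escape(ch))
    # Assumption: addresses are compared without regard to case.
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, address: str) -> bool:
    """True if the whole address matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(address) is not None


print(matches("s*@dtsearch.com", "sales@dtsearch.com"))
print(matches("s*@dtsearch.com", "info@dtsearch.com"))
```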

 

Credit Card Numbers 

Credit card number recognition looks for any sequence of numbers that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers. Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number. Examples:

    1234-5678-1234-5678
    1234567812345678
    1234 5678 1234 5678

Numerical tests used by the credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers. However, these tests are not perfect and so the credit card number recognition feature may pick up some numbers that are not really credit card numbers. 
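
This section does not name the numerical tests. The best-known issuer check is the Luhn checksum, so the Python sketch below assumes that is one of them; dtSearch may apply different or additional criteria, such as issuer prefixes or length rules. The function names are hypothetical.

```python
import string


def digits_only(text: str) -> str:
    """Drop spaces and punctuation so all formats compare equal."""
    return "".join(ch for ch in text if ch in string.digits)


def luhn_valid(text: str) -> bool:
    """Return True if the digits in `text` pass the Luhn checksum."""
    digits = [int(ch) for ch in digits_only(text)]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# A widely published test number passes in any punctuation pattern.
print(luhn_valid("4111-1111-1111-1111"), luhn_valid("4111 1111 1111 1111"))
# The placeholder digits used in the formatting examples above do not.
print(luhn_valid("1234-5678-1234-5678"))
# A checksum alone accepts sequences that are not card numbers.
print(luhn_valid("0000 0000 0000 0000"))
```

The last call shows why a checksum alone is an imperfect filter: some digit sequences that are not card numbers still pass it.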

To search for a credit card number, put "creditcard()" around the number. Example: 

    creditcard(1234*)

Other numerical patterns 

To search for other numerical patterns such as social security numbers, you can use the = wildcard, which matches any single digit. For example, if hyphens are indexed as spaces, then the following search request would find U.S. social security numbers:

    === == ====

The = wildcard is disabled by default in the dtSearch Engine. To enable it, set Options.matchDigitChar to "=".
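
The matching rule for the = wildcard can be sketched in Python as follows. This is an illustration of the behavior described above and not dtSearch code: the function names are hypothetical, and replacing hyphens with spaces stands in for an index in which hyphens are treated as spaces.

```python
import string


def word_matches(pattern_word: str, word: str) -> bool:
    """Each = must line up with exactly one digit; other characters match literally."""
    if len(pattern_word) != len(word):
        return False
    return all(
        (w in string.digits) if p == "=" else (p == w)
        for p, w in zip(pattern_word, word)
    )


def find_phrase(pattern: str, text: str) -> list:
    """Return every run of consecutive words that matches the pattern."""
    pattern_words = pattern.split()
    # Assumption for this sketch: hyphens are indexed as spaces.
    words = text.replace("-", " ").split()
    hits = []
    for start in range(len(words) - len(pattern_words) + 1):
        window = words[start:start + len(pattern_words)]
        if all(word_matches(p, w) for p, w in zip(pattern_words, window)):
            hits.append(" ".join(window))
    return hits


print(find_phrase("=== == ====", "SSN 123-45-6789 on file"))
print(find_phrase("=== == ====", "phone 555-1234"))
```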

Enabling automatic recognition of dates, email addresses, and credit card numbers 

In dtSearch Desktop, click Options > Preferences > Indexing Options, and check the box to "Automatically recognize dates in text." 

In the dtSearch Engine API, set the flag dtsoTfRecognizeDates in Options.TextFlags.

Recognition of dates, email addresses, and credit card numbers cannot be controlled separately; the option above turns all three on or off together.

Word lists 

To list all dates, credit card numbers or email addresses in an index, you can use the word listing functions in dtSearch Desktop (Index > List Index Contents...). In the dtSearch Engine API, you can use ListIndexJob (.NET) or DListIndexJob (C++). 

The same syntax used in search requests works in the listing functions, so if you generate a list using "creditcard(*)", you will get a list of all credit card numbers in the index. 

Effect on performance 

Indexing will be substantially slower with the recognition feature enabled. 

Searching for dates, email addresses, and credit card numbers can be substantially faster because you can search for a single unique expression instead of having to search for many different variations. For example, a single search for:

    creditcard(1234123412341234)

will find that credit card number regardless of the presence of spaces or punctuation between the numbers. To cover just the most common variations on credit card number formats would require a much more complex search request that would take more processing time. Similarly, it will be much faster to search for:

    date(January 15, 2005)

than to search for the many ways this date could be expressed in text.

Copyright (c) 1995-2016 dtSearch Corp. All rights reserved.