dtSearch includes an option to automatically recognize dates, email addresses, and credit card numbers in text during indexing.
Date recognition looks for anything that appears to be a date, using English-language months (including common abbreviations) and numerical formats. Examples of date formats that are recognized include:
To search for a date, put "date()" around the date expression or range. For example, to find any of the expressions above near the word "apple", search for:
To search for a range of dates near the word "apple", search for:
A field search for a date expression would be expressed like a field search for a word:
Unterminated ranges are not supported. To search for any date after or before a particular date, enter a bounded range that uses the maximum or minimum supported year as the open end. The maximum value for a year is 2900, and the minimum value is 1000. Example:
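A minimal Python sketch of building such a request string programmatically. The helper names are hypothetical, and the " to " separator between the two bounds is an assumption, since the range syntax is not spelled out in the text above. Whether the bounds are inclusive is also not stated here.

```python
from datetime import date

MIN_YEAR = 1000  # smallest supported year, per the text above
MAX_YEAR = 2900  # largest supported year, per the text above
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def fmt(d):
    """Format a date with an English month abbreviation to avoid MM/DD ambiguity."""
    return f"{MONTHS[d.month - 1]} {d.day} {d.year}"


def date_range(start, end, sep=" to "):
    """Wrap two bounds in date(). ASSUMPTION: `sep` is the range separator."""
    return f"date({fmt(start)}{sep}{fmt(end)})"


def open_ended_after(d):
    """Emulate 'any date from d onward' with the maximum year as upper bound."""
    return date_range(d, date(MAX_YEAR, 12, 31))


def open_ended_before(d):
    """Emulate 'any date up to d' with the minimum year as lower bound."""
    return date_range(date(MIN_YEAR, 1, 1), d)


if __name__ == "__main__":
    print(open_ended_after(date(2006, 1, 15)))
    print(open_ended_before(date(2006, 1, 15)))
```

The resulting string can be combined with other search terms in the same way as any other date() expression.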
Ambiguous date expressions like 01/02/03 are presumed to be MM/DD/YY. To change this presumption, use the TextFlags values dtsoTfRecognizeDatesPresumeDMY or dtsoTfRecognizeDatesPresumeYMD to specify an alternative presumption.
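To illustrate how the presumption changes the meaning of the same text, here is a small Python sketch. It is not dtSearch code: the function name, the "MDY"/"DMY"/"YMD" order strings, and the rule that two-digit years fall in the 2000s are all assumptions made for the example.

```python
from datetime import date


def resolve_ambiguous(text, order="MDY", century=2000):
    """Interpret an all-numeric date such as 01/02/03 under a given field order.

    order   -- "MDY" (the default presumption), "DMY", or "YMD"
    century -- ASSUMPTION: added to two-digit years; the real engine's
               pivot rule is not described in the text above.
    """
    if sorted(order) != ["D", "M", "Y"]:
        raise ValueError("order must be a permutation of 'MDY'")
    parts = [int(p) for p in text.replace("-", "/").replace(".", "/").split("/")]
    if len(parts) != 3:
        raise ValueError("expected three numeric fields")
    fields = dict(zip(order, parts))
    year = fields["Y"]
    if year < 100:
        year += century
    return date(year, fields["M"], fields["D"])


if __name__ == "__main__":
    for order in ("MDY", "DMY", "YMD"):
        print(order, resolve_ambiguous("01/02/03", order))
```

Under the three orders, 01/02/03 resolves to January 2, 2003; February 1, 2003; and February 3, 2001 respectively, which is why the presumption needs to match the documents being indexed.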
Email address recognition looks for text that follows the syntaxes commonly used for a valid email address (example: firstname.lastname@example.org). This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address. Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a document collection.
To search for an email address, put "mail()" around the address. The * and ? wildcard expressions are supported inside the () marks. Examples:
To avoid excessive false positives, the email detection algorithm does not attempt to detect every string that could possibly be an email address.
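The effect of the * and ? wildcards inside mail() can be pictured with the following Python sketch. It assumes that * matches any run of characters, that ? matches exactly one character, and that matching is case-insensitive; the function name is hypothetical and the code only mimics the matching, it does not call dtSearch.

```python
import re


def mail_pattern_matches(pattern, address):
    """Return True if `address` matches a wildcard pattern such as *@example.org.

    ASSUMPTIONS: '*' = any run of characters, '?' = exactly one character,
    comparison is case-insensitive.
    """
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append(".*")
        elif ch == "?":
            pieces.append(".")
        else:
            pieces.append(re.escape(ch))  # '.', '@', '+' etc. are literal
    return re.fullmatch("".join(pieces), address, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    print(mail_pattern_matches("*@example.org", "firstname.lastname@example.org"))
    print(mail_pattern_matches("sa?es@example.com", "sales@example.com"))
```

Because the whole address is treated as a single unit, the punctuation inside it is matched literally rather than being split according to the alphabet settings.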
Credit card numbers
Credit card number recognition looks for any sequence of numbers that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers. Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number. Examples:
Numerical tests used by the credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers. These tests are not perfect, however, so the credit card number recognition feature may pick up some numbers that are not really credit card numbers.
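The text does not say which numerical tests are applied. A widely used one is the Luhn checksum, so the Python sketch below uses it purely as an illustration of this kind of test; the 13 to 19 digit length limit is likewise an assumption, and the function name is hypothetical.

```python
def luhn_valid(text):
    """Check a candidate card number with the Luhn checksum.

    Spaces and punctuation are ignored, mirroring the idea that numbers are
    recognized regardless of embedded separators.
    ASSUMPTION: Luhn and the 13-19 digit length range stand in for the
    engine's actual tests, which the text above does not specify.
    """
    digits = [int(c) for c in text if "0" <= c <= "9"]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # common test number
    print(luhn_valid("4111-1111-1111-1112"))  # last digit altered
```

A checksum like this rejects most random digit sequences, but roughly one in ten arbitrary sequences of the right length still passes, which is consistent with the false positives mentioned above.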
To search for a credit card number, put "creditcard()" around the number. Example:
Other numerical patterns
To search for other numerical patterns such as social security numbers, you can use the = wildcard, which matches any single digit. For example, if hyphens are indexed as spaces, then the following search request would find U.S. social security numbers:
The = wildcard is disabled by default in the dtSearch Engine. To enable it, set Options.matchDigitChar to "=".
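As a rough picture of what a digit wildcard pattern matches, the Python sketch below translates a pattern into a regular expression. The pattern "=== == ====" is an illustrative guess at a social security number request, not a quotation from this document, and treating a space in the pattern as "space or hyphen" is a simplification of hyphens being indexed as spaces. The function name is hypothetical.

```python
import re


def digit_wildcard_to_regex(pattern, digit_char="="):
    """Translate a digit-wildcard pattern into a compiled regular expression.

    Each `digit_char` matches exactly one digit. A space in the pattern
    matches a space or a hyphen (ASSUMPTION: models hyphens indexed as spaces).
    """
    pieces = []
    for ch in pattern:
        if ch == digit_char:
            pieces.append("[0-9]")
        elif ch == " ":
            pieces.append("[ -]")
        else:
            pieces.append(re.escape(ch))
    # Word boundaries keep the pattern from matching inside longer numbers.
    return re.compile(r"\b" + "".join(pieces) + r"\b")


if __name__ == "__main__":
    ssn = digit_wildcard_to_regex("=== == ====")
    print(bool(ssn.search("SSN 123-45-6789 on file")))
    print(bool(ssn.search("order 12-345-6789")))
```

The first call matches because the digit groups are 3, 2, and 4 long; the second does not because the grouping differs.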
Enabling automatic recognition of dates, email addresses, and credit card numbers
In the dtSearch Engine API, set the flag dtsoTfRecognizeDates in Options.TextFlags.
This one flag turns on recognition of all three types. There is no option to separately control whether dates, email addresses, and credit card numbers are recognized.
To list all dates, credit card numbers, or email addresses in an index, use the word listing functions in dtSearch Desktop (Index > List Index Contents...). In the dtSearch Engine API, use ListIndexJob (.NET) or DListIndexJob (C++).
The same syntax used in search requests works in the listing functions, so if you generate a list using "creditcard(*)", you will get a list of all credit card numbers in the index.
Effect on performance
Indexing will be substantially slower with the recognition feature enabled.
Searching for dates, email addresses, and credit card numbers can be substantially faster because you can search for a single unique expression instead of having to search for many different variations. For example, a single search for:
will find that credit card number regardless of the presence of spaces or punctuation between the numbers. To cover just the most common variations on credit card number formats would require a much more complex search request that would take more processing time. Similarly, it will be much faster to search for:
than to search for the many ways this date could be expressed in text.