Regular Expressions
Regular expression searching provides a way to search for advanced combinations of characters. A regular expression included in a search request must be quoted and must begin with ##.
Examples:
Apple and "##199[0-9]"
Apple and "##19[0-9]+"
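To make the difference between the two sample patterns concrete, here is a minimal sketch that tests the regular-expression part of each request (without the leading ##) against a handful of words. It uses Python's `re` module as a stand-in for the dtSearch engine, and the word list is invented for illustration.

```python
import re

# Regex portions of the two sample requests, without the leading ## marker.
FIXED_DECADE = "199[0-9]"   # "199" followed by exactly one digit
OPEN_ENDED = "19[0-9]+"     # "19" followed by one or more digits

# Invented sample words; a real index would supply its own word list.
words = ["1990", "1999", "19", "190", "1900", "19999", "2001", "199x"]

def matching_words(pattern, candidates):
    """Return the candidates that the pattern matches as whole words."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(matching_words(FIXED_DECADE, words))  # ['1990', '1999']
print(matching_words(OPEN_ENDED, words))    # ['1990', '1999', '190', '1900', '19999']
```

The first pattern only accepts four-character words from 1990 to 1999, while the second accepts any word made of "19" plus at least one more digit.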
In addition to searching, dtSearch can use regular expressions in File Segmentation and Text Fields rules.
Special characters in a regular expression are:
| Regular expression | Effect |
|---|---|
| . (period) | Matches any single character. Example: "sampl." would match "sample" or "samplZ" |
| \ | Treat next character literally. Example: in "\$100", the \ indicates that the pattern is "$100", not end-of-line ($) followed by "100" |
| [abc] | Brackets indicate a set of characters, one of which must be present. For example, "sampl[ae]" would match "sample" or "sampla", but not "samplx" |
| [a-z] | Inside brackets, a dash indicates a range of characters. For example, "[a-z]" matches any single lower-case letter. |
| [^a-z] | Indicates any character except the ones in the bracketed range. |
| .* (period, asterisk) | An asterisk means "0 or more" of something, so .* would match any string of characters, or nothing |
| .+ (period, plus) | A plus means "1 or more" of something, so .+ would match any string of at least one character |
| [a-z]+ | Any sequence of one or more lower-case letters. |

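The special characters listed above can be tried out with a short script. This is only a sketch: it uses Python's `re` module rather than the engine dtSearch uses, and the helper name `whole_word_match` is made up for the example. For these basic constructs the two should behave the same.

```python
import re

def whole_word_match(pattern, word):
    """True if the pattern matches the entire word, not just part of it."""
    return re.fullmatch(pattern, word) is not None

# (pattern, word, expected result) for each construct described above.
checks = [
    ("sampl.",    "sample", True),   # . matches any single character
    ("sampl.",    "samplZ", True),
    (r"\$100",    "$100",   True),   # \ makes the $ a literal character
    ("sampl[ae]", "sampla", True),   # one character from the set
    ("sampl[ae]", "samplx", False),
    ("[a-z]",     "q",      True),   # range of characters
    ("[^a-z]",    "Q",      True),   # anything outside the range
    ("[^a-z]",    "q",      False),
    (".*",        "",       True),   # 0 or more characters
    (".+",        "",       False),  # at least one character required
    ("[a-z]+",    "apple",  True),   # one or more lower-case letters
    ("[a-z]+",    "Apple",  False),  # upper-case A is outside [a-z]
]

for pattern, word, expected in checks:
    assert whole_word_match(pattern, word) == expected, (pattern, word)
print("all checks passed")
```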
dtSearch uses the TR1 implementation of regular expressions, which provides many capabilities beyond what is described above.
Limitations
(1) A regular expression must match a single whole word. For example, a search for "##app.*ie" would not find "apple pie".
(2) Only letters are searchable. Characters that are not indexed as letters are not searchable even using regular expressions, because the index does not contain any information about them.
(3) Because the dtSearch index does not store information about line breaks, searches that include beginning-of-line or end-of-line regular expression criteria (^ and $) will not work.
(4) No case or other conversion is done on regular expressions, so a regular expression must match the case of the information stored in the index. If an index is case-insensitive, all letters in the regular expression must be lower-case. If a character is not searchable in the index, then it cannot be included as a searchable character in the regular expression. Non-searchable characters in a regular expression are not ignored as they are in other search expressions.
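Limitations (1) and (4) can be illustrated with a toy model of an index. The sketch below is an assumption-laden simplification, not a description of how dtSearch builds its index: `build_index_words` is a hypothetical helper that splits text into words and lower-cases them to mimic a case-insensitive index, and Python's `re` module stands in for the real engine.

```python
import re

def build_index_words(text, case_sensitive=False):
    """Hypothetical, simplified indexer: one entry per run of letters or digits."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return words if case_sensitive else [w.lower() for w in words]

def regex_search(pattern, index_words):
    """Return indexed words that the pattern matches as a single whole word."""
    return [w for w in index_words if re.fullmatch(pattern, w)]

index_words = build_index_words("Apple pie and applesauce")
print(index_words)                           # ['apple', 'pie', 'and', 'applesauce']

# Limitation (1): the pattern cannot span the two words "apple" and "pie".
print(regex_search("app.*ie", index_words))  # []

# Limitation (4): the index holds lower-case words, so the pattern must too.
print(regex_search("Appl.*", index_words))   # []
print(regex_search("appl.*", index_words))   # ['apple', 'applesauce']
```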
Performance
A regular expression is like the * wildcard character in its effect on search speed: the closer to the front of a word the expression is, the more it will slow searching. "appl.*" will be nearly as fast as "apple", while ".*pple" will be much slower.
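One way to picture why the position of the expression matters is to imagine the index as a sorted word list. This is an assumption made for illustration, not a statement about dtSearch internals. With a literal prefix, only the words that share the prefix need to be tested; with a leading wildcard, every word does. The helper names below are hypothetical.

```python
import bisect
import re

def literal_prefix(pattern):
    """Leading run of plain letters/digits that every match must start with."""
    if "|" in pattern:                      # alternation: no common prefix assumed
        return ""
    prefix = re.match(r"[A-Za-z0-9]*", pattern).group(0)
    rest = pattern[len(prefix):]
    if prefix and rest[:1] in ("*", "+", "?", "{"):
        prefix = prefix[:-1]                # last character is quantified, drop it
    return prefix

def search(pattern, sorted_words):
    """Return (matching words, number of words that had to be tested)."""
    prefix = literal_prefix(pattern)
    start = bisect.bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    candidates = sorted_words[start:end]
    matches = [w for w in candidates if re.fullmatch(pattern, w)]
    return matches, len(candidates)

# Invented word list standing in for an index.
words = sorted(["apple", "apples", "application", "banana",
                "grapple", "maple", "pineapple", "zebra"])

print(search("appl.*", words))  # (['apple', 'apples', 'application'], 3)
print(search(".*pple", words))  # (['apple', 'grapple', 'pineapple'], 8)
```

In this model "appl.*" tests three words, while ".*pple" has to test all eight.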
Searching for numbers
The = wildcard, which matches a single digit, is faster than regular expressions for matching patterns of numbers. For example, to search for a social security number, you could use "=== == ====" instead of the equivalent regular expression.
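For comparison, the sketch below spells out what each = stands for in regular-expression terms. Because a regular expression must match a single whole word, it produces one pattern per word of the request. The function name `equals_wildcard_to_regexes` is hypothetical, and Python's `re` module is used only to check the result.

```python
import re

def equals_wildcard_to_regexes(request):
    """Translate each word of an = wildcard request into a per-word regex."""
    return [
        "".join("[0-9]" if ch == "=" else re.escape(ch) for ch in word)
        for word in request.split()
    ]

patterns = equals_wildcard_to_regexes("=== == ====")
print(patterns)
# ['[0-9][0-9][0-9]', '[0-9][0-9]', '[0-9][0-9][0-9][0-9]']

# Each pattern matches one word of a number written as three separate groups.
sample_words = ["123", "45", "6789"]
print(all(re.fullmatch(p, w) for p, w in zip(patterns, sample_words)))  # True
```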