Benefits of ICU Integration
Remarks
Beginning with version 7.93, dtSearch can integrate with ICU to enhance text processing in the dtSearch Engine. Benefits of ICU integration include:
- A new accent-optional index type is supported, and can be enabled using the flag dtsIndexCreateOptionalAccentSensitive. In an accent-optional index, accented letters can be made significant for matching purposes, but unaccented letters will still always match both accented and unaccented forms. For example, a search for "abc" will find both "abc" and "äbc". A search for "äbc" will find different results depending on whether the flag dtsSearchRequireAccents is set in the SearchJob. If dtsSearchRequireAccents is set, then "äbc" will match "äbc" and will not match "abc". If dtsSearchRequireAccents is not set, then "äbc" will match both "äbc" and "abc".
- Support for single-word token characters is enabled separately from the dtsoTfAutoBreakCJK flag. A single-word token character is a character that is always indexed as a separate word. This can be used to make characters such as emojis or currency symbols searchable.
- Unicode character properties such as case and compatibility mapping are supported for 32-bit characters in the Supplementary Multilingual Plane (U+10000-U+1FFFF) and the Supplementary Ideographic Plane (U+20000-U+2FFFF).
- Developers can deploy any compatible version of the ICU library, providing the option to upgrade or modify character handling. Because ICU is open source, developers can create custom versions of the ICU libraries to address specialized requirements.
- ICU has automatic encoding detection that can identify the encoding of ambiguous files, such as plain text files that have no byte order mark or encoding name.
- In an accent-insensitive index, characters defined as having "compatible" equivalence in the Unicode Standard will be treated as equivalent for searching, and diacriticals are removed. Examples:
| Character | Mapping |
| --- | --- |
| ⼏ U+2f0f (kangxi radical table) | 几 U+51e0 (cjk unified ideograph-51e0) |
| が U+304c (hiragana letter ga) | か U+304b (hiragana letter ka) |
| ガ U+30ac (katakana letter ga) | カ U+30ab (katakana letter ka) |
| ㄱ U+3131 (hangul letter kiyeok) | ᄀ U+1100 (hangul choseong kiyeok) |
| ㊌ U+328c (circled ideograph water) | 水 U+6c34 (cjk unified ideograph-6c34) |
| 論 U+f941 (cjk compatibility ideograph-f941) | 論 U+8ad6 (cjk unified ideograph-8ad6) |
| 🈐 U+1f210 (squared cjk unified ideograph-624b) | 手 U+624b (cjk unified ideograph-624b) |
| ﮎ U+fb8e (arabic letter keheh isolated form) | ک U+06a9 (arabic letter keheh) |
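
The example mappings above can be approximated outside dtSearch with Unicode compatibility decomposition (NFKD) followed by removal of combining marks. The Python sketch below uses only the standard unicodedata module. The function name fold_for_search is hypothetical, and it is an assumption that this two-step folding corresponds to what the dtSearch Engine does through ICU; it is offered only as a quick way to check how a given character is likely to fold.

```python
import unicodedata


def fold_for_search(text):
    """Approximate accent-insensitive folding (hypothetical helper).

    Step 1: NFKD applies canonical and compatibility decomposition.
    Step 2: combining marks (non-zero combining class) are dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# (character, expected mapping) pairs, written as escapes so the code points
# are unambiguous. They are the same pairs listed in the examples above.
EXAMPLES = [
    ("\u2f0f", "\u51e0"),      # kangxi radical table -> unified ideograph
    ("\u304c", "\u304b"),      # hiragana ga -> hiragana ka
    ("\u30ac", "\u30ab"),      # katakana ga -> katakana ka
    ("\u3131", "\u1100"),      # hangul letter kiyeok -> choseong kiyeok
    ("\u328c", "\u6c34"),      # circled ideograph water -> water
    ("\uf941", "\u8ad6"),      # compatibility ideograph -> unified ideograph
    ("\U0001f210", "\u624b"),  # squared ideograph -> unified ideograph
    ("\ufb8e", "\u06a9"),      # keheh isolated form -> keheh
]

for source, expected in EXAMPLES:
    folded = fold_for_search(source)
    # Print code points only, so the output is safe on any terminal encoding.
    print(f"U+{ord(source):04X} -> U+{ord(folded):04X}", folded == expected)
```

Running the loop is a simple way to confirm that a character of interest folds to the form you expect before relying on it in a search request.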
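
The accent-optional matching rules described in the list above can be modeled in a few lines of Python. This is a sketch of the documented behavior only, not dtSearch code: the names clusters and term_matches are hypothetical, the require_accents parameter stands in for the dtsSearchRequireAccents flag, and comparing letter by letter is an assumption about how mixed words are handled. Case folding is left out to keep the example short.

```python
import unicodedata


def clusters(word):
    """Split a word into (base, marks) pairs after canonical decomposition."""
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            base, marks = result[-1]
            result[-1] = (base, marks + ch)  # attach the mark to its base letter
        else:
            result.append((ch, ""))
    return result


def term_matches(query, indexed, require_accents=False):
    """Model of accent-optional matching (hypothetical, not the dtSearch API).

    An unaccented query letter matches accented and unaccented forms.
    An accented query letter must match exactly only when require_accents
    is true; otherwise its accent is ignored.
    """
    q, d = clusters(query), clusters(indexed)
    if len(q) != len(d):
        return False
    for (q_base, q_marks), (d_base, d_marks) in zip(q, d):
        if q_base != d_base:
            return False
        if require_accents and q_marks and q_marks != d_marks:
            return False
    return True


ACCENTED = "\u00e4bc"  # "äbc" written with an escape for clarity

print(term_matches("abc", ACCENTED))                           # unaccented query
print(term_matches(ACCENTED, "abc", require_accents=True))     # accents required
print(term_matches(ACCENTED, "abc", require_accents=False))    # accents optional
```

The three calls correspond to the three cases in the description: an unaccented search term, an accented term with accents required, and an accented term with accents optional.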