Stemming Rules

Stemming rules vary from one language to another.  dtSearch includes a set of stemming rules designed to work with English.  These rules are in the file stemming.dat, which is installed to the "data" folder under the dtSearch program folder.  If you need to implement stemming for a different language, or you want to modify the English stemming rules, you can create a new set of stemming rules to be used in place of stemming.dat.

Stemming rules consist of a series of lines like this:

3+ies -> y

4+ing ->

The first rule converts any word that has three or more letters followed by "ies" into the same initial letters followed by "y".  For example, "applies" becomes "apply".

The second rule removes the "ing" from any word that has four or more letters followed by "ing".  For example, "fishing" becomes "fish", but "sing" does not change because only one letter precedes the suffix.

In general, a rule consists of: a minimum number of letters (not including the suffix), a + sign, a suffix to be removed, an arrow (->) and the replacement for the suffix, if any.  Stemming rules must use lower-case letters only.
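
The rule format described above can be illustrated with a short Python sketch that splits one rule line into its three parts.  This is not dtSearch code: the names Rule and parse_rule are hypothetical, and the handling of whitespace around the arrow is an assumption rather than documented behavior of stemming.dat.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    min_letters: int   # letters required before the suffix
    suffix: str        # suffix to be removed
    replacement: str   # text that replaces the suffix (may be empty)


# Assumed layout of one rule line: <min>+<suffix> -> <replacement>
RULE_PATTERN = re.compile(r"^\s*(\d+)\+(\S+)\s*->\s*(\S*)\s*$")


def parse_rule(line: str) -> Rule:
    """Split a rule line into minimum length, suffix, and replacement."""
    match = RULE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid stemming rule: {line!r}")
    min_letters, suffix, replacement = match.groups()
    # Rules must use lower-case letters only.
    if suffix != suffix.lower() or replacement != replacement.lower():
        raise ValueError(f"rule must be lower case: {line!r}")
    return Rule(int(min_letters), suffix, replacement)


print(parse_rule("3+ies -> y"))  # Rule(min_letters=3, suffix='ies', replacement='y')
print(parse_rule("4+ing ->"))    # Rule(min_letters=4, suffix='ing', replacement='')
```

A rule with nothing after the arrow parses to an empty replacement, which corresponds to simply removing the suffix.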

When stemming a word, dtSearch looks at each rule in order until it finds one that applies.  If it finds one, dtSearch applies that rule and then starts over from the first rule, repeating the process until the word no longer changes.  The result is the "stem" of the original word.
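
The procedure above can be sketched in Python as follows.  This is a minimal illustration of the described behavior, not the dtSearch implementation: the function names, the tuple representation of rules, the lower-casing of the input, and the max_passes safety limit are all assumptions made for the example.

```python
# Each rule is (minimum letters before the suffix, suffix, replacement).
RULES = [
    (3, "ies", "y"),
    (4, "ing", ""),
]


def apply_first_match(word, rules):
    """Apply the first rule that matches, or return the word unchanged."""
    for min_letters, suffix, replacement in rules:
        base = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(base) >= min_letters:
            return base + replacement
    return word


def stem(word, rules, max_passes=100):
    """Start over after each applied rule until the word stops changing."""
    word = word.lower()  # assumption: words are compared in lower case
    for _ in range(max_passes):  # guard against rule sets that never settle
        result = apply_first_match(word, rules)
        if result == word:
            break
        word = result
    return word


print(stem("applies", RULES))  # apply
print(stem("fishing", RULES))  # fish
print(stem("sing", RULES))     # sing
```

The word "sing" is left alone because only one letter precedes "ing", which is fewer than the four the rule requires.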

Sometimes you may want to create a rule with an exception.  For example, suppose you want to remove a trailing "s" in a word, unless the word ends in "ss".  To do this, you would use these two rules:

3+ss -> ss

3+s -> 

If a word ends in "ss", dtSearch will never get past the first rule and will give up stemming the word because the rule "3+ss -> ss" does not change the word.  Only words not ending in "ss" will get to the next rule, which removes the trailing "s".
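
To see why the exception rule matters, the following Python sketch compares the two-rule set with a set that contains only the "s" rule.  The stem function is a hypothetical stand-in for the matching behavior described in this topic, not dtSearch code, and the max_passes limit is an added safety assumption.

```python
def stem(word, rules, max_passes=100):
    """Apply the first matching rule repeatedly until the word stops changing."""
    for _ in range(max_passes):
        for min_letters, suffix, replacement in rules:
            base = word[: len(word) - len(suffix)]
            if word.endswith(suffix) and len(base) >= min_letters:
                new_word = base + replacement
                break
        else:
            return word  # no rule applies
        if new_word == word:
            return word  # the matching rule left the word unchanged
        word = new_word
    return word


WITH_EXCEPTION = [(3, "ss", "ss"), (3, "s", "")]
WITHOUT_EXCEPTION = [(3, "s", "")]

print(stem("address", WITH_EXCEPTION))     # address
print(stem("books", WITH_EXCEPTION))       # book
print(stem("address", WITHOUT_EXCEPTION))  # addre
```

Without the first rule, "address" loses one "s" on each pass and ends up as "addre".  The order of the two rules also matters: if "3+s ->" came first, it would match words ending in "ss" before the exception was ever reached.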

To help you customize stemming rules, dtSearch includes the STEMTEST utility.  STEMTEST lets you try out your stemming rules by entering words and seeing the resulting stems.