Should I stem or should I go?

Michele Riva

Algorithms

Apr 13, 2023

In the context of full-text search, stemming is a text processing technique used to improve the effectiveness of searching by reducing words to their base or root form, called the “stem.” This normalization helps overcome the issue of different word forms that arise due to inflections, conjugations, or derivations, allowing the search algorithm to match relevant documents more efficiently and accurately.

When a user submits a query, both the query terms and the text in the documents are stemmed. This allows the search algorithm to match the stemmed query terms with the stemmed words in the documents, improving the chances of retrieving relevant results even if the original query terms and the words in the documents have different forms.

For example, if a user searches for "running shoes", the search algorithm would stem the query terms to "run" and "shoe". The stemmed query would then be compared to the stemmed words in the documents, so the search can return results that contain other forms of the words, such as "run", "runs", or "shoes". Forms like "ran" or "runner" are a different matter: a suffix-stripping stemmer such as Porter's leaves both unchanged, so they would not match "run".
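To make the idea concrete, here is a minimal sketch of running the query and a document through the same analysis pipeline. The `tokenize`, `toyStem`, and `analyze` helpers are hypothetical names invented for this illustration, and `toyStem` is a deliberately tiny suffix stripper, not the Porter Stemmer or Orama's implementation.

```javascript
// Split text into lowercase alphabetic tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Toy stemmer (assumption: only handles "-ing" and plural "-s").
function toyStem(word) {
  if (word.length > 5 && word.endsWith('ing')) {
    let stem = word.slice(0, -3);
    // Undo consonant doubling: "runn" -> "run".
    if (/([^aeiou])\1$/.test(stem)) stem = stem.slice(0, -1);
    return stem;
  }
  if (word.length > 3 && word.endsWith('s') && !word.endsWith('ss')) {
    return word.slice(0, -1);
  }
  return word;
}

// The same pipeline is applied to queries and to documents.
function analyze(text) {
  return tokenize(text).map(toyStem);
}

const queryTerms = analyze('running shoes'); // ['run', 'shoe']
const docTerms = new Set(analyze('Shoes for every run'));

// The document matches when every stemmed query term appears in it.
const isMatch = queryTerms.every((term) => docTerms.has(term));
console.log(queryTerms, isMatch);
```

The important design point is that both sides go through `analyze`: stemming only one of them would make the terms impossible to compare.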

The Porter Stemmer

The Porter Stemmer is a widely used stemming algorithm developed by Martin Porter in 1980. It is a rule-based algorithm designed to reduce English words to their root form or stem by removing common suffixes.

Martin Porter taking a selfie

The algorithm works in a series of steps, called phases, with each phase consisting of a set of rules. The phases run in sequence, and within each group of rules at most one rule is applied: the one whose suffix is the longest match for the word.

The five main phases of the Porter Stemmer are:

  1. This phase deals with plurals and -ed or -ing suffixes. For example, “caresses” becomes “caress,” “ponies” becomes “poni,” and “hopping” becomes “hop.”

  2. This phase removes various suffixes, like -ational, -tional, -izer, -bli, and -alli, and replaces them with simpler forms. For example, “relational” becomes “relate,” and “digitizer” becomes “digitize.”

  3. This phase deals with suffixes like -icate, -ative, -alize, -iciti, -ical, -ful, and -ness. For example, “triplicate” becomes “triplic,” and “hopefulness” becomes “hopeful.”

  4. This phase removes certain suffixes, such as -al, -ance, -ence, -er, -ic, -able, -ible, -ant, -ement, -ment, -ent, -ion (when preceded by s or t), -ou, -ism, -ate, -iti, -ous, -ive, and -ize, but only if the remaining stem is long enough, that is, if its “measure” (the number of vowel-consonant sequences it contains) is greater than 1. For example, “revival” becomes “reviv,” and “adjustable” becomes “adjust.”

  5. This final phase removes a trailing -e and reduces a trailing double -ll to a single -l if certain conditions on the stem's measure are met. For example, “probate” becomes “probat,” and “rate” remains “rate.”
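The rules in these phases can be sketched in a few lines of code. The snippet below is a partial illustration, assuming lowercase ASCII input: it implements the "measure" used to decide whether a stem is long enough, plus one representative rule from phases 1, 4, and 5. The function names are made up for this example, and the real algorithm contains many more rules than shown here.

```javascript
// Porter treats "y" as a vowel only when it follows a consonant.
function isConsonant(word, i) {
  const ch = word[i];
  if ('aeiou'.includes(ch)) return false;
  if (ch === 'y') return i === 0 || !isConsonant(word, i - 1);
  return true;
}

// Measure m: how many times a vowel run is followed by a consonant run.
function measure(stem) {
  let m = 0;
  let prevWasVowel = false;
  for (let i = 0; i < stem.length; i++) {
    const consonant = isConsonant(stem, i);
    if (consonant && prevWasVowel) m++;
    prevWasVowel = !consonant;
  }
  return m;
}

// True when the stem ends consonant-vowel-consonant and the last
// consonant is not w, x, or y (e.g. "rat", "hop").
function endsCvc(stem) {
  const n = stem.length;
  return (
    n >= 3 &&
    isConsonant(stem, n - 3) &&
    !isConsonant(stem, n - 2) &&
    isConsonant(stem, n - 1) &&
    !'wxy'.includes(stem[n - 1])
  );
}

// Phase 1, plural rules only: -sses -> -ss, -ies -> -i, -ss kept, -s dropped.
function stripPlural(word) {
  if (word.endsWith('sses')) return word.slice(0, -2);
  if (word.endsWith('ies')) return word.slice(0, -2);
  if (word.endsWith('ss')) return word;
  if (word.endsWith('s')) return word.slice(0, -1);
  return word;
}

// Phase 4, single rule: drop the suffix only if the remaining stem has m > 1.
function stripIfLongEnough(word, suffix) {
  if (!word.endsWith(suffix)) return word;
  const stem = word.slice(0, -suffix.length);
  return measure(stem) > 1 ? stem : word;
}

// Phase 5, final -e rule: drop -e if m > 1, or if m = 1 and the stem
// does not end consonant-vowel-consonant.
function dropFinalE(word) {
  if (!word.endsWith('e')) return word;
  const stem = word.slice(0, -1);
  const m = measure(stem);
  return m > 1 || (m === 1 && !endsCvc(stem)) ? stem : word;
}

console.log(stripPlural('caresses'), stripPlural('ponies'));   // caress poni
console.log(stripIfLongEnough('revival', 'al'));               // reviv
console.log(stripIfLongEnough('adjustable', 'able'));          // adjust
console.log(dropFinalE('probate'), dropFinalE('rate'));        // probat rate
```

The measure check is what keeps short words intact: "rate" has a stem of measure 1 that ends consonant-vowel-consonant, so its final -e survives, while "probate" has a stem of measure 2 and loses it.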

Stemming in Orama

At the time of writing, Orama supports stemming in 26 languages out of the box: Arabic, Armenian, Bulgarian, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Nepali, Norwegian, Portuguese, Romanian, Russian, Serbian, Slovenian, Spanish, Swedish, Turkish, Ukrainian.

You can easily import them by doing the following:

import { create } from '@orama/orama'
import { stemmer } from '@orama/orama/stemmers/it'

const db = await create({
  schema: {
    author: 'string',
    quote: 'string',
  },
  language: 'italian',
  components: {
    tokenizer: {
      stemmingFn: stemmer,
    },
  },
})

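In that case, every document inserted in the db instance will be stemmed using Italian stemming rules. Note that the original Porter algorithm targets English only; stemmers for other languages apply the same suffix-stripping idea with language-specific rules.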

Should I use a stemmer?

Stemming can help refine search results, but it can also produce unexpected ones.

Let’s look at a simple example. Suppose we have the following documents:

[
  {
    text: 'I am running',
  },
  {
    text: 'I am running fast',
  },
  {
    text: 'I run fast',
  }
]

If we search for "running", we would expect to get the first two documents as results.

However, if we use a stemmer, both the query term and the word "running" in the documents are stemmed to "run". The stemmed query then matches all three documents, which might not be what we want.
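The difference can be reproduced with a small inverted index built twice, once without stemming and once with it. This is a simplified sketch, not Orama's internals: `buildIndex`, `search`, and `toyStem` are hypothetical helpers, and the stemmer only knows how to strip "-ing".

```javascript
const docs = ['I am running', 'I am running fast', 'I run fast'];

function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Toy stemmer (assumption: "-ing" only, with consonant undoubling).
function toyStem(word) {
  if (word.length > 5 && word.endsWith('ing')) {
    let stem = word.slice(0, -3);
    if (/([^aeiou])\1$/.test(stem)) stem = stem.slice(0, -1);
    return stem;
  }
  return word;
}

const identity = (word) => word;

// Map each (optionally stemmed) term to the set of document ids containing it.
function buildIndex(documents, stem) {
  const index = new Map();
  documents.forEach((text, id) => {
    for (const token of tokenize(text)) {
      const term = stem(token);
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  });
  return index;
}

// Exact-term lookup; the query goes through the same stem function.
function search(index, query, stem) {
  const ids = index.get(stem(query.toLowerCase()));
  return ids ? [...ids].sort((a, b) => a - b) : [];
}

const withoutStemming = search(buildIndex(docs, identity), 'running', identity);
const withStemming = search(buildIndex(docs, toyStem), 'running', toyStem);

console.log(withoutStemming); // [0, 1]
console.log(withStemming);    // [0, 1, 2]
```

Whether the third document is a welcome extra hit or noise depends on the application, which is why stemming is a choice rather than a default that always helps.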

This can also affect search-as-you-type features, where the user is typing a query and the search results are updated in real-time.

Given the documents above, where the terms "run" and "running" are both present, stemming would affect the search results as follows (assuming the user is typing the search term "running" letter by letter):

  • "r": all three documents are returned (prefix search)

  • "ru": all three documents are returned (prefix search)

  • "run": all three documents are returned (prefix search and exact match)

  • "runn": no documents are returned. At index time the stemmer reduced "running" to "run", while the partial word "runn" is left unchanged at query time, so no indexed term starts with "runn".

  • "runni": no documents are returned again.

  • "runnin": no documents are returned again.

  • "running": all three documents are returned, because the complete word is stemmed to "run"
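The sequence above can be simulated with a prefix search over a stemmed index. As before, this is a rough sketch under stated assumptions: the index, `toyStem`, and `prefixSearch` are hypothetical stand-ins for what a search engine does internally, and the toy stemmer only strips "-ing".

```javascript
const docs = ['I am running', 'I am running fast', 'I run fast'];

function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Toy stemmer (assumption: "-ing" only, with consonant undoubling).
function toyStem(word) {
  if (word.length > 5 && word.endsWith('ing')) {
    let stem = word.slice(0, -3);
    if (/([^aeiou])\1$/.test(stem)) stem = stem.slice(0, -1);
    return stem;
  }
  return word;
}

// Stemmed inverted index: term -> set of document ids.
const index = new Map();
docs.forEach((text, id) => {
  for (const token of tokenize(text)) {
    const term = toyStem(token);
    if (!index.has(term)) index.set(term, new Set());
    index.get(term).add(id);
  }
});

// Stem whatever has been typed so far, then collect every document
// that contains an indexed term starting with it.
function prefixSearch(typed) {
  const prefix = toyStem(typed.toLowerCase());
  const hits = new Set();
  for (const [term, ids] of index) {
    if (term.startsWith(prefix)) ids.forEach((id) => hits.add(id));
  }
  return hits.size;
}

const keystrokes = ['r', 'ru', 'run', 'runn', 'runni', 'runnin', 'running'];
const hitCounts = keystrokes.map(prefixSearch);

console.log(hitCounts); // [3, 3, 3, 0, 0, 0, 3]
```

The dip to zero between "run" and "running" is the flicker a user would see in the results list while typing.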

If you’re building a search-as-you-type feature, consider disabling stemming while the user is typing and applying it only to the final search.

Conclusion

Stemming is a valuable technique in natural language processing and full-text search: by reducing words to their base or root form, it lets a query match documents that use a different form of the same word.

The widely used Porter Stemmer is a rule-based algorithm that works well for English, and comparable rule-based stemmers exist for many other languages. Stemmers of this kind can be easily integrated into applications with Orama.

However, stemming may produce unexpected results in certain cases, such as search-as-you-type features or when dealing with ambiguous terms. Therefore, it’s essential to understand the strengths and limitations of stemming and carefully consider when and how to implement it in your search application.

By doing so, you can create a search experience that is both accurate and efficient for your users.
