(Level: Beginner to Intermediate)
This week’s post is about stemming. It’s a little different from our previous articles because we’ll discuss some English grammar before getting into the technicalities and coding. But even before that, try out this quick experiment (10 seconds, max) and you’ll probably immediately understand what stemming is all about.
Open a web search engine like Google and search for some term – preferably a common English word and not the names of movies, songs, people or other proper nouns. To simplify things, start with a single word for now. I’m into science fiction stuff, so I tried searching for “teleport”. When the search is complete, look at the search results and carefully observe the words in bold. The words that were in bold for me were not just teleport, but also teleports, teleportation, teleporting and teleported! What happened here? I only searched for teleport, but Google returned documents that contained various forms of the word teleport as well. This is important because the Wikipedia page that I wanted was titled Teleportation. The page also had more instances of the word teleportation than teleport. The specifics of Google’s search algorithm aren’t known, but it’s possible that if Google had only searched only and exactly for teleport, the page I was looking for would have received a lower score. You might have encountered a similar situation after searching for your word.
You might already see where this is going. All these different words are basically just forms of the same word: teleport. In grammar, these different forms are called inflections. Dogs is an inflection of dog. Walking is an inflection of walk. Verb conjugations, tense changes, plurals of nouns, and participle forms of words are some types of inflections, but there are many more. Inflected and derived words can be mapped to a kind of base word. This base word is called the word stem. Think of a plant with smaller branches or offshoots emerging from the stem. In the same way, these inflected words are based on the word stem. So, teleports, teleportation, teleported and teleporting are all inflected forms of teleport.
This is important, because when people do a web search or any other information retrieval task, they often do not want to search for a particular word or phrase verbatim. They might want to see results that contain words that are similar to the search query. “Similar” could mean synonyms – for example, when you search for “cheap rooms in New York”, you should also be able to see documents containing the phrase “affordable accommodation in New York”. “Similar” might also mean inflected forms, like in the above example about teleportation. Stemming is simply the process of reducing words to their stems. It is about identifying that teleport is the stem of teleportation, dog is the stem of dogs, and walk is the stem of walking. It is also about recognizing that teleportation and teleported have the same stem. The main idea is that words with the same stem will probably have pretty similar meanings.
In nearly every case of stemming, all we need to do is remove a certain suffix to obtain the word stem. To stem the word teleportation, we just need to chop off the final -ation and we’re left with teleport. It seems pretty simple at first glance, but there is a lot of scope for things to go wrong by just blindly removing the last bit. For example, what do you think the stem of the word fiction is? Fic? Using one of the most popular stemmers, Porter’s stemmer, it’s actually just fiction itself. The stem of fictional is fiction. In 1980, Martin Porter laid down a simple rule-based algorithm to stem most English words with a good level of performance. The gist of the algorithm is that a given word that needs to be stemmed is first expressed in a certain form. It is then modified through a sequence of rule-based transformations until the stem is obtained. These rules are based on an analysis of the structure of English words. Here are a few rules just to show you what the algorithm is like:
SSES -> SS ATIONAL -> ATE TIONAL -> TION
The first rule means words with the suffix SSES are transformed to have the suffix SS. Similarly, in a later step, the suffix ATIONAL transforms to ATE (but only under a certain condition, which isn’t discussed here). We recommend that you read Porter’s original research paper, choose a word to stem and follow the steps to stem it according to the algorithm. It’s a simple and interesting read, and you’ll get to know exactly how it works. Feel free to ask us questions in the comments if you have any! There are also demos of the Porter stemmer available online.
Note that the stem of a word isn’t always a real English word. Happy and happiness both reduce to the stem happi. But this isn’t an error! As long as it is understood that happy and happiness have the same stem, the purpose is met – and that purpose is to make it known to an information retrieval system that happy and happiness have similar meanings, and that the search must include both words.
A slightly improved version of the Porter Stemmer is the Porter2 Stemmer or the Snowball English Stemmer. Let’s take a look at how to use these stemmers in Python with the help of NLTK.
from nltk.stem import * # create a new Porter stemmer stemmer = PorterStemmer() # you could also use the Porter2 Stemmer # stemmer = SnowballStemmer("english") wordToStem = 'teleportation' stemmedWord = stemmer.stem(wordToStem) print(stemmedWord)
The output of this program is
The code challenge for this post is to make a very rudimentary search engine. We’ll provide you with the resources you need. This search engine doesn’t need to return any documents though. It has to count the number of occurrences of the search query in a file containing some results that I’ve already pulled from an earlier Google Search. Just counting is simple, but what we need to do is allow a user to search for a term, stem the term, and look for occurrences of the stem in the results. You can also try to compare how stemming makes a difference in the number of results found.
Here is a link to the file to search for results in. The file contains a list of Google Search results for the word “evaluation”. Your program should stem a search query and return the count of words having the same stem. This way, your program will count inflections of the search query and not just the literal string. View the solution here (Github repo).
Note that our solution may not be how Google’s search engine actually stems. The purpose of the code is just to show how search queries and search results can be stemmed so as to incorporate word inflections and derivations in a search.
An Algorithm for Suffix Stripping (the name of Porter’s paper)
Official page of Porter’s Algorithm
Official site of Snowball
Stemming using NLTK
Github repository for this post