(Level: Beginner)

“Constantly talking isn’t necessarily communicating.” – Charlie Kaufman

So far, we have covered the basics of regular expressions and tokenization. It should be evident by now how simple, yet fundamental, these concepts are. Today’s lesson covers another concept that is essential to almost any NLP task: stopword filtering. You will learn what stopwords are, why we need to filter them, and how to remove them.

What is it?

Let us talk a bit about how a search engine might handle user queries. Here is a fun little scenario. A group of three friends wants to buy a gift for their fourth friend, who is currently away on business. They decide to gift the current bestselling book in their city, Utopia, but there is a small problem: they do not know which book it is! So they all decide to search independently. Here is what their queries look like:

Friend 1: bestseller in utopia now
Friend 2: which book is today's bestseller in utopia?
Friend 3: need utopia's bestseller and need it now

Interesting, isn’t it? A simple query such as this one can be phrased in so many different ways. Will this confuse the search engine? Sure. What does the search engine really need in order to return the appropriate result? Upon close inspection, you will observe that the common and most informative words are:

bestseller utopia

Effectively, these words are all that a search engine needs to return the required result. Yet different people formulate different queries, even when they are searching for the same thing. In short, all that matters to a search engine are these few keywords, which convey the most meaning and indicate context. The remaining words, which convey no useful information to the search engine and do not contribute to better results, are called stopwords. More formally, stopwords are what linguists call function words: words that carry little meaning on their own, or are ambiguous. Most search engines are built to ignore stopwords, based on a stoplist, which is simply a collection of stopwords.

In the previous example, these are the stopwords:

Query 1: in, now
Query 2: which, is, today's, in
Query 3: need, and, it, now

As you must have realized by now, these words have no positive effect on the search results. Typically, search engines maintain a list of keywords for every web page on the Internet. These keywords express what information the page is trying to convey, and stopwords are strictly not keywords. So when a user searches for something, keywords are extracted from her queries, as we saw in the example above. These keywords are used to search the search engine’s database for relevant results. The web pages whose keywords most closely match those of the user query are ranked and then displayed to the user. Most modern search engines are far more sophisticated, but keyword matching remains an essential component of these systems.
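To make this concrete, here is a minimal sketch of query keyword extraction using the three queries above. The stoplist here is a small, made-up list chosen just for this example (it is not a real search engine’s stoplist), and the lone 's' entry is there because the letters-only tokenizer from the previous lesson splits "today's" into "today" and "s":

#A minimal sketch of query keyword extraction, using a small, hypothetical stoplist.
import re

stoplist = {'in', 'now', 'which', 'is', 'today', 's', 'need', 'and', 'it'}
queries = [
    "bestseller in utopia now",
    "which book is today's bestseller in utopia?",
    "need utopia's bestseller and need it now",
]
for query in queries:
    words = re.findall("[a-zA-Z]+", query.lower())                   #Tokenize, as in the previous lesson.
    keywords = [word for word in words if word not in stoplist]      #Drop the stopwords.
    print(keywords)

All three queries reduce to essentially the same handful of keywords, with "bestseller" and "utopia" surviving every time, which is exactly why stripping stopwords helps the search engine see that the friends are asking the same question.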

In the previous lesson on tokenization, we tried to identify the ten most frequently occurring words in Alice’s Adventures in Wonderland, a book by Lewis Carroll. Here is a summary of what we discovered:

[('the', 1818), ('and', 940), ('to', 809), ('a', 690), ('of', 631), ('it', 610), ('she', 553), ('i', 545), ('you', 481), ('said', 462)]

As you may have noticed, there is nothing in this list that conveys anything useful or unique about the book. Keywords are what make web pages unique. Without them, it would be virtually impossible to differentiate between web pages. Search engines identify the most frequently occurring words in web pages using more complex algorithms that build on top of what we have already covered. Now, let us pretend that a search engine called Searchybot uses the ten most frequently occurring words in a web page as keywords. If most of Searchybot’s pages have keywords that are stopwords, then we are dealing with a bad search engine here. Searchybot would return very poor results, primarily because it has no way of differentiating between web pages.

How do I remove them?

Now we get to the fun part. Python, our friend, is back in the picture. But before we start coding, we need to know what words to filter, right? Basically, we need to populate a list of stopwords ourselves, every occurrence of which we will remove from the input text. So, based on our observation of the word frequencies in Alice’s Adventures in Wonderland, how about we use these few words as our list of stopwords?

['the', 'and', 'a', 'of', 'is']

Okay, so now that we have a list of stopwords, let us quickly analyze the flow of our program. We begin by reading some input text, for which we want to identify the most frequently occurring words. Next, we tokenize the input text into a collection of words and store it in a list. Each word in this tokenized list is checked against the stopwords list; if the current word is a stopword, we remove it from the tokenized list. Finally, we compute the top three most frequently occurring words and display them.

In case you haven’t read the previous articles on regexes and tokenization, it is highly recommended that you explore those first.

So, here’s what the code looks like:

#Top 3 most commonly occurring words in some text, with stopword filtering.
import re
from collections import Counter                                     #Import Counter into our program.
regex = "[a-zA-Z]+"
stopwords_list = ['the', 'and', 'a', 'of', 'is']
text = """The beginning of the end of a wonderful life is
the ending of the that life and the beginning of a new one."""     #Some input text, stored as a string.
sentence = text.lower()                                             #Convert all letters of the input text to lower case.
tokenized_words = re.findall(regex, sentence)
filtered_words = [word for word in tokenized_words if word not in stopwords_list]  #Remove stopwords from the tokenized words list.
wordcount = Counter(filtered_words)                                 #Store words and their word counts in a Counter.
print(wordcount.most_common(3))

Most parts of this code should be familiar by now. We are using the tokenizer you built in the previous lesson, using regexes. However, there is one line of code that may be a little confusing at first. Look at this closely:

filtered_words = [word for word in tokenized_words if word not in stopwords_list]

This is a very Pythonic way of writing code. Python as a language is magical, largely because of its style and strong emphasis on simplicity. How about we rewrite it this way?

filtered_words = list()
for word in tokenized_words:
	if word not in stopwords_list:
		filtered_words.append(word)

Easier to understand? Thought so! All we are effectively doing is checking whether each word in the input text is a stopword. If it is a stopword, we ignore it; otherwise, we add it to the list called filtered_words. So what output does this produce?

[('beginning', 2), ('life', 2), ('new', 1)]

This gives you a good idea of what the text is about. Worthy of being keywords, don’t you think?

And now, what if we do not perform stopword filtering? Here is what the output looks like:

[('the', 5), ('of', 4), ('beginning', 2)]

As you may have observed, “the” and “of” are the top two most common words. In a real-life application, such as a search engine, these provide no help whatsoever in finding the most relevant results. It is best to filter out stopwords.

Sure, it is going to be hard to come up with your own list of stopwords, because there are so many. One solution is to use an existing collection of stopwords. One such simple-to-use collection is provided by NLTK: a stopwords corpus attributed to Porter et al., containing about 2,400 stopwords in 11 languages. We will use the English one, for obvious reasons. Feel free to experiment with other languages as well. You can implement it as follows:

#Top 3 most commonly occurring words in some text, with stopword filtering.
import re
from nltk.corpus import stopwords                                   #Import the stopwords corpus from NLTK.
from collections import Counter                                     #Import Counter into our program.
#Note: if the corpus is not already on your machine, run nltk.download('stopwords') once.
regex = "[a-zA-Z]+"
text = """The beginning of the end of a wonderful life is
the ending of the that life and the beginning of a new one."""     #Some input text, stored as a string.
sentence = text.lower()                                             #Convert all letters of the input text to lower case.
tokenized_words = re.findall(regex, sentence)
filtered_words = [word for word in tokenized_words if word not in stopwords.words('english')]  #Remove stopwords from the tokenized words list.
wordcount = Counter(filtered_words)                                 #Store words and their word counts in a Counter.
print(wordcount.most_common(3))

Two key things to note here. First, you will need to import NLTK’s stopwords module. You can also remove the stopwords_list variable, since we won’t be using it anymore. Second, note this line of code:

filtered_words = [word for word in tokenized_words if word not in stopwords.words('english')]
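One small refinement, not part of the lesson’s code but worth knowing: written this way, stopwords.words('english') is called once for every word in the comprehension, so the stopword list is rebuilt on each iteration. A minimal sketch of the usual fix, assuming the same variable names as above, is to build a set once and test membership against that:

#Build the stopword set once, then reuse it; set lookups are also faster than list lookups.
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in tokenized_words if word not in english_stopwords]

The output is the same either way; only the amount of repeated work changes.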

That’s that. Throughout this post, we have been criticizing the presence of stopwords in text. But are there cases where they might actually turn out to be useful? Consider these examples; a small sketch of how you might handle them follows the list:

restaurants v/s restaurants nearby
matrix v/s the matrix                 (Movie, The Wachowskis, 1999) 
dance v/s one dance                   (Song, Drake ft. Wizkid & Kyla, 2016)
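In queries like these, words such as “the” or “one” change the meaning entirely, so blindly dropping every stopword would hurt. One possible approach, sketched here with a hypothetical query string, is to customize the stoplist by removing the words you want to keep before filtering:

#A sketch of a customized stoplist: keep "the" and "one" because they matter for these queries.
from nltk.corpus import stopwords

custom_stopwords = set(stopwords.words('english')) - {'the', 'one'}  #Whitelist the words we want to preserve.
query = "the matrix"                                                  #Hypothetical user query.
keywords = [word for word in query.lower().split() if word not in custom_stopwords]
print(keywords)                                                       #['the', 'matrix']

Which words deserve to stay on the whitelist depends entirely on your application; there is no universal stoplist.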

Two tasks for this week. The first one requires you to process this blog post; you need to identify the ten most frequently occurring words in this post, after eliminating stopwords. You can use NLTK’s stopwords list to perform the filtering. Second, compute the ten most frequently occurring words in last week’s book of discussion, Alice’s Adventures in Wonderland, after eliminating stopwords.

The code accompanying this week’s lesson can be found on our GitHub repository. Code from all previous lessons can be found here. Feel free to fork/clone our repo. Happy coding!

Curious? Check these out:

What does Stanford have to say about stopwords?
Could Wikipedia’s 100 most common words serve as a good stopwords list?