If you’ve ever wanted to know what a document or piece of text is about without reading the entire thing, you’ll be glad to know you can do so using keywords. Keywords, in this context, are words or short phrases that concisely describe the contents of a larger text. This post describes the working of a relatively new approach to automatically generating keywords from a given document, called Rapid Automatic Keyword Extraction (RAKE).
Why do we need keywords in the first place?
- To concisely represent the essential contents of a text/document.
- To fetch relevant documents using keywords in an Information Retrieval (IR) system.
- To improve the presentation of a document (say, by highlighting important phrases).
- To help classify documents.
Normally, keywords are manually chosen by the authors or publishers of a document. Personal experience has proven to me that this is a tedious activity, especially when multiple documents are involved. Luckily, there are ways for computers to quickly and reasonably accurately do the work for us. The task of automatically identifying the most suitable terms (from the words used in the document) that describe a document is called keyword extraction. The focus of this post is a keyword extraction algorithm called Rapid Automatic Keyword Extraction (RAKE). Click here to view the original research, which was published in 2010.
The hallmarks of the RAKE algorithm are
- its ability to operate independently on documents without referring to a corpus (domain independence); and
- its very reasonable precision despite its simplicity and computational efficiency.
RAKE is built on the observation that keywords usually contain multiple informative words (called content words) but not punctuation and stopwords. So, in a document about various corn-based foods, “corn fritters”, “popcorn” and “corn flakes” might appear as keywords while “corn on the cob” wouldn’t be considered because it has two very common stopwords: “on” and “the”. The entire algorithm is as follows.
Given an input document on which we want to extract keywords,
- Split the document into an array of words, breaking it at word delimiters (like spaces and punctuation).
- Split the words into sequences of contiguous words, breaking each sequence at a stopword. Each sequence is now a “candidate keyword”.
Let’s see what we have so far.
Consider the short text “A scoop of ice cream.” We break this into words to get
["A", "scoop", "of", "ice", "cream"]
Step 2 arranges these words into sequences by avoiding stopwords. The words “A” and “of” are bound to be on any stoplist you’re using. So, by reading the array from left to right, skipping stopwords, and creating a new candidate keyword every time a stopword is encountered, we obtain two candidate keywords:
["scoop", "ice cream"]
Now back to the algorithm.
- Calculate the “score” of each indivudual word in the list of candidate keywords. This is calculated using the metric:
It’s easy to understand what the frequency of a word is. It’s simply the number of times the word occurs in the entire list of candidate keywords. So our word frequencies are:
frequency("scoop") = 1 frequency("ice") = 1 frequency("cream") = 1
But what is the degree of a word, and more importantly what does it mean? The explanation involves a bit of (simple) graph theory. If you don’t want to get into that, you can skip the next paragraph; the intuition behind the degree of a word is summarized after the explanation.
The degree of a word in this context is similar to the degree of a node in a graph. Let’s represent this problem as a graph. Draw an undirected graph with each content word as a node. Connect two nodes together if they appear in the same candidate keyword. The more connections a node has (i.e the higher its degree), it tends to mean that the word occurs often and in longer candidate keywords.
So, the degree of a word represents how frequently it co-occurs with other words in the candidate keywords. The degrees of the words in our sample sentence are:
degree("scoop") = 1 degree("ice") = 2 degree("cream") = 2
If that didn’t make sense, to find the degree of word W, all you need to do is count the number of words that occur in candidate keywords containing W, including W itself. Consider the following list of candidate keywords, taken from an example that is used in the original RAKE research paper:
Compatibility – systems – linear constraints – set – natural numbers – Criteria – compatibility – system – linear Diophantine equations – strict inequations – nonstrict inequations – Upper bounds – components – minimal set – solutions – algorithms – minimal generating sets – solutions – systems – criteria – corresponding algorithms – constructing – minimal supporting set – solving – systems – systems
If we want to find the degree of the word “set”, degree(“set”), we simply count the total number of words that appear in candidate keywords containing the word “set”. So,
degree("set") = 6 (see the bold candidate keywords) degree("natural") = 2 (see the underlined candidate keyword) ...
Degree(“set”) > degree(“natural”) because “set” co-occurs with other words more frequently than “natural”. Also, if there were a long candidate keyword containing, say, 5 words including some word W, then the degree of W would be at least 5. Thus, a higher word degree could also indicate that a word appears in a long candidate keyword.
We now know what the degree and frequency of a word are, but what about their ratio, which is the word score metric used in RAKE?
word_score = degree(word)/frequency(word) = ???
The word score has the word degree in the numerator and the word frequency in the denominator. This means that the word score is proportional to the word degree and inversely proportional to the frequency. We know that the degree is high when a word appears frequently, especially with other words, and when the word appears in long candidates. The frequency is high when a word appears frequently, regardless of where it appears. So, the RAKE word score metric disfavours words that appear too frequently and not in long candidates, and favours words that predominantly occur in longer candidate keywords.
Back to the algorithm:
- For each candidate keyword, add the word scores of its constituent words to find the candidate keyword score.
- Take the first one-third highest scoring candidates from the list of candidates as the final list of extracted keywords.
In our first example, the word scores are:
# word_score = degree(word)/frequency(word) word_score("scoop") = 1/1 = 1 word_score("ice") = 2/1 = 2 word_score("cream") = 2/1 = 2 score("scoop") = word_score("scoop") = 1 score("ice cream") = word_score("ice") + word_score("cream") = 4
So, the highest scoring extracted keyword is from the sentence “A scoop of ice cream.” is “ice cream”.
That’s about it! The advantages of RAKE are its speed, compuational efficiency, ability to work on individual documents and its precision despite its simplicity. We recommend that you read the original research work, which is much more detailed, contains performance metrics, and describes how to generate stoplists to configure RAKE for specific domains.
Try a Python implementation of RAKE. Implementations exist in other languages as well.