(Level: Beginner)
“A computer is only as smart as its programmer.” – Unknown
Last week’s tutorial covered the basics of regular expressions, or regexes, along with some sample code to get you started. This week is about tokenization. Sounds fancy? It is easy to understand, yet extremely powerful. By the end of this tutorial, you’ll understand what tokenization is, why you need it, and how you can build your own tokenizer.
What is it?
The Harry Potter series (J.K. Rowling) is an exciting read. Here is a fun idea: let us say that you are trying to determine the top 100 words that occur in the series. How would you do that? There are 7 books in total, each book has at least a few hundred pages, and each page has at least a few hundred words. Intuitively, you would focus on individual words, not the books as a whole. Immensely challenging for any human being, to say the least. However, this seems like a problem that computers could solve, don’t you think?
To simplify the problem a bit, let us first try to determine the top 100 words in the second book of the series, “Chamber of Secrets”. We could feed the computer a copy of the book as a text file, and store the whole book in a really big string. This really big string contains all the chapters, paragraphs, sentences and words in the book. Next, we need to determine the word count of each unique word in the book. This means that we need to split the really big string into a collection of all the words in the book, and count the number of occurrences of each word. Finally, we rank the words in descending order of their word counts, and pick the top 100 words. Confusing? Don’t worry, the whole idea behind this example was to get you thinking about tokenization. You’ll understand it by the end of the tutorial.
Simply put, tokenization is the process of splitting text into smaller components, called tokens. When we say text, we could be talking about paragraphs or sentences. As an example, we can tokenize a paragraph into a collection of the sentences that make up the paragraph. Similarly, we can tokenize a sentence into a collection of the words that make up the sentence. A token is any combination of characters that matches a pattern, which you read about in the previous article. In our example above, we tokenized a really big string, containing all the words in the book, into a collection of words. This makes it easy for us to determine the word count of each word.
Examples:
Let us try tokenizing a paragraph as a collection of sentences. Before tokenization, we have a paragraph:
The sun rises in the east. This is a well known fact! What are your thoughts?
After tokenization, we now have a collection of sentences:
Token 1: The sun rises in the east.
Token 2: This is a well known fact!
Token 3: What are your thoughts?
Each sentence in the above example is a token.
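If you would like to see this in code right away, here is a minimal sketch using Python’s re module from last week’s tutorial. It is just one simple way to do it, and it assumes that every sentence ends in a full-stop, exclamation mark or question mark:

import re

paragraph = "The sun rises in the east. This is a well known fact! What are your thoughts?"

# Each sentence token is a run of characters that ends in ., ! or ?
sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", paragraph)]
print(sentences)
# ['The sun rises in the east.', 'This is a well known fact!', 'What are your thoughts?']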
Similarly, we can tokenize a sentence as a collection of words. Before tokenization, we have a sentence:
The sun rises in the east.
After tokenization, we now have a collection of words:
Token 1: The
Token 2: sun
Token 3: rises
Token 4: in
Token 5: the
Token 6: east
That’s what tokenization is all about: splitting text into a collection of tokens. As you may have noticed in the above example, tokenization got rid of the spaces and the full-stop. So, in addition to splitting text into tokens, tokenization can also be used to get rid of certain kinds of characters, like punctuation in this case. The characters that you want to eliminate are called delimiters.
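Here is a tiny sketch of that idea using nothing but Python’s built-in string methods. It is just one simple way to do it, and the set of punctuation characters to strip is an assumption made for this example:

# Treat whitespace as the delimiter, then strip punctuation from the ends of each token.
sentence = "The sun rises in the east."
tokens = [word.strip(".,!?;:") for word in sentence.split()]
print(tokens)  # ['The', 'sun', 'rises', 'in', 'the', 'east']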
Okay, so what’s the point?
It is always easier to work with several smaller problems than with one giant problem, right? That’s kind of the idea here, intuitively. Splitting a big chunk of text into its constituent tokens makes it easier to perform operations that involve those tokens, such as computing word counts, checking spellings, finding patterns or even removing tokens.
Have you ever wondered how the machine creates a runnable program from the code that you write, which is nothing but a lot of text? The whole process is called compilation. Very broadly speaking, a compiler converts the code that you write into a form that the machine understands, which can then be executed. Along the way, the compiler needs to check for errors in syntax that you may have made when you wrote the code. Do you know the first step that a compiler performs when you compile code? Tokenization, of course. In fact, in almost every text-based problem, tokenization is the first step.
Let us revisit our problem of trying to compute the top 100 words in the Harry Potter series. If we are not careful enough, our machine might end up considering spaces and punctuation marks like commas, semi-colons and full-stops as valid tokens, and compute their frequencies as well. This would be detrimental to our experiment, and we would end up with an incorrect solution. The spaces and punctuation marks in this case are the delimiters, and we can use the tokenizer to eliminate them. So the job of the tokenizer in this problem is two-fold: split the text into word tokens, and eliminate the delimiters.
Alright, so how do I code a tokenizer?
Great, so now that you know what a tokenizer is and why you will need one, let us focus on how to create one. Enter Python. We will begin by building a simple word tokenizer from scratch.
Let us begin by creating a simple Python program to split sentences into words, while removing spaces. This is called a whitespace tokenizer. In this example, you will be using regexes to create the tokenizer. See how connected things are?
In case you aren’t familiar with regexes, it is highly recommended that you read this first.
# Whitespace tokenizer.
import re

regex = "[^\s]+"  # one or more non-whitespace characters
sentence = """At the age of one year old, Harry had somehow survived a curse from the greatest Dark sorcerer of all time, Lord Voldemort, whose name most witches and wizards still feared to speak."""
tokenized_words = re.findall(regex, sentence)
print(tokenized_words)
This produces the output:
['At', 'the', 'age', 'of', 'one', 'year', 'old,', 'Harry', 'had', 'somehow', 'survived', 'a', 'curse', 'from', 'the', 'greatest', 'Dark', 'sorcerer', 'of', 'all', 'time,', 'Lord', 'Voldemort,', 'whose', 'name', 'most', 'witches', 'and', 'wizards', 'still', 'feared', 'to', 'speak.']
That’s it! You have built your own whitespace tokenizer. As you may have observed from the output above, the string sentence has been split into a list of words, stored in tokenized_words. The whitespace has been eliminated. And all of this is done using regexes that you already know about.
But there is still an issue that needs to be addressed. Other punctuation marks, such as full-stops and commas, have not been removed. These will affect our word count too. For now, let us eliminate all punctuation. How do you think we should do that?
To simplify things, let us look only for letters of the alphabet, and eliminate everything else. Can regexes come to the rescue again? Indeed they can. We modify the code as follows:
import re

regex = "[a-zA-Z]+"  # one or more upper or lower case letters
sentence = """At the age of one year old, Harry had somehow survived a curse from the greatest Dark sorcerer of all time, Lord Voldemort, whose name most witches and wizards still feared to speak."""
tokenized_words = re.findall(regex, sentence)
print(tokenized_words)
Output:
['At', 'the', 'age', 'of', 'one', 'year', 'old', 'Harry', 'had', 'somehow', 'survived', 'a', 'curse', 'from', 'the', 'greatest', 'Dark', 'sorcerer', 'of', 'all', 'time', 'Lord', 'Voldemort', 'whose', 'name', 'most', 'witches', 'and', 'wizards', 'still', 'feared', 'to', 'speak']
There you go, much better, isn’t it? Now how about we try this out on a real document? Let us rank the top 10 words of “Alice’s Adventures in Wonderland”, by Lewis Carroll. Note that the ranking may not be exact, because of the simplicity of the tokenizer we are using, but for all practical purposes it will be very close. Before you start, feel free to get a copy of the book that we will be using for this example from Project Gutenberg. All the associated code for this week’s tutorial can be found here.
For the purpose of this example, we will call the book that you just downloaded alice.txt. We first copy the contents of the text file into a string. Here’s another important thing to note: Python treats two variants of the same word, one in upper case and one in lower case, as two separate words, e.g. He and he are treated as two different words. We do not want that. So we will convert all the letters to the same case. Let us convert them to lower case, using Python’s lower() string method. Next, we perform tokenization.
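As a quick illustration of the case-folding step (the sample string here is made up for this example):

# Without lower-casing, 'He' and 'he' would be counted as two different words.
text = "He said he would come."
print(text.lower())  # he said he would come.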
Now that we have a list of tokens, we need to find the word counts. This can be done using a container called Counter, from Python’s collections module. A container is an object that stores other data, like lists and dictionaries do. Counter is a dictionary that maps each item to its frequency. Finally, we use its most_common() method to retrieve the top-ranked words, in descending order of frequency; we pass the number of words we want to see as an argument to most_common().
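Here is a tiny sketch of Counter and most_common() on a toy list of tokens, before we apply them to the whole book:

from collections import Counter

# Count how many times each token occurs in a small sample list.
tokens = ['the', 'sun', 'rises', 'in', 'the', 'east']
wordcount = Counter(tokens)
print(wordcount.most_common(2))  # [('the', 2), ('sun', 1)] -- words with equal counts keep their first-seen order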
# Top 10 most commonly occurring words in Alice's Adventures in Wonderland, by Lewis Carroll.
import re
from collections import Counter  # Import Counter into our program.

regex = "([a-zA-Z]+)"
file = open("alice.txt", "r")  # Open the book, stored as a text file.
sentence = file.read().lower()  # Read the contents of the book, convert all letters to lower case and store them in a string.
tokenized_words = re.findall(regex, sentence)
wordcount = Counter(tokenized_words)  # Store words and their word counts in a Counter.
print(wordcount.most_common(10))  # Display the top 10 most commonly occurring words.
The top 10 most commonly occurring words, along with their frequencies of occurrence, are:
[('the', 1818), ('and', 940), ('to', 809), ('a', 690), ('of', 631), ('it', 610), ('she', 553), ('i', 545), ('you', 481), ('said', 462)]
Impressive? That’s how powerful computers are. Tasks that would take a human being years can be done in a matter of seconds on a computer. And that’s why domains like NLP, ML and AI are gaining traction: human beings are trying to harness the power of these machines to solve everyday problems.
This week’s challenge is going to be fun. Try to compute the top 10 words in other books that you can find on Project Gutenberg. In addition to this, there is one more problem that you must attempt to solve. Our tokenizer unfortunately treats words containing apostrophes, such as don’t, as two separate words, don and t. Another example: Tim’s becomes Tim and s. This is obviously wrong. Can you modify the regex to handle such cases?
First off, great article!
One minor comment, I believe it’s better practice to use raw strings for regexes rather than ordinary double-quoted strings. It allows for more precise regexes.
You’re absolutely right about that! In fact, this is addressed in the Regex – Part 2 post. When you use a raw string, Python doesn’t pre-interpret special characters in the string before passing it to the re module. The re module will receive your regex pattern exactly as you’ve entered it.
hey guys.. just trying to understand Regex better, but haven’t quite been able to solve the last part that you guys asked here (about don’t vs don etc)
so if I do this:
test = """don't Don, i'll do it"""
regex = r"(?=.*[a-zA-Z\'])([a-zA-Z\']+)"
tokenized_words = re.findall(regex, test)
the above works fine and I get “don’t” as well as “i’ll” (along with other words)
but if I apply it to alice.txt, it doesn’t work.. is there any easy way to pick this up pls?
thanks vm!
hello friend, thank you for your enthusiasm and positive feedback about our blog. So your question is a very interesting one. I was just checking out the code and alice.txt. So the text in alice.txt is in utf-8 and contains two different types of ‘ symbols.
One of them has the code point U+2019 and is called RIGHT SINGLE QUOTATION MARK, check this out: http://unicode.org/cldr/utility/character.jsp?a=2019
The other one has the code point U+0027 and is called APOSTROPHE, check out http://unicode.org/cldr/utility/character.jsp?a=0027
Although visibly the difference between the two is subtle, these are two different characters. In your code, you have used the APOSTROPHE, which is what comes up when you press that key. But alice.txt uses a RIGHT SINGLE QUOTATION MARK instead of an APOSTROPHE, possibly because it was created with a document editor like Word. To make your tokenizer detect don’t and i’ll in alice.txt, you’ll have to add this other symbol (U+2019) to your regex too. You can do that by copying the U+2019 symbol from the link above and escaping it with \ as you do normally.
Hope this helps. For future reference, try entering the character in doubt here: http://unicode.org/cldr/utility/list-unicodeset.jsp
Replace the text in the Input box with whatever symbol whose Unicode info you need. Thanks!
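For example, a minimal sketch of such a pattern (just one of several ways to write it) could keep both characters inside word tokens:

import re

# Treat both the APOSTROPHE (U+0027) and the RIGHT SINGLE QUOTATION MARK (U+2019)
# as part of a word, so tokens like don't / don’t stay in one piece.
regex = r"[a-zA-Z'\u2019]+"

sample = "don't stop, don\u2019t stop"
print(re.findall(regex, sample))  # ["don't", 'stop', 'don’t', 'stop']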
Sweet! let me try that out.. thanks for your response.. on to Stemming now!
Nice post. Surprisingly, regex = "([a-zA-Z]+)" yields different counts compared to regex = "[^\s]+" for the Alice text. Why is this?
Hello, so regex = "([a-zA-Z]+)" looks for one or more occurrences of letters (upper case and lower case) only.
However, regex = "[^\s]+" looks for one or more occurrences of non-whitespace characters. This includes letters, numbers and any other special characters and punctuation. Thus there will be a difference in the counts between the two.
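A small illustration of that difference on a made-up line:

import re

line = "Alice's Adventures in Wonderland -- Chapter 1."

# Letters only: digits and punctuation are dropped, and Alice's is split into two tokens.
print(re.findall(r"[a-zA-Z]+", line))
# ['Alice', 's', 'Adventures', 'in', 'Wonderland', 'Chapter']

# Any run of non-whitespace characters: digits and punctuation stay attached to the tokens.
print(re.findall(r"[^\s]+", line))
# ["Alice's", 'Adventures', 'in', 'Wonderland', '--', 'Chapter', '1.']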