In the first regex post, we discussed the concept of regular expressions and some of their applications, and wrote a small program to extract years from a text. We also looked at some important character classes like uppercase letters, word characters, digits and whitespace. In this regex tutorial, we will learn in greater depth about character classes and anchors.
Let’s quickly review what we know about character classes. A character class is simply a set of characters that you can create and use in your regex to match text. Some important character classes like Digits, Word characters and Whitespace are predefined so that you don’t need to define them yourself. So, instead of using [0-9] for digits, you could use \d.
Note that some programs (like grep in its default mode) don’t recognize \d as the character class for digits, so you will need to use [0-9] or whatever predefined class they use for digits (grep uses [:digit:], which has to be written inside a bracket expression as [[:digit:]]).
An important point to note about predefined character classes like \d, \w and \s is that if you use the uppercase form, the class gets negated. So \d is the class of all digits, while \D is the class of all non-digits. The whitespace class is represented by \s, while the non-whitespace character class is denoted by \S. What do you think would happen if your regex were “[\d\D]”?
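If you would like to experiment with the negated forms, here is a minimal sketch using Python's re module. The sample string and the variable names are made up for this illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Room 42, floor 7"

digits = re.findall(r"\d", sample)       # every digit
non_digits = re.findall(r"\D", sample)   # every character that is not a digit
either = re.findall(r"[\d\D]", sample)   # a digit OR a non-digit

print(digits)
print("".join(non_digits))
print(len(either), len(sample))
```

Comparing the last two numbers is one way to check your answer to the question above.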
What if you wanted to negate a character class you defined yourself? Simply use the “^” sign at the beginning of the character class. Let’s say you want to match consonants in a word that only contains lowercase letters of the alphabet. What are the ways you could do this? You could form a character class by listing out all the consonants directly: [bcdfghjklmnpqrstvwxyz]. A slightly more elegant solution would be [b-df-hj-np-tv-z] (note that the last range has to be split around u, because p-z would include that vowel). Or you could negate the set of vowels, like so: [^aeiou]. But be careful, because this last regex matches ANY character that isn’t a lowercase vowel, such as digits, special characters and uppercase letters.
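The three approaches can be compared side by side. This is a small sketch, assuming a made-up input word that deliberately contains digits and punctuation so the difference shows up.

```python
import re

# Hypothetical input: lowercase letters plus digits and a special character
word = "regex101!"

listed = re.findall(r"[bcdfghjklmnpqrstvwxyz]", word)  # consonants listed one by one
ranges = re.findall(r"[b-df-hj-np-tv-z]", word)        # same set written as ranges
negated = re.findall(r"[^aeiou]", word)                # anything that is not a, e, i, o or u

print(listed)   # ['r', 'g', 'x']
print(ranges)   # ['r', 'g', 'x']
print(negated)  # ['r', 'g', 'x', '1', '0', '1', '!']
```

The first two classes agree, while the negated class also picks up the digits and the exclamation mark.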
When using special characters in regexes, it is a good idea to escape them with a backslash because they may have special meanings. For example, a full stop (.) acts like a character class that represents any character (in most regex flavors, any character except a newline by default). If you wanted to match a literal full stop, you would have to use “\.”. The following are some characters that you should escape if you want to match them literally, since they can have special meanings when you don’t escape them. Some of them are special only in certain contexts: the hyphen only inside a character class, and the forward slash only in languages that use it as a regex delimiter.
[ ] \ / ^ $ . | ? * + ( ) -
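To see the effect of escaping, here is a short sketch in Python. The sample strings are invented for the example; re.escape is a standard-library helper that escapes a literal string for you.

```python
import re

candidates = "3.14 3x14"

# Unescaped: the full stop matches any character, so both tokens match
unescaped = re.findall(r"3.14", candidates)
# Escaped: only the literal full stop matches
escaped = re.findall(r"3\.14", candidates)

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']

# re.escape builds a pattern that matches a string literally
text = "Version 3.14 costs $5 (approx.)"
literal = "$5 (approx.)"
found = re.search(re.escape(literal), text)
print(found.group())  # $5 (approx.)
```

Using re.escape is handy when the text you want to match comes from somewhere else and you don't know in advance which special characters it contains.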
Anchors are special characters used in regular expressions to help you match text occurring at certain positions.
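\b is an anchor that stands for word boundary. A word boundary is a position, not a character: it is the zero-width spot between a word character (\w) and a non-word character such as a punctuation mark, special character or whitespace character, or between a word character and the start or end of the text. This is typically where a word begins or ends. \B is the negation: it matches at every position that is not a word boundary.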
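A small sketch may help make the difference between \b and \B concrete. The sample text is made up, and the start positions are collected only to show where each match was found.

```python
import re

# Hypothetical text in which "cat" appears alone, inside a word,
# at the end of a word and at the start of a word
text = "cat concatenate bobcat cats"

# "cat" with a word boundary on both sides: only the standalone word
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
# "cat" with NO word boundary on either side: only the one inside "concatenate"
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", text)]
# words that begin with "cat"
starts_with = re.findall(r"\bcat\w*", text)

print(whole_word)   # [0]
print(inside_word)  # [7]
print(starts_with)  # ['cat', 'cats']
```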
Let’s say you wanted to extract words ending in -ous from a given text. You want to capture words like fabulous and enormous, but not words like mouse or rousing. Think of the pattern of these words. It’s a sequence of letters followed by ous, and nothing appearing after -ous. Our regex could be “[a-z]+ous”, but this would also match words in which ous appears in the middle, because there is nothing in our regex that restricts what follows ous. We want ous to be at the end of the word, or at a word boundary, so we use “[a-z]+ous\b”.
import re

# regex to match words ending with ous
regex = r"[a-z]+ous\b"
text = "The marvellous mouse ran into the enormous house."
print(re.findall(regex, text))
The output of this program is:
['marvellous', 'enormous']
You might be wondering about the line:
regex = r"[a-z]+ous\b"
What’s the letter r doing there at the beginning of the string?
Now’s a good time to talk about raw strings in Python. Normal strings in Python interpret the backslash character followed by another character as having a special meaning. For example, when you print “\n”, the output will be a newline. Raw strings, on the other hand, which are prefixed by an r or R, preserve the characters exactly as they are and do not consider special meanings. So when you print r“\n”, the output will be literally “\n”, a backslash followed by an ‘n’.
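One quick way to convince yourself of the difference is to compare the lengths of the two strings. The variable names below are only for illustration.

```python
normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n

print(len(normal))    # 1
print(len(raw))       # 2
print(raw == "\\n")   # True: a doubled backslash in a normal string gives the same result
```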
Why are we using raw strings here?
Regex matching is done by the re module and not directly by the Python interpreter itself. In Python, “\b” is interpreted as a backspace character, but the re module interprets it as a word boundary. Here is what would happen if we used a normal string instead of a raw string. First, we assign regex = “\b”. The Python interpreter interprets “\b” as a backspace character. So the value in the variable regex is now a backspace character instead of a literal backslash followed by a literal b. Now in the re.findall(regex, sometext) line, the regex variable is passed to the re module as a backspace character, and it tries to match that against sometext. This is not what we want! We want “\b” to be passed literally to the re module, so that the re module can interpret “\b” as a word boundary.
If all of that went over your head, the gist is this: we don’t want the Python interpreter to misinterpret our regexes before they reach the re module. Our regex should only be interpreted by the re module. Python shouldn’t touch our strings. So use raw strings when defining your regexes.
In our example, using a normal string wouldn’t return any matches because the \b gets interpreted as a backspace character. Try it!
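If you want to check this for yourself, the following sketch runs the same pattern as a raw string and as a normal string on the example sentence. The variable names are chosen just for this comparison.

```python
import re

text = "The marvellous mouse ran into the enormous house."

raw_pattern = r"[a-z]+ous\b"     # \b reaches the re module as a word boundary
normal_pattern = "[a-z]+ous\b"   # Python turns \b into a backspace character first

raw_matches = re.findall(raw_pattern, text)
normal_matches = re.findall(normal_pattern, text)

print(raw_matches)               # ['marvellous', 'enormous']
print(normal_matches)            # []
print(repr(normal_pattern[-1]))  # '\x08', the backspace character
```

The normal-string pattern looks for the letters "ous" followed by an actual backspace character, which never occurs in the sentence.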
Back to Anchors
We discussed the word boundary anchor. Two other frequently used anchors are “^” and “$”. The caret symbol “^” lets you match at the beginning of a line, while the dollar symbol “$” lets you match at the end. (In Python’s re module, these two anchors refer to the beginning and end of the whole string by default; pass the re.MULTILINE flag to make them match at the beginning and end of every line.) A line is an entire single row of text until a newline character. Don’t confuse a line with a sentence. A single line may have several sentences. A single line may also have just one character. All of these are separate lines:
Hello
2
The quick brown fox. The lazy blue whale.
Many text viewers and editors (and the browser you are reading this article on) will wrap a long line into multiple rows for better readability, but remember that it is still considered as a single line for regex matchers. For example, this entire paragraph is a line, although it may span multiple rows.
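Here is a minimal sketch of the two anchors in Python, using a string that holds three example lines separated by newline characters. The patterns are only illustrations.

```python
import re

# Three lines joined by newline characters
text = "Hello\n2\nThe quick brown fox. The lazy blue whale."

# Without a flag, ^ only matches at the very start of the string
first_only = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
# With re.MULTILINE, $ matches at the end of every line
line_ends = re.findall(r"\w+\.?$", text, re.MULTILINE)

print(first_only)   # ['Hello']
print(line_starts)  # ['Hello', '2', 'The']
print(line_ends)    # ['Hello', '2', 'whale.']
```

Note that the third line contributes only one match at each end, even though it contains two sentences.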
Suppose you have a text file that contains a list of all of your friends’ phone numbers, with each phone number on a different line. Each phone number is prefixed with an area code. Each phone number uses the format
where AAA is the 3-digit area code. Let’s say you want to call your friends living in Washington, D.C., which has an area code of 202. Write a Python program to extract the phone numbers of all your friends living in Washington, D.C. Remember, you only want phone numbers that start with 202. Examples of valid phone numbers:
Examples of invalid phone numbers:
Here is a text file containing random phone numbers that you can use to create and test your program. Note that if you use Python’s read function to read the text file, the entire content of the file will be in one string, which the regex matcher treats as a single “line” unless you tell it otherwise. If you aren’t able to solve it, view the solution here.
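As a hint about the read function, and not a solution to the exercise, here is a sketch of two common ways to work line by line on content that was read in one go. It uses io.StringIO with made-up data as a stand-in for an opened file, so the names and contents are assumptions for illustration only.

```python
import io
import re

# io.StringIO stands in for an opened text file (hypothetical contents)
fake_file = io.StringIO("apple pie\nbanana split\napricot jam\ngrape\n")
content = fake_file.read()  # the whole file as one string

# Without a flag, ^ and $ refer to the whole string, so nothing matches
without_flag = re.findall(r"^ap.*$", content)

# Option 1: let ^ and $ match at every line with re.MULTILINE
with_flag = re.findall(r"^ap.*$", content, re.MULTILINE)

# Option 2: split the content into lines and test each line separately
per_line = [line for line in content.splitlines() if re.match(r"ap", line)]

print(without_flag)  # []
print(with_flag)     # ['apple pie', 'apricot jam']
print(per_line)      # ['apple pie', 'apricot jam']
```

Either option works; which one reads better tends to depend on whether you need the surrounding lines for anything else.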
With the concepts we’ve covered in part 1 and 2, you should be able to create regular expressions for many basic to intermediate patterns. The next regex tutorial will be a little more advanced and cover quantifiers, alternation and grouping.
All the programs and files used in this tutorial can be found in this GitHub repository.
Here are some resources for further reading:
Python regex documentation
Raw string and regular expression in Python – Stack Overflow
Does Python re module support word boundaries (\b)?