(Level: Beginner)

This is our first post in which we’ll really get our hands dirty with some coding. Today’s concept is an extremely useful one – regular expressions or regex. Once you get started with regex, there’s no turning back. An extremely powerful concept, it can be used to do things like – batch renaming of files, checking whether a given bit of text is a valid phone number, scraping useful information from a webpage, correcting a mistake you made repeatedly in a file (or tens, hundreds, even thousands of files at a time), and MUCH more.

But what is it?
If you’re already familiar with what regular expressions are, you can skip this section. It just aims to provide a conceptual overview. We will post advanced tutorials soon.

Let’s say you have a large text document that describes the history of the automotive industry, and you want to find all occurrences of the word “Ford” in it. Simple enough right? Just use the Find feature of your document viewer and type in “Ford”. But now suppose you want to make a timeline that lists the important milestones of the industry, and you start by looking for all the years in the document. Use the Find feature of your browser and try to do this now. You’ll see the problem. How are you supposed to list out all the years? What do you even search for? It would be stupid to search for years at random, like “1908” or “1922”, when you don’t even know whether they exist in the document.

There seems to be a fundamental difference between looking for “Ford” and looking for years in the document. In the first case, you’re looking for a specific string – “Ford”. In the second case, you’re not looking for a specific string, but a class of strings. We can classify strings representing years because they all have something in common – each is a sequence of digits. “1908” is a sequence of digits. So is “1922”. So is “981723913890212”, but that’s still a long way into the future. The point is that these strings have a pattern associated with them, and patterns are the basis of regex.

A regex pattern can be as generic or as specific as you like. In fact, we will see that the string “Ford” is a regular expression as well, but a very specific one. The pattern is a string starting with an “F”, followed by an “o”, followed by an “r” and ending with a “d”. Let us formulate a pattern to extract years from the document. We know that years contain 4 digits (at least the years that are relevant to the automotive industry). So, we could define a pattern like this: digit, digit, digit, digit – that is, 4 digits in a row. A digit in the decimal number system is 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Therefore, our regex would capture any 4 digit number in the document from 0000 to 9999. This would work, but we might get more than what we want. Let us restrict our search to years in the 20th century. Our regex would then be more specific – “1”, followed by “9”, followed by any digit, followed by any digit. This would match any year between 1900 and 1999. You see how we can make our regex more specific? We can make it more specific by adding fixed strings and characters (like “1” and “9”), and more generic by adding classes like digit, lowercase letter and uppercase letter.
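As a rough preview, here is how those three patterns might look in Python. It uses the \d shorthand for "any digit", which is explained in the Regex Syntax section. The sample sentence and the variable names are made up for illustration.

```python
import re

# Made-up sample text, for illustration only
sample = ("Ford launched the Model T in 1908. Benz patented his car in 1886, "
          "and 2008 marked the centenary of the Model T.")

# Generic: digit, digit, digit, digit
any_year = re.findall(r"\d\d\d\d", sample)

# More specific: "1", then "9", then any two digits
years_1900s = re.findall(r"19\d\d", sample)

# Most specific: a fixed string is a regex too
ford = re.findall(r"Ford", sample)

print(any_year)     # ['1908', '1886', '2008']
print(years_1900s)  # ['1908']
print(ford)         # ['Ford']
```

The more fixed characters the pattern contains, the fewer strings it matches.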

Some regular expressions you can create, in increasing order of specificity:

 Numbers (most generic) -> Integers -> 4 digit positive integers -> Years -> 20th century years -> 1908 (most specific)

So finally, what is a regular expression? Let’s not get into official definitions – we’re not here to write an exam. You can think of a regular expression as a pattern of characters.

Regex Syntax
Now we’ll learn how to actually write a usable regular expression. Some of the common character classes are:

  • Digits, written as [0-9], or simply \d
  • Lowercase letters, written as [a-z]. If you only wanted to match the characters ‘g’ to ‘x’, you would use [g-x]. The hyphen denotes a range. The square brackets enclose a set of any type of character. For example, [aceg-x0-46-9] matches the character ‘a’, or ‘c’, or ‘e’, or any character between ‘g’ and ‘x’, or any digit between ‘0’ and ‘4’, or any digit between ‘6’ and ‘9’.
  • Uppercase letters, written as [A-Z].
  • A word character is a lowercase character, uppercase character, digit, or underscore. It is written as [a-zA-Z0-9_], or \w for short.
  • Whitespace (spaces, tabs and newlines), denoted by \s.
  • Any character at all (except, by default in most regex flavours, a newline), denoted by a full stop (.)
  • And more, which we’ll cover soon.
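To see a few of these classes in action, here is a small sketch using Python's re module (introduced properly in the Coding section). The sample string and variable names are invented for this example; each pattern matches one character at a time.

```python
import re

# Invented sample string
sample = "Model T_1908 cost $850"

digits = re.findall(r"[0-9]", sample)      # same result as \d here
uppercase = re.findall(r"[A-Z]", sample)
g_to_x = re.findall(r"[g-x]", sample)      # lowercase 'g' through 'x' only
word_chars = re.findall(r"\w", sample)     # letters, digits, underscore
spaces = re.findall(r"\s", sample)

print(digits)            # ['1', '9', '0', '8', '8', '5', '0']
print(uppercase)         # ['M', 'T']
print(g_to_x)            # ['o', 'l', 'o', 's', 't']
print("".join(word_chars))  # ModelT_1908cost850
print(len(spaces))       # 3
```

Note that the dollar sign is missing from the \w results, because it is neither a letter, a digit, nor an underscore.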

Quantifiers
There are certain symbols that can be used to indicate how many times a pattern must be matched. \d matches only a single digit character. What if we wanted to match 4 digits in a row? Or a sequence of 3 lowercase letters followed by any number of digits?

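  • “+” matches one or more occurrences of the preceding token
  • “*” matches any number of occurrences (including zero) of the preceding token
  • {m,n} matches between m and n occurrences of the preceding token. Write it without a space after the comma; in Python, for example, {m, n} is not treated as a quantifier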
  • {n} matches exactly n occurrences of the preceding token
  • {m,} matches m or more occurrences of the preceding token
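The two questions above (4 digits in a row, and 3 lowercase letters followed by any number of digits) can be answered with these quantifiers. The following is a minimal sketch in Python; the sample text and variable names are made up.

```python
import re

# Made-up sample text
text = "abc123 ab1 xyz 2024"

four_digits = re.findall(r"\d{4}", text)            # exactly 4 digits in a row
letters_then_digits = re.findall(r"[a-z]{3}\d*", text)  # 3 letters, then 0 or more digits
one_or_more = re.findall(r"\d+", text)              # runs of 1 or more digits
two_to_three = re.findall(r"\d{2,3}", text)         # runs of 2 to 3 digits
two_or_more = re.findall(r"\d{2,}", text)           # runs of 2 or more digits

print(four_digits)          # ['2024']
print(letters_then_digits)  # ['abc123', 'xyz']
print(one_or_more)          # ['123', '1', '2024']
print(two_to_three)         # ['123', '202']
print(two_or_more)          # ['123', '2024']
```

Notice that \d{2,3} matches only "202" out of "2024": a quantifier takes as many characters as it is allowed, and the leftover "4" is too short to form another match.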

What do I do with this?
You can now search for patterns instead of just fixed strings. You can, for example:

  • find email addresses in a document
  • find hyphenated phrases, like “state-of-the-art”
  • find the content of a paragraph tag in an HTML document
  • find links in a webpage
  • in an astronomical document, find names of pulsars (a type of star) following the standard pulsar nomenclature, i.e. “PSR” followed by the pulsar’s right ascension and degrees of declination (e.g. PSR 0531+21).
  • find strings starting with a lowercase letter, having the next two characters as digits, followed by any number of special characters except commas, forward slashes, dollar signs, brackets, and percentage symbols, followed by 5 to 8 word characters, but only if they appear at the end of sentences that do not begin with a digit, and only if such strings are found in .txt files in all directories on your computer that have the word “log” in them (although why you would want this is beyond me).
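Two of the simpler items in this list can be sketched in Python as follows. These patterns are illustrative assumptions, not complete solutions: the pulsar pattern only covers names shaped like the example above (four digits, a sign, two digits), and the hyphenated-phrase pattern uses a grouping construct, (?:...), that this article does not cover. The sample sentence is invented.

```python
import re

# Invented sample sentence
text = ("The Crab Pulsar (PSR 0531+21) is a state-of-the-art "
        "test case for well-known models.")

# "PSR", a space, 4 digits, a plus or minus sign, 2 digits.
# Inside square brackets, + is a literal plus and a trailing - is a literal hyphen.
pulsars = re.findall(r"PSR \d{4}[+-]\d{2}", text)

# One or more letters, then one or more repetitions of "-letters".
# (?:...) groups tokens so the trailing + applies to the whole group.
hyphenated = re.findall(r"[a-zA-Z]+(?:-[a-zA-Z]+)+", text)

print(pulsars)     # ['PSR 0531+21']
print(hyphenated)  # ['state-of-the-art', 'well-known']
```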

In short, using regular expressions, you can look for any text that follows some pattern. Sometimes the patterns are obvious – like in a telephone number or an email address – while others are less conspicuous. At first glance, the string “oqiwdasads” doesn’t seem to follow any pattern, but it actually follows several: a sequence of lowercase letters of any length, a sequence of 10 lowercase letters, a sequence of ANY character appearing any number of times, a sequence of non-whitespace characters appearing 10 times, and so on.
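A quick way to check this claim is to test the whole string against each pattern. The sketch below uses Python's re.fullmatch(), which succeeds only if the entire string fits the pattern. The \S shorthand (any non-whitespace character) is not covered elsewhere in this article, and the dictionary layout is just one way to organise the check.

```python
import re

s = "oqiwdasads"

# Description of each pattern mapped to the regex that expresses it
patterns = {
    "lowercase letters, any length": r"[a-z]+",
    "exactly 10 lowercase letters": r"[a-z]{10}",
    "any character, any number of times": r".*",
    "10 non-whitespace characters": r"\S{10}",
}

# fullmatch returns a match object on success and None on failure
results = {name: re.fullmatch(rx, s) is not None for name, rx in patterns.items()}

for name, matched in results.items():
    print(name, "->", matched)   # every line ends with True
```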

Coding
Regular expressions do not belong to any programming language in particular. Rather, the concept of regular expressions is implemented in different languages. Most languages handle regular expressions the same way, but sometimes there are differences. Often, a regex pattern that you write in your code will be usable in a different programming language without having to change anything. We recommend playing around on Regexr first before writing a program.

This is by no means a comprehensive guide to regular expressions. This article barely scratches the surface, and is intended to introduce regular expressions to beginners. We will soon release more advanced tutorials.

To use regex in Python programs, import the module “re”. Regular expression matching in Python is a vast subject and deserves its own post. But here’s a simple program to get started.

import re

# regex to search for years: four digits in a row
# (a raw string, so Python leaves the backslash in \d alone)
regex = r"\d{4}"
text = "In 1828, Anyos Jedlik, a Hungarian who invented an early type of electric motor, created a tiny model car powered by his new motor. In 1838, Scotsman Robert Davidson built an electric locomotive that attained a speed of 4 miles per hour (6 km/h)."

print(re.findall(regex, text))

The regular expression variable “regex” is \d{4}, which matches 4 digits in a row. The variable “text” contains the text on which we need to perform our search. The re.findall() function searches the text for all occurrences of regex, and returns a list of search results. The output of this program is:

['1828', '1838']

Note that regex searches return strings. ‘1828’ is not an integer but a string.
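If you need numbers rather than strings, you can convert each result yourself. Here is a minimal sketch, using a shortened version of the text above; the variable names are only for illustration.

```python
import re

# Shortened version of the sample text
text = ("In 1828, Anyos Jedlik created a tiny model car. "
        "In 1838, Robert Davidson built an electric locomotive.")

# findall returns strings, so convert each match with int()
years = [int(match) for match in re.findall(r"\d{4}", text)]

print(years)                # [1828, 1838]
print(years[1] - years[0])  # 10
```

Once the years are integers, you can sort them, subtract them, or compare them like any other numbers.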

Let’s try another program. Search for all the capitalized words (like Hungarian and Robert) in the same sentence as before. Think of the pattern of these words and try to formulate the regex.
