(Level: Intermediate)

We’re back! In the previous regex tutorial, we covered character classes and anchors in some detail. We also explained the use of raw strings when defining regular expressions in Python. Today’s post discusses quantifiers in detail and introduces the ideas of alternation and grouping, which are explained by building our own URL regex. It also has several regex challenges based on the concepts covered so far that will test your regex-building skills. Let’s begin.

First, let’s clear up a doubt you might have about character sets. (I know I was a little unsure about this when I first started learning regex.) Consider the following regular expressions:

regex1 = r"[lazy]"
regex2 = r"lazy"

What’s the difference between regex1 and regex2? In regex1, the characters l, a, z and y are enclosed in a character set, whereas in regex2, the characters form a fixed string. What I want to draw attention to is the fact that a character set, like the one in regex1, by default matches only one character. By using a character set, we’re saying we want to match any one of the characters in this set, appearing once. So, the following code:

import re

regex = r"[lazy]"
text = "I have a lazy cat"

print(re.findall(regex, text))

will produce the following output:

['a', 'a', 'l', 'a', 'z', 'y', 'a']

In the sentence “I have a lazy cat”, every single character that belongs to the set [lazy] is matched on its own, in the order it appears in the text. Sets are unordered. This means that the following regular expressions are equivalent:

regex1 = r"[lazy]"
regex2 = r"[yzla]"
regex3 = r"[alyz]"
# and so on

However, the following code:

import re
regex = r"lazy" # without character set
text = "I have a lazy cat"

print(re.findall(regex, text))

will output:

['lazy']

The entire string “lazy” is matched exactly. None of “laz”, “azy”, or even “LAZY” would have been matched by this regex.
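As a quick check of that claim, here is a minimal sketch that searches a few candidate strings for the literal pattern. The candidate list and the `results` name are made up for illustration.

```python
import re

regex = r"lazy"

# Only text containing the exact, case-sensitive sequence "lazy" matches
candidates = ["laz", "azy", "LAZY", "so lazy"]
results = {text: bool(re.search(regex, text)) for text in candidates}

print(results)
```

Only the last candidate should come back as True, because it is the only one that contains all four characters in that order and in lowercase.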

You aren’t going to get very far with regular expressions if you aren’t making use of quantifiers. Quantifiers give you control over the number of times characters appear in your regex. We have already introduced the “+” quantifier, which matches one or more of the preceding character, group, set or class (We can call all of these tokens instead of listing them out separately). If we use the “+” quantifier after a character set, it means, “match the longest continuous string that has characters from the character set”.

regex = r"[abc123]+"

# the following strings will be matched completely by the above regex
string1 = "333333"
string2 = "cab"
string3 = "321bca"
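To see this in action, the sketch below checks the three strings with re.fullmatch (which succeeds only if the entire string matches) and then uses re.findall on a longer, made-up piece of text to show how “+” grabs each continuous run of set characters.

```python
import re

regex = r"[abc123]+"

# fullmatch succeeds only when the whole string is made of characters from the set
strings = ["333333", "cab", "321bca"]
full_matches = [bool(re.fullmatch(regex, s)) for s in strings]
print(full_matches)

# In a longer text, each maximal run of set characters is one match;
# "4", "x" and "y" are not in the set, so they split the runs
runs = re.findall(regex, "cab 42 x321bcay")
print(runs)
```

Note how the lone “2” still counts as a match: “+” means one or more, so a run of length one is enough.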

A character set by default matches only one character. When you apply a quantifier to it, you can match a sequence of characters instead of just one. However, simply adding a quantifier to a single character isn’t always powerful enough to match everything you need. There are many times when you would want to apply a quantifier to a group of characters. Let’s explore this idea by trying to create a simple regex from scratch to match website URLs.

URL Regex

Let’s say our requirement is to extract all .com URLs from a webpage without the resource path after .com (We just want https://mysite.com, not https://mysite.com/index.html). URLs may be written in many forms. They may or may not have http or https. They may or may not have “www”. They may have multiple subdomains, like in http://www.about.me.mysite.com. Let’s try to make a single regex to capture all of these.

Step 1 – Matching optional http/https://

First, to make things easy, assume that all the URLs we need begin with http or https. This is the perfect time to talk about grouping and alternation. Alternation simply means that we can specify multiple alternatives to match in a regex. It’s like an OR in programming and is denoted by a vertical bar (|). So our regex starts out as:

# match http or https
url_regex = r"http|https"
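One thing worth knowing about alternation in Python’s re module: alternatives are tried from left to right, and the first one that succeeds wins. The small sketch below (the `text` variable is just an illustration) shows that on its own, this regex stops at “http” even when the text says “https”. Once more pattern follows the alternation, as it will in the next steps, the engine backtracks and tries the second alternative, so the finished URL regex is not affected.

```python
import re

text = "https"

# "http" is tried first and succeeds, so the trailing "s" is left unmatched
shorter_first = re.findall(r"http|https", text)
print(shorter_first)

# Listing the longer alternative first matches the whole word
longer_first = re.findall(r"https|http", text)
print(longer_first)
```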

Now, let’s bring back the restriction that the http/https part may or may not be present. Remember the “?” quantifier? It matches 1 or 0 occurrences of the previous token. In other words, by using this quantifier, the preceding token becomes optional. The token may exist (1 occurrence) or may not exist (0 occurrences). We want a part of the regex to mean, “the URL may or may not begin with http/https”. You probably realize that we need to use the “?” quantifier to achieve this. But it needs to be applied to the entire regex so far. We therefore need to group the http|https part and apply the “?” on the entire group. Grouping is used to separate a part of a regex so you can apply a quantifier exclusively on that group or restrict alternation to that group. We use parentheses to create groups. Our regex now becomes:

# http(s) may or may not occur in a URL
url_regex = r"(http|https)?"

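However, whenever http or https appear in a URL, they are followed by :// (a colon and two forward slashes). So, we want :// to be captured only if the URL contains http or https. How do we write this? Assume once again that all the URLs we need begin with http or https. Remember that certain special characters need to be escaped with a backslash in regular expressions. Strictly speaking, the forward slash is not one of them in Python; it only needs escaping in regex flavours that delimit patterns with slashes, such as JavaScript regex literals. Escaping it is harmless, though, and keeps the pattern usable in such flavours and in online testers, so we escape it here. Our regex would then become: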

# match http or https followed by ://
url_regex = r"(http|https):\/\/"

Finally, this entire part may or may not occur, so we group it and make it optional:

# http(s):// may or may not occur in a URL
url_regex = r"((http|https):\/\/)?"
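Here is a small sketch of how the optional group behaves. It uses re.match, which only tries to match at the start of the string, on three sample URLs made up for illustration.

```python
import re

url_regex = r"((http|https):\/\/)?"

samples = ["https://mysite.com", "http://mysite.com", "mysite.com"]

# group(0) is the text matched by the whole regex
prefixes = [re.match(url_regex, s).group(0) for s in samples]
print(prefixes)
```

The last result is an empty string rather than a failure: because the whole group is optional, the regex is allowed to match nothing at all.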

Step 1.5 – Alternative way to write the regex

We can make our regex a bit smaller but mean the same thing with a clever use of the optional quantifier. Instead of using alternation to mean, “http or https”, we can use the optional quantifier to mean, “http followed by an optional ‘s’ character”. These are two ways of representing the same thing. Our regex changes to:

# these two are equivalent
url_regex_short = r"(https?:\/\/)?"
url_regex = r"((http|https):\/\/)?"
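If you want to convince yourself that the two forms match the same text, a rough check like the one below can help. The sample strings are invented, and agreeing on a handful of samples is of course not a proof.

```python
import re

url_regex_short = r"(https?:\/\/)?"
url_regex = r"((http|https):\/\/)?"

samples = ["http://a.com", "https://a.com", "a.com", "httpss://a.com"]

# Compare the overall match (group 0) produced by each regex
short_results = [re.match(url_regex_short, s).group(0) for s in samples]
long_results = [re.match(url_regex, s).group(0) for s in samples]

print(short_results)
print(short_results == long_results)
```

The one difference is in the groups: the longer version has an extra inner group for (http|https), so the group numbers shift if you add more groups after it.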

Step 2 – Matching optional www.

This is similar to step 1. “www.” may not appear in all URLs (https://wordpress.com), so we need to mark it as optional. “www.” can only appear after http/https, so we need to build the regex in that order:

# match optional http(s):// followed by optional www.
url_regex = r"(https?:\/\/)?(www\.)?"
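The sketch below runs this regex against a few made-up strings, and also shows why the dot after “www” is escaped: an unescaped dot matches any character, so it would happily accept “wwwx”.

```python
import re

url_regex = r"(https?:\/\/)?(www\.)?"

samples = ["https://www.mysite.com", "www.mysite.com", "https://wordpress.com", "wwwxmysite.com"]
prefixes = [re.match(url_regex, s).group(0) for s in samples]
print(prefixes)

# Without the backslash, "." matches any character, including the "x"
unescaped = re.match(r"(www.)?", "wwwxmysite.com").group(0)
print(unescaped)
```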

Step 3 – Matching subdomains and hostname

Consider the URL https://codelingo.wordpress.com. So far, our regex would capture only the https:// part of it. Now look at the groups in the following URL: https://www.abc.123.pqr-xyz.mysite1.com. In the abc.123.pqr-xyz.mysite1. part, each subdomain is a string of letters, numbers and even special characters like hyphens, followed by a full stop:

Some letters followed by a dot:
abc.

Some digits followed by a dot:
123.

Some letters and a hyphen followed by a dot:
pqr-xyz.

Some letters and a digit followed by a dot:
mysite1.

In general, each subdomain can be represented as:
“Some characters (letters, digits, hyphens or some combination of these) followed by a dot.”

And since there can be multiple levels of subdomains, the regex matching a single subdomain should be followed by a “+” to allow for multiple subdomains.

So, our regex needs to have a part that means, “match any of such characters occurring at least once followed by a full stop, and such a group itself can occur multiple times”. The new regex is:

# match optional http(s) and optional www.
# followed by at least one group with the following pattern:
# letters, numbers, underscores and hyphens appearing at least once, followed by a full stop
url_regex = r"(https?:\/\/)?(www\.)?([\w\-]+\.)+"

We’re almost done! For the URL https://codelingo.wordpress.com, this regex will match everything up to and including the last dot: https://codelingo.wordpress.
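As a sanity check, this sketch applies the regex from step 3 to the two example URLs used in this step and prints the part that gets matched.

```python
import re

url_regex = r"(https?:\/\/)?(www\.)?([\w\-]+\.)+"

samples = [
    "https://codelingo.wordpress.com",
    "https://www.abc.123.pqr-xyz.mysite1.com",
]

# The repeated group consumes every "label." pair, but not the final "com",
# because "com" is not followed by a dot
matched = [re.match(url_regex, s).group(0) for s in samples]
print(matched)
```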

Step 4 – Matching compulsory com at the end of the URL

All that’s left is to match the “com” at the end of URLs. Not “.com”, but “com”, because the “.” preceding the “com” will have already been matched by step 3. This is because the “.” preceding “com” is the same dot as the “.” after the main domain name. Our regex becomes:

# match optional http(s) and optional www.
# followed by at least one group with the following pattern:
# letters, numbers, underscores and hyphens appearing at least once, followed by a full stop
# followed by a com
url_regex = r"(https?:\/\/)?(www\.)?([\w\-]+\.)+com"

And that’s it! We could add a word boundary anchor (\b) at the end of the regex to avoid capturing false positives like “i.was.in.a.coma” (yes, I know that’s a very contrived example). Without the anchor, only the i.was.in.a.com part of that string gets matched, which makes it look like i.was.in.a.com is a URL. In fact, there are many things we can add to a regex to make it more restrictive or mould it to our requirements, but this one does a rather satisfactory job already. Regexes can be written in many different ways and this is just one of the possibilities for a URL regex.
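Here is a minimal sketch of the finished regex at work on a made-up sentence, with and without the trailing \b. One Python-specific caveat: when a pattern contains groups, re.findall returns the contents of the groups instead of the whole match, so the sketch uses re.finditer and group(0) to get the full URLs.

```python
import re

url_regex = r"(https?:\/\/)?(www\.)?([\w\-]+\.)+com"
text = ("Visit https://mysite.com/index.html or www.about.me.mysite.com, "
        "but i.was.in.a.coma is not a URL.")

# group(0) is the whole match, regardless of how many groups the pattern has
urls = [m.group(0) for m in re.finditer(url_regex, text)]
print(urls)

# Adding \b rejects "com" when it is immediately followed by another word character
bounded_regex = url_regex + r"\b"
bounded_urls = [m.group(0) for m in re.finditer(bounded_regex, text)]
print(bounded_urls)

# findall gives one tuple of group contents per match, not the matched URL
groups_only = re.findall(url_regex, "https://mysite.com")
print(groups_only)
```

If you prefer re.findall, one option is to turn the groups into non-capturing ones by writing (?:...) instead of (...).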

Exercises

The following is a list of challenges that you can try to build a regex for. Remember that a challenge can have several solutions.

1) Words containing the letter ‘e’ (or any other letter):

# words containing the letter 'e'
regex = r"\b[a-zA-Z]*[eE][a-zA-Z]*\b"
# try it here: http://regexr.com/3fmj6
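A quick way to try this solution locally is sketched below, using a made-up sentence. Since the pattern has no groups, re.findall returns the matched words directly.

```python
import re

regex = r"\b[a-zA-Z]*[eE][a-zA-Z]*\b"
text = "Every regex needs a quick test"

# "a" and "quick" contain no "e", so they are skipped
words_with_e = re.findall(regex, text)
print(words_with_e)
```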

2) Words beginning with a consonant and ending with a vowel:

# words beginning with a consonant and ending with a vowel
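# an explicit consonant class keeps digits and punctuation from counting as the first letter
regex = r"\b[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z][a-zA-Z]*[aeiouAEIOU]\b"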
# try it here: http://regexr.com/3fmj3

3) Positive and negative decimal numbers:

# positive and negative decimal numbers
# thousands separators group exactly three digits, hence {3} rather than {3,}
regex = r"-?(\d+(,\d{3})*(\.\d+)?|\.\d+)"
# try it here: http://regexr.com/3fmji

4) Names of video files (mp4/flv/mov/wmv) from a list of files:

# names of video files from a list of files
# the escaped dot and trailing \b make sure the extension really is an extension
regex = r"\w.*\.(mp4|flv|mov|wmv)\b"
# try it here: http://regexr.com/3fmjr

5) Hex colour codes in a CSS file (they take the form #XXX or #XXXXXX, where X is a hexadecimal digit):

# hex colour codes in a CSS file
regex = r"#[a-fA-F0-9]{3}([a-fA-F0-9]{3})?"
# try it here: http://regexr.com/3fmkd
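To try this last solution in Python, here is a small sketch with an invented CSS snippet. Because the pattern contains a group, it uses re.finditer and group(0) to get the full colour codes instead of just the contents of the optional group.

```python
import re

regex = r"#[a-fA-F0-9]{3}([a-fA-F0-9]{3})?"
css = "a { color: #fff; background: #1A2b3C; border-color: #ggg; }"

# "#ggg" is skipped because "g" is not a hexadecimal digit
colours = [m.group(0) for m in re.finditer(regex, css)]
print(colours)
```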