import pandas as pd
import regex as re

RegEx#

RegEx, standing for regular expressions, are a powerful tool used to search through large bodies of text. Using RegEx, you can essentially specify patterns you are looking for and get all the substrings which match that pattern. This subchapter will go over how to use RegEx in Python.

As always, this chapter is not comprehensive and is not a substitute for plenty of practice! There are plenty of resources available for learning RegEx.

RegEx Syntax#

RegEx patterns are made up of literal characters and special characters. Literal characters are literally the characters they represent. For example, in the code below, the characters Economics match the string Economics literally and return a match.

Note: re.search() takes in a pattern and a string and returns a match object if the pattern matches anywhere in the string, or None otherwise. This is discussed further in the Using Regex in Python section.

re.search('Economics','Economics')
<regex.Match object; span=(0, 9), match='Economics'>

As the characters are literal, the characters Economics do not match the string Economica. You can see nothing is returned, meaning no match was found.

re.search('Economics','Economica')

If we wanted to also match Economica, we could modify our pattern to be Economic(s|a). | is the ‘or’ operator here, indicating that either of the options is acceptable. Since the parentheses surround s|a, the pattern will only match Economics or Economica; it will not match other strings like just a or just Economic.

re.search('Economic(s|a)','Economica')
<regex.Match object; span=(0, 9), match='Economica'>
re.search('Economic(s|a)','Economics')
<regex.Match object; span=(0, 9), match='Economics'>
re.search('Economic(s|a)','a')
re.search('Economic(s|a)','Economic')
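To see why the parentheses matter, here is a minimal sketch comparing the grouped pattern with an ungrouped variant. It assumes the standard-library re module (the calls used here should behave the same way in regex), and the variable names are made up for illustration.

```python
import re  # standard-library module, used here as an assumption

# Without parentheses, | splits the whole pattern: "Economics" OR "a".
ungrouped = re.search(r"Economics|a", "a")

# With parentheses, | only chooses between "s" and "a" after "Economic".
grouped = re.search(r"Economic(s|a)", "a")

print(ungrouped is not None)  # True: the lone "a" satisfies the second alternative
print(grouped is not None)    # False: "Economic" is still required
```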

Quantifiers#

There are several ways of telling RegEx how many of a character we want. They are summarized below:

| Operation | Description | Syntax Example | Matches | Doesn’t Match |
|---|---|---|---|---|
| * | Matches the preceding character 0 or more times | eric* | eri, eric, ericc | er, ers, Eric |
| + | Matches the preceding character 1 or more times | eric+ | eric, ericc | eri, er, Eric |
| ? | Matches the preceding character 0 or 1 times | eric? | eri, eric | Everything else |
| {m} | Matches the preceding character exactly m times | eric{2} | ericc | Everything else |
| {m,n} | Matches the preceding character between m and n times (inclusive; written without a space after the comma) | eric{1,2} | eric, ericc | Everything else |

Something important to note in the above table is that each of these operators applies only to the preceding character. So, eric* will match eri and ericc in full, but not all of eriicc or erii. If we also wanted to match eriicc or erii, we could use character classes.
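The quantifiers can be checked side by side with a short sketch. It assumes the standard-library re module and uses re.fullmatch(), which only succeeds when the pattern covers the entire string; fully_matches is a hypothetical helper written for this illustration.

```python
import re

# Hypothetical helper: True when the pattern matches the *entire* string.
def fully_matches(pattern, text):
    return re.fullmatch(pattern, text) is not None

candidates = ["er", "eri", "eric", "ericc", "erii"]
patterns = [r"eric*", r"eric+", r"eric?", r"eric{2}", r"eric{1,2}"]

# For each pattern, collect the candidate strings it matches in full.
results = {p: [s for s in candidates if fully_matches(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```

Using re.fullmatch() rather than re.search() keeps the comparison strict: re.search() would also report a match for erii, because it finds eri inside it.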

Greediness#

RegEx operations are greedy, meaning they look for the longest possible match in a string. For example, given the string ab, the RegEx pattern .* could match a, b, ab, or nothing at all. However, because it wants the longest match possible, it will choose to match ab.

print(re.search(r".*","ab"))
<regex.Match object; span=(0, 2), match='ab'>

If we instead wanted to match as little as possible when using quantifiers, we can add ? after the quantifier. For example, given the string ab, the RegEx pattern .*? matches nothing at all.

print(re.search(r".*?","ab"))
<regex.Match object; span=(0, 0), match=''>

Here’s another example of greedy vs non-greedy matching:

print(re.search(r"a{1,2}","aa")) # greedy matching returns aa
print(re.search(r"a{1,2}?","aa")) # non-greedy matching returns a
<regex.Match object; span=(0, 2), match='aa'>
<regex.Match object; span=(0, 1), match='a'>
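Greediness matters most when a pattern has a clear start and end marker. The sketch below is an illustration on a made-up string, assuming the standard-library re module; it pulls out angle-bracket tags with a greedy and a non-greedy quantifier.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs from the first "<" all the way to the last ">".
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? stops at the first ">" it can, so each tag is separate.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```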

Character Classes#

RegEx character classes help us match any one character from a list of characters. They are usually written as a list of characters surrounded by square brackets ([]). For example, er[ic]* will match eri, ericc, eriicc and erii.

print(re.search(r"er[ic]*","eri"))
print(re.search(r"er[ic]*","ericc"))
print(re.search(r"er[ic]*","erii"))
print(re.search(r"er[ic]*","eriicc"))
<regex.Match object; span=(0, 3), match='eri'>
<regex.Match object; span=(0, 5), match='ericc'>
<regex.Match object; span=(0, 4), match='erii'>
<regex.Match object; span=(0, 6), match='eriicc'>

You can also use character classes to denote characters you do not wish to match. If you put ^ at the front of a character class, it becomes a character class negation, and matches any character except the ones listed in the class. For example, the pattern consonant: [^aeiou] will match the text consonant: followed by any single character that is not a lowercase vowel. This includes consonants, but also symbols such as $, as shown below.

print(re.search(r"consonant: [^aeiou]","consonant: a"))
print(re.search(r"consonant: [^aeiou]","consonant: b"))
print(re.search(r"consonant: [^aeiou]","consonant: d"))
print(re.search(r"consonant: [^aeiou]","consonant: u"))
print(re.search(r"consonant: [^aeiou]","consonant: $"))
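None
<regex.Match object; span=(0, 12), match='consonant: b'>
<regex.Match object; span=(0, 12), match='consonant: d'>
None
<regex.Match object; span=(0, 12), match='consonant: $'>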

| Character Class | Description | Matches | Doesn’t Match |
|---|---|---|---|
| [a-z] | All lowercase letters | a, m, z | A, 1, $ |
| [A-Z] | All uppercase letters | A, M, Z | a, 1, $ |
| [0-9] | Digits | 0, 5, 9 | a, A, $ |
| \w | Word characters (letters + digits + underscore) | a, 5, _ | $, %, - |
| \d | Digits | 0, 7, 9 | a, A, $ |
| \s | All whitespace characters, including tabs and new lines | space, tab, newline | A, 1, $ |
| . | All characters except newline | space, a, $ | newline |

\W, \D, and \S are the complements (match the exact opposite things) of \w, \d, and \s respectively.
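As a rough illustration of the shorthand classes, the sketch below runs a few of them over a made-up sentence with re.findall(). It assumes the standard-library re module; the sample text and variable names are invented for this example.

```python
import re

text = "Room 42_B costs $7.50\tper night"

digits = re.findall(r"\d", text)       # every single digit
words = re.findall(r"\w+", text)       # runs of letters, digits, and underscores
chunks = re.findall(r"\S+", text)      # runs of non-whitespace (complement of \s)
capitals = re.findall(r"[A-Z]", text)  # uppercase letters only

print(digits)    # ['4', '2', '7', '5', '0']
print(words)     # ['Room', '42_B', 'costs', '7', '50', 'per', 'night']
print(chunks)    # ['Room', '42_B', 'costs', '$7.50', 'per', 'night']
print(capitals)  # ['R', 'B']
```

Note how \w+ splits $7.50 into 7 and 50, because $ and . are not word characters, while \S+ keeps it in one piece.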

Positions#

When discussing character classes, we mentioned how ^ can be used to negate a character class. However, ^ actually has two uses. If ^ appears outside of square brackets, it instead represents the position at the beginning of a string. For example, the pattern ^a will find a match in all but not in ball.

print(re.search(r"^a","all"))
print(re.search(r"^a","ball"))
<regex.Match object; span=(0, 1), match='a'>
None

Similarly, $ denotes the end of a string. For example, the pattern s$ will contain a match in prawns but not in prawn.

print(re.search(r"s$","prawns"))
print(re.search(r"s$","prawn"))
<regex.Match object; span=(5, 6), match='s'>
None
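Combining both anchors is a common way to require that the whole string fits a pattern. The following is a minimal sketch, assuming the standard-library re module and a few made-up sample strings.

```python
import re

samples = ["123", "a123b", "1234"]

# Unanchored: succeeds if three digits appear anywhere in the string.
found_anywhere = [s for s in samples if re.search(r"\d{3}", s)]

# Anchored on both sides: the three digits must be the entire string.
found_whole = [s for s in samples if re.search(r"^\d{3}$", s)]

print(found_anywhere)  # ['123', 'a123b', '1234']
print(found_whole)     # ['123']
```

One caveat: in Python, $ also matches just before a single trailing newline, so ^\d{3}$ would accept "123\n" as well.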

Escape#

The escape character \ is used to take special characters and turn them into literal ones. For example, the character . will match anything except a newline, but \. will only match . and nothing else.

print(re.search(r".","%"))
print(re.search(r"\.","."))
print(re.search(r"\.","%"))
<regex.Match object; span=(0, 1), match='%'>
<regex.Match object; span=(0, 1), match='.'>
None
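Escaping becomes important as soon as the text you are looking for contains special characters itself. The sketch below assumes the standard-library re module and a made-up price string. It also uses re.escape(), which builds an escaped pattern from a literal string so you do not have to add the backslashes by hand.

```python
import re

price = "Total: $3.50 (approx.)"

# Unescaped: "$" means end-of-string and "." means any character,
# so this pattern can never match the literal text "$3.50".
unescaped = re.search(r"$3.50", price)

# Escaped by hand: every special character is turned into a literal.
escaped = re.search(r"\$3\.50", price)

# Escaped automatically from the literal text we want to find.
auto = re.search(re.escape("$3.50"), price)

print(unescaped)        # None
print(escaped.group())  # $3.50
print(auto.group())     # $3.50
```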

Using Regex in Python#

Regex Library Functions#

The regex library of Python has several functions that are helpful for working with Python strings. Until now, we’ve been working with re.search(). re.search() takes in a RegEx pattern and a Python string, and returns the first match of the pattern within the string. It returns the match as a regex.Match object (or None if there is no match). regex.Match objects have four helpful methods:

  • .start(): Returns the index where the match begins

  • .end(): Returns the index just past the last character of the match

  • .span(): Returns the tuple (start, end) of those two indices

  • .group(): Returns the match itself.

All of these are demonstrated below.

samp_match = re.search(r"[^a].*","abcd") # Don't match the a, but match everything after the a
samp_match
<regex.Match object; span=(1, 4), match='bcd'>
samp_match.start()
1
samp_match.end()
4
samp_match.span()
(1, 4)
samp_match.group()
'bcd'
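Because re.search() returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. A minimal sketch of a safer usage pattern follows; it assumes the standard-library re module, and the sample strings are made up.

```python
import re

text = "There are 25 apples"
m = re.search(r"\d+", text)

# Check for None before using the match object's methods.
if m is not None:
    start, end = m.span()
    print(m.group(), start, end)          # 25 10 12
    print(text[start:end] == m.group())   # True: the span slices out the match

missing = re.search(r"\d+", "no digits here")
print(missing)  # None, so missing.group() would raise AttributeError
```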

The other Python RegEx functions we will discuss return either lists or strings. They are described below.

| Function | Description | Syntax Example | Returns | Example Usage | Example Output |
|---|---|---|---|---|---|
| re.search() | Searches for the first occurrence of a pattern anywhere in the string | re.search(pattern, string) | Match object or None | re.search(r'\d+', 'There are 25 apples and 30 oranges') | Match object: <regex.Match object; span=(10, 12), match='25'> |
| re.findall() | Finds all occurrences of the pattern in the string | re.findall(pattern, string) | List of matches or empty list | re.findall(r'\d+', 'There are 25 apples and 30 oranges') | List of matches: ['25', '30'] |
| re.sub() | Substitutes occurrences of a pattern with a replacement string | re.sub(pattern, replacement, string) | Modified string | re.sub(r'\d+', 'X', 'There are 25 apples and 30 oranges') | Modified string: 'There are X apples and X oranges' |
| re.split() | Splits a string based on a specified pattern | re.split(pattern, string) | List of substrings | re.split(r'\d+', 'There are 25 apples and 30 oranges') | List of substrings: ['There are ', ' apples and ', ' oranges'] |

As always, this list is not comprehensive and you are encouraged to read the documentation for more details.
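The functions above can be combined in small ways that go slightly beyond the table. The sketch below assumes the standard-library re module. It also uses two features not covered above: re.finditer(), which yields one match object per occurrence, and passing a function as the replacement argument of re.sub().

```python
import re

text = "There are 25 apples and 30 oranges"

# re.findall() returns strings, so convert before doing arithmetic.
total = sum(int(n) for n in re.findall(r"\d+", text))

# re.finditer() yields match objects, which also carry positions.
spans = [m.span() for m in re.finditer(r"\d+", text)]

# re.sub() accepts a function that computes each replacement from its match.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(total)    # 55
print(spans)    # [(10, 12), (24, 26)]
print(doubled)  # There are 50 apples and 60 oranges
```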

Capturing Groups#

Using re.findall() allows us to easily extract information from our string. Capturing groups can help us extract information further.

For example, let us say we want to extract the numbers from the statement There are 25 apples and 30 oranges. We can design a pattern (such as [a-zA-Z\s]*\d+[a-zA-Z\s]*) which isolates the numbers, but then how do we extract them?

re.findall(r'[a-zA-Z\s]*\d+[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')
['There are 25 apples and ', '30 oranges']

Well, if we surround a portion of the pattern with parentheses, we can capture that portion and extract only that portion into our final list.

re.findall(r'[a-zA-Z\s]*(\d+)[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')
['25', '30']

This is tricky because earlier, we introduced parentheses as a way of grouping the alternatives we want to choose between.

re.search('Economic(s|a)','Economica')
<regex.Match object; span=(0, 9), match='Economica'>

This brings us to a key point: parentheses always do two jobs at once. They group part of a pattern together (so that | or a quantifier applies to the whole group), and they capture whatever that group matched. The difference lies in what each function returns. re.search() returns a match object that displays the entire match, so the capture is easy to overlook. re.findall(), on the other hand, returns only the captured groups whenever the pattern contains any. This is something to be careful of; you can often rework a pattern to avoid parentheses where they are not necessary. For example, you could transform the pattern Economic(s|a) to Economic[sa] and then use re.findall() to recover the same full match as the re.search() call above.

re.findall('Economic[sa]','Economica')
['Economica']
text = "My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  
['123-45-6789', '321-45-6789']
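The interaction between parentheses and the two functions can be sketched as follows. This assumes the standard-library re module. The (?:...) syntax, which is not introduced above, is a non-capturing group: it groups without capturing, which is another way to keep re.findall() returning full matches.

```python
import re

word = "Economica"

# re.search() captures too: group(0) is the whole match, group(1) the first group.
m = re.search(r"Economic(s|a)", word)
whole, captured = m.group(0), m.group(1)

# re.findall() returns only the captured group when the pattern has one.
only_group = re.findall(r"Economic(s|a)", word)

# A non-capturing group keeps the grouping but returns the full match.
full_match = re.findall(r"Economic(?:s|a)", word)

# With several capturing groups, re.findall() returns one tuple per match.
ssn_parts = re.findall(r"(\d{3})-(\d{2})-(\d{4})", "123-45-6789 or 321-45-6789")

print(whole, captured)  # Economica a
print(only_group)       # ['a']
print(full_match)       # ['Economica']
print(ssn_parts)        # [('123', '45', '6789'), ('321', '45', '6789')]
```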

Integration with Pandas String Methods#

RegEx patterns work very well with pandas string methods. For example, the RegEx pattern ssn_pattern below matches all strings that follow the pattern of an American social security number (namely: 3 digits, dash, 2 digits, dash, 4 digits). Let us imagine you’ve surveyed a bunch of (American) people asking them for their social security number. You’ve stored all their responses in the DataFrame ssn_data below. You can use str.findall() to extract all the potential social security numbers.

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# SSN Pattern: 3 digits, dash, 2 digits, dash, 4 digits
data = {
    "SSN": ["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210", "111-11-1111", 
            "I refuse to provide", "456-78-9012", "abc-12-3456", "Nice try, FBI agent","777-77-7777", 
    "seventy-seven", "987-65-4321 please or 876-54-3210", "222-22-2222", "classified information", 
            "135-79-2468", "xyz-98-7654", "I only give my SSN to aliens","555-55-5555", "fifty-five", 
            "345-67-8901 not sure or 789-01-2345", "333-33-3333", "I'm a robot", "678-90-1234", 
            "aaa-11-2222", "Ask my cat for the SSN", "444-44-4444", "forty-four", 
            "234-56-7890 maybe or 876-54-3210", "444-44-4444", "top secret", "789-01-2345", 
            "bbb-22-3333", "I only speak in emojis 🤖", "666-66-6666", "sixty-six", 
            "432-10-9876 not applicable or 765-43-2109", "555-55-5555", "I'm a ghost", "012-34-5678", 
            "ccc-33-4444", "SSN? What's that?", "111-11-1111", "eleven", 
            "543-21-0987 avoid or 890-12-3456", "666-66-6666", "I'm on the run", "345-67-8901", 
            "ddd-44-5555", "My SSN is a secret code", "999-99-9999", "ninety-nine", 
            "654-32-1098 dodge or 210-98-7654", "777-77-7777", "I'm a spy", "567-89-0123", 
            "eee-55-6666", "I'll give you my SSN if you beat me in a dance-off", "123-45-6789", 
            "twelve", "765-43-2109 sidestep or 098-76-5432", "888-88-8888", "I'm an alien", 
            "678-90-1234", "fff-66-7777", "SSN? I speak in riddles", "234-56-7890", "twenty-four", 
            "876-54-3210 escape or 765-43-2109", "999-99-9999", "I'm a time traveler", "789-01-2345", 
            "ggg-77-8888", "My SSN is a password", "345-67-8901", "thirty-four", 
            "987-65-4321 run or 876-54-3210", "000-00-0000", "I'm a superhero", 
            "012-34-5678", "hhh-88-9999", "You'll never guess my SSN"]
}
ssn_data = pd.DataFrame(data)
ssn_data
SSN
0 000-00-0000
1 error
2 876-54-3210 invalid or 210-54-3210
3 111-11-1111
4 I refuse to provide
... ...
75 000-00-0000
76 I'm a superhero
77 012-34-5678
78 hhh-88-9999
79 You'll never guess my SSN

80 rows × 1 columns

Applying str.findall() to the column:

ssn_data["SSN"].str.findall(ssn_pattern)
0                  [000-00-0000]
1                             []
2     [876-54-3210, 210-54-3210]
3                  [111-11-1111]
4                             []
                 ...
75                 [000-00-0000]
76                            []
77                 [012-34-5678]
78                            []
79                            []
Name: SSN, Length: 80, dtype: object

If you were just interested in the digits of the social security numbers, you could use capturing groups combined with .str.extract(), as shown below.

capturing_ssn_pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
ssn_data["SSN"].str.extract(capturing_ssn_pattern) 
# Note: this just extracts the first matching pattern
0 1 2
0 000 00 0000
1 NaN NaN NaN
2 876 54 3210
3 111 11 1111
4 NaN NaN NaN
... ... ... ...
75 000 00 0000
76 NaN NaN NaN
77 012 34 5678
78 NaN NaN NaN
79 NaN NaN NaN

80 rows × 3 columns
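Since .str.extract() only keeps the first match in each row, responses that contain two numbers lose the second one. The sketch below shows two related pandas string methods on a small made-up Series (not the full ssn_data): .str.contains() to flag rows that contain a match at all, and .str.extractall() to return every match, one row per match.

```python
import pandas as pd

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
capturing_ssn_pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

# A small illustrative sample, not the survey data used above.
responses = pd.Series(["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210"])

# Boolean mask: does each response contain something shaped like an SSN?
has_ssn = responses.str.contains(ssn_pattern)

# One row per match; the index pairs the original row with the match number.
all_parts = responses.str.extractall(capturing_ssn_pattern)

print(has_ssn.tolist())  # [True, False, True]
print(all_parts)
```

The result of .str.extractall() has a two-level index, so the response at position 2 contributes two rows, one for each number it contains.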