import pandas as pd
import regex as re
RegEx#
RegEx, short for regular expressions, is a powerful tool for searching through large bodies of text. With RegEx, you specify the pattern you are looking for and get back all the substrings that match it. This subchapter goes over how to use RegEx in Python.
As always, this chapter is not comprehensive and not a good substitute for plenty of practice! There are a ton of resources for learning RegEx, some of them are: -
-
Be sure to choose the Python flavour under the category on the left.
RegEx Syntax#
RegEx patterns are made up of literal characters and special characters. Literal characters match exactly the characters they represent. For example, in the code below, the pattern Economics matches the string Economics literally and returns a match.
Note: re.search() takes in a pattern and a string and returns a Match object for the first match of the pattern in the string, or None if there is no match. This is discussed further in the Using Regex in Python section.
re.search('Economics','Economics')
<regex.Match object; span=(0, 9), match='Economics'>
As the characters are literal, the pattern Economics does not match the string Economica. You can see nothing is returned, meaning no match was found.
re.search('Economics','Economica')
If we wanted to also match Economica, we could modify our pattern to be Economic(s|a). Here | is the ‘or’ operator, indicating that either of the options is acceptable. Since the parentheses surround s|a, the choice applies only to that final letter: the pattern will match Economics or Economica, but it will not match other strings such as just a or just Economic.
re.search('Economic(s|a)','Economica')
<regex.Match object; span=(0, 9), match='Economica'>
re.search('Economic(s|a)','Economics')
<regex.Match object; span=(0, 9), match='Economics'>
re.search('Economic(s|a)','a')
re.search('Economic(s|a)','Economic')
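The four calls above can be checked in one go. The sketch below uses Python's built-in `re` module instead of `regex` (an assumption made so it runs without extra installs; the two behave the same for this pattern), and the `candidates` list and `results` dictionary are names made up for illustration.

```python
import re

pattern = r"Economic(s|a)"
candidates = ["Economics", "Economica", "Economic", "a"]

# Record, for each candidate string, whether the pattern is found in it
results = {text: re.search(pattern, text) is not None for text in candidates}
print(results)
```

Only the first two strings contain a match, because the group requires exactly one of s or a directly after Economic.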
Quantifiers#
There are several ways of telling RegEx how many of a character we want. They are summarized below:
Operation | Description | Syntax Example | Matches | Doesn’t Match |
---|---|---|---|---|
* | Matches the preceding character 0 or more times | eric* | eri | er |
+ | Matches the preceding character 1 or more times | eric+ | eric | eri |
? | Matches the preceding character 0 or 1 times | eric? | eri, eric | Everything else |
{m} | Matches the preceding character exactly m times | eric{2} | ericc | Everything else |
{m,n} | Matches the preceding character between m and n times (no space after the comma) | eric{1,2} | eric, ericc | Everything else |
Something important to note in the above table is that each of these operators applies only to the single preceding character. So, eric* will match all of eri and ericc, but it will not match all of eriicc or erii. If we also wanted to match eriicc or erii, we could use character classes.
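The table can be verified with a short sketch. It assumes the built-in `re` module and uses `re.fullmatch()`, which succeeds only when the entire string matches the pattern; this is the sense in which the table says "matches". The `strings` list is made up for illustration.

```python
import re

# Hypothetical test strings; re.fullmatch requires the whole string to match
strings = ["er", "eri", "eric", "ericc", "erii"]
patterns = [r"eric*", r"eric+", r"eric?", r"eric{2}", r"eric{1,2}"]

# For each pattern, collect the strings it matches in full
full_matches = {
    p: [s for s in strings if re.fullmatch(p, s)]
    for p in patterns
}
for p, matched in full_matches.items():
    print(p, "->", matched)
```

Note that `re.search()` is more permissive: it would report a match for eric* inside erii, because the substring eri matches.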
Greediness#
RegEx quantifiers are greedy, meaning they look for the longest possible match in a string. For example, given the string ab, the RegEx pattern .* could match a, b, ab, or nothing at all. However, because it wants the longest match possible, it will choose to match ab.
print(re.search(r".*","ab"))
<regex.Match object; span=(0, 2), match='ab'>
If we instead wanted to match as little as possible when using quantifiers, we can simply add ? after the quantifier. For example, given the string ab, the RegEx pattern .*? matches nothing at all.
print(re.search(r".*?","ab"))
<regex.Match object; span=(0, 0), match=''>
Here’s another example of greedy vs non-greedy matching:
print(re.search(r"a{1,2}","aa")) # greedy matching returns aa
print(re.search(r"a{1,2}?","aa")) # non-greedy matching returns a
<regex.Match object; span=(0, 2), match='aa'>
<regex.Match object; span=(0, 1), match='a'>
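The difference matters most when a delimiter appears more than once. The sketch below is an illustration under assumptions: it uses the built-in `re` module, and the `html` string is made up.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up sample string

# Greedy: .* runs as far as it can, so the match ends at the last '>' in the string
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed
lazy = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns each tag separately.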
Character Classes#
RegEx character classes help us match any one character from a set of characters. They are usually written as a list of characters surrounded by square brackets ([]). For example, er[ic]* will match eri, ericc, eriicc and erii.
print(re.search(r"er[ic]*","eri"))
print(re.search(r"er[ic]*","ericc"))
print(re.search(r"er[ic]*","erii"))
print(re.search(r"er[ic]*","eriicc"))
<regex.Match object; span=(0, 3), match='eri'>
<regex.Match object; span=(0, 5), match='ericc'>
<regex.Match object; span=(0, 4), match='erii'>
<regex.Match object; span=(0, 6), match='eriicc'>
You can also use character classes to denote characters you do not wish to match. If you put ^ at the front of a character class, it becomes a negated character class, and matches any character except the ones listed. For example, the pattern consonant: [^aeiou] will match any string of the form consonant: followed by a character that is not a lowercase vowel. Note that this includes non-letters such as $, as shown below.
print(re.search(r"consonant: [^aeiou]","consonant: a"))
print(re.search(r"consonant: [^aeiou]","consonant: b"))
print(re.search(r"consonant: [^aeiou]","consonant: d"))
print(re.search(r"consonant: [^aeiou]","consonant: u"))
print(re.search(r"consonant: [^aeiou]","consonant: $"))
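None
<regex.Match object; span=(0, 12), match='consonant: b'>
<regex.Match object; span=(0, 12), match='consonant: d'>
None
<regex.Match object; span=(0, 12), match='consonant: $'>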
Character Class | Description | Matches | Doesn’t Match |
---|---|---|---|
[a-z] | All lowercase letters | a | A |
[A-Z] | All uppercase letters | A | a |
[0-9] | Digits | 0 | a |
\w | Word characters (letters + digits + underscore) | a | $ |
\d | Digits | 0 | a |
\s | All whitespace characters, including tabs and new lines | space | A |
. | All characters except newline | space | newline |
\W, \D, and \S are the complements (match the exact opposite things) of \w, \d, and \s respectively.
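A brief sketch of the shorthand classes and their complements, assuming the built-in `re` module; the `sample` string is made up for illustration.

```python
import re

sample = "Room 42, Floor_3!"  # made-up sample string

digits = re.findall(r"\d", sample)        # every digit character
non_word = re.findall(r"\W", sample)      # everything that is NOT a letter, digit, or underscore
uppercase = re.findall(r"[A-Z]", sample)  # every uppercase letter

print(digits)
print(non_word)
print(uppercase)
```

The underscore in Floor_3 does not appear in `non_word`, since \w counts it as a word character.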
Positions#
When discussing character classes, we mentioned how ^ can be used to negate a character class. However, ^ actually has two uses. When ^ appears outside a character class (that is, not immediately after an opening [), it instead represents the position at the beginning of a string. For example, the pattern ^a will find a match in all but not in ball.
print(re.search(r"^a","all"))
print(re.search(r"^a","ball"))
<regex.Match object; span=(0, 1), match='a'>
None
Similarly, $ denotes the end of a string. For example, the pattern s$ will find a match in prawns but not in prawn.
print(re.search(r"s$","prawns"))
print(re.search(r"s$","prawn"))
<regex.Match object; span=(5, 6), match='s'>
None
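The two anchors can also be combined so that a pattern must cover the entire string. A minimal sketch, assuming the built-in `re` module; the word lists are made up for illustration.

```python
import re

words = ["prawns", "prawn", "spawns", "all", "ball"]

starts_with_a = [w for w in words if re.search(r"^a", w)]
ends_with_s = [w for w in words if re.search(r"s$", w)]

# With both anchors, the pattern has to account for every character in the string
only_lowercase = [w for w in ["all", "Ball", "a11"] if re.search(r"^[a-z]+$", w)]

print(starts_with_a, ends_with_s, only_lowercase)
```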
Escape#
The escape character \ is used to take special characters and turn them into literal ones. For example, the character . will match anything except a newline, but \. will only match . and nothing else.
print(re.search(r".","%"))
print(re.search(r"\.","."))
print(re.search(r"\.","%"))
<regex.Match object; span=(0, 1), match='%'>
<regex.Match object; span=(0, 1), match='.'>
None
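Escaping comes up often with characters like $ and ., which appear in ordinary text such as prices. The sketch below assumes the built-in `re` module and a made-up `price_text` string; `re.escape()` is a standard-library helper that adds the backslashes for you.

```python
import re

price_text = "Total: $3.50 (was $4.00)"  # made-up sample string

# Unescaped, '$' is the end-of-string anchor and '.' matches any character,
# so both are escaped here to match them literally
prices = re.findall(r"\$\d+\.\d+", price_text)

# re.escape builds an escaped pattern from a literal string
escaped = re.escape("3.50")
found = re.search(escaped, price_text)

print(prices)
print(escaped, found.group())
```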
Using Regex in Python#
Regex Library Functions#
The regex library has several functions that are helpful for working with Python strings (Python’s built-in re module provides the same ones). Until now, we’ve been working with re.search(). re.search() takes in a RegEx pattern and a Python string, and returns the first match of the pattern within the string as a regex.Match object, or None if there is no match. regex.Match objects have 4 helpful methods:
- .start(): Returns the index where the match begins
- .end(): Returns the index just past the last matched character
- .span(): Returns the (start, end) indices of the match as a tuple
- .group(): Returns the matched text itself
All of these are demonstrated below.
samp_match = re.search(r"[^a].*","abcd") # Don't match the a, but match everything after the a
samp_match
<regex.Match object; span=(1, 4), match='bcd'>
samp_match.start()
1
samp_match.end()
4
samp_match.span()
(1, 4)
samp_match.group()
'bcd'
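How these four methods relate to each other can be sketched as follows, assuming the built-in `re` module, whose Match objects expose the same methods. The variable names are made up for illustration.

```python
import re

text = "abcd"
m = re.search(r"[^a].*", text)

start, end = m.span()       # span() is the pair (start(), end())
recovered = text[start:end] # slicing the original string with the span reproduces group()
length = end - start        # number of characters in the match

print(start, end, recovered, length)
```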
The other Python RegEx functions we will discuss return either lists or strings. They are described below.
Function | Description | Syntax Example | Returns | Example Usage | Example Output |
---|---|---|---|---|---|
re.search() | Searches for the first occurrence of a pattern anywhere in the string | re.search(pattern, string) | Match object or None | | Match object: <regex.Match object; span=(10, 12), match='25'> |
re.findall() | Finds all occurrences of the pattern in the string | re.findall(pattern, string) | List of matches or empty list | | |
re.sub() | Substitutes occurrences of a pattern with a replacement string | re.sub(pattern, repl, string) | Modified string | | |
re.split() | Splits a string based on a specified pattern | re.split(pattern, string) | List of substrings | | |
As always, this list is not comprehensive and you are encouraged to read the documentation for more details.
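To make the four functions concrete, here is a minimal sketch. It assumes the built-in `re` module and reuses the apples-and-oranges sentence from this chapter; the patterns and variable names are chosen for illustration and are not necessarily the ones from the original table.

```python
import re

sentence = "There are 25 apples and 30 oranges"

first = re.search(r"\d+", sentence)         # Match object for the first number, or None
all_numbers = re.findall(r"\d+", sentence)  # every non-overlapping match, as a list
masked = re.sub(r"\d+", "N", sentence)      # replace each match with the string "N"
parts = re.split(r"\s+and\s+", sentence)    # split wherever the pattern matches

print(first)
print(all_numbers)
print(masked)
print(parts)
```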
Capturing Groups#
Using re.findall() allows us to easily extract information from our string. Capturing groups can help us extract information further.
For example, let us say we want to extract the numbers from the statement There are 25 apples and 30 oranges. We can design a pattern (such as [a-zA-Z\s]*\d+[a-zA-Z\s]*) which matches each number together with its surrounding text, but then how do we extract just the numbers?
re.findall(r'[a-zA-Z\s]*\d+[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')
['There are 25 apples and ', '30 oranges']
Well, if we surround a portion of the pattern with parentheses, we capture that portion, and re.findall() returns only the captured portion in its final list.
re.findall(r'[a-zA-Z\s]*(\d+)[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')
['25', '30']
This is tricky because earlier we introduced parentheses as a way of specifying which letters we want to choose between.
re.search('Economic(s|a)','Economica')
<regex.Match object; span=(0, 9), match='Economica'>
This brings us to a key point: parentheses always create a capturing group. When a pattern contains a group, re.findall() returns only the captured text rather than the whole match, whereas the Match object returned by re.search() still displays the whole match. This is something to be careful of; when you need to choose between characters but do not want to capture anything, you can often rewrite the pattern without parentheses. For example, you could transform the pattern Economic(s|a) to Economic[sa] and then use re.findall() to get the same whole-word result as the re.search() call above.
re.findall('Economic[sa]','Economica')
['Economica']
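The same distinction can be seen from both functions. This sketch assumes the built-in `re` module; the input strings are made up, and `(?:...)` is the standard non-capturing group syntax, which groups without capturing.

```python
import re

m = re.search(r"Economic(s|a)", "Economica")
whole = m.group(0)   # the entire match
suffix = m.group(1)  # only what the first pair of parentheses captured

# With a capturing group, findall returns just the captured text
captured = re.findall(r"Economic(s|a)", "Economics and Economica")

# With a non-capturing group, findall returns the whole matches
uncaptured = re.findall(r"Economic(?:s|a)", "Economics and Economica")

print(whole, suffix)
print(captured)
print(uncaptured)
```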
text = "My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)
['123-45-6789', '321-45-6789']
Integration with Pandas String Methods#
RegEx patterns work very well with pandas string methods. For example, the RegEx pattern ssn_pattern below matches all strings that follow the pattern of an American social security number (namely: 3 digits, dash, 2 digits, dash, 4 digits). Let us imagine you’ve surveyed a bunch of (American) people asking them for their social security number. You’ve stored all their responses in the DataFrame ssn_data below. You can use str.findall() to extract all the potential social security numbers.
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# SSN Pattern: 3 digits, dash, 2 digits, dash, 4 digits
data = {
"SSN": ["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210", "111-11-1111",
"I refuse to provide", "456-78-9012", "abc-12-3456", "Nice try, FBI agent","777-77-7777",
"seventy-seven", "987-65-4321 please or 876-54-3210", "222-22-2222", "classified information",
"135-79-2468", "xyz-98-7654", "I only give my SSN to aliens","555-55-5555", "fifty-five",
"345-67-8901 not sure or 789-01-2345", "333-33-3333", "I'm a robot", "678-90-1234",
"aaa-11-2222", "Ask my cat for the SSN", "444-44-4444", "forty-four",
"234-56-7890 maybe or 876-54-3210", "444-44-4444", "top secret", "789-01-2345",
"bbb-22-3333", "I only speak in emojis 🤖", "666-66-6666", "sixty-six",
"432-10-9876 not applicable or 765-43-2109", "555-55-5555", "I'm a ghost", "012-34-5678",
"ccc-33-4444", "SSN? What's that?", "111-11-1111", "eleven",
"543-21-0987 avoid or 890-12-3456", "666-66-6666", "I'm on the run", "345-67-8901",
"ddd-44-5555", "My SSN is a secret code", "999-99-9999", "ninety-nine",
"654-32-1098 dodge or 210-98-7654", "777-77-7777", "I'm a spy", "567-89-0123",
"eee-55-6666", "I'll give you my SSN if you beat me in a dance-off", "123-45-6789",
"twelve", "765-43-2109 sidestep or 098-76-5432", "888-88-8888", "I'm an alien",
"678-90-1234", "fff-66-7777", "SSN? I speak in riddles", "234-56-7890", "twenty-four",
"876-54-3210 escape or 765-43-2109", "999-99-9999", "I'm a time traveler", "789-01-2345",
"ggg-77-8888", "My SSN is a password", "345-67-8901", "thirty-four",
"987-65-4321 run or 876-54-3210", "000-00-0000", "I'm a superhero",
"012-34-5678", "hhh-88-9999", "You'll never guess my SSN"]
}
ssn_data = pd.DataFrame(data)
ssn_data
SSN | |
---|---|
0 | 000-00-0000 |
1 | error |
2 | 876-54-3210 invalid or 210-54-3210 |
3 | 111-11-1111 |
4 | I refuse to provide |
... | ... |
75 | 000-00-0000 |
76 | I'm a superhero |
77 | 012-34-5678 |
78 | hhh-88-9999 |
79 | You'll never guess my SSN |
80 rows × 1 columns
Applying the str.findall() method:
ssn_data["SSN"].str.findall(ssn_pattern)
0 [000-00-0000]
1 []
2 [876-54-3210, 210-54-3210]
3 [111-11-1111]
4 []
...
75 [000-00-0000]
76 []
77 [012-34-5678]
78 []
79 []
Name: SSN, Length: 80, dtype: object
If you were just interested in the digits of the social security numbers, you could use capturing groups combined with .str.extract(), as shown below.
capturing_ssn_pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
ssn_data["SSN"].str.extract(capturing_ssn_pattern)
# Note: this just extracts the first matching pattern
0 | 1 | 2 | |
---|---|---|---|
0 | 000 | 00 | 0000 |
1 | NaN | NaN | NaN |
2 | 876 | 54 | 3210 |
3 | 111 | 11 | 1111 |
4 | NaN | NaN | NaN |
... | ... | ... | ... |
75 | 000 | 00 | 0000 |
76 | NaN | NaN | NaN |
77 | 012 | 34 | 5678 |
78 | NaN | NaN | NaN |
79 | NaN | NaN | NaN |
80 rows × 3 columns
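Since .str.extract() keeps only the first match in each row, one possible way to recover every match is pandas' .str.extractall(). The sketch below is an illustration under assumptions: `responses` is a small made-up sample in the same shape as ssn_data, and the group names area, group, and serial are labels chosen here for readability.

```python
import pandas as pd

# Small made-up sample in the same shape as ssn_data
responses = pd.DataFrame(
    {"SSN": ["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210"]}
)

# Named groups become the column names of the result
named_ssn_pattern = r"(?P<area>[0-9]{3})-(?P<group>[0-9]{2})-(?P<serial>[0-9]{4})"

# One row per match, indexed by (original row label, match number within that row)
all_parts = responses["SSN"].str.extractall(named_ssn_pattern)
print(all_parts)
```

Rows with no match (such as "error") simply do not appear in the result, and the row containing two numbers contributes two entries.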