RegEx

RegEx#

RegEx, short for regular expressions, is a powerful tool for searching through large bodies of text. Using RegEx, you can specify a pattern you are looking for and get all the substrings that match it. This subchapter will go over how to use RegEx in Python.

As always, this chapter is not comprehensive and not a good substitute for plenty of practice! There are a ton of resources for learning RegEx; some of them are:

Official Regex Guide
UC Berkeley’s Data 100’s Reference Sheet
Regex101.com (be sure to choose the Python flavour under the category on the left)

RegEx Syntax#

RegEx patterns are made up of literal characters and special characters. Literal characters are literally the characters they represent. For example, in the code below, the characters Economics match the string Economics literally and return a match.

Note: re.search() takes in a pattern and a string and returns a match object if the pattern has a match in the string (and None otherwise). This is discussed further in the Using Regex in Python section.

re.search('Economics','Economics')

<regex.Match object; span=(0, 9), match='Economics'>

As the characters are literal, the characters Economics do not match the string Economica. You can see nothing is returned, meaning no match was found.

re.search('Economics','Economica')

If we wanted to also match Economica, we could modify our pattern to be Economic(s|a). | is the ‘or’ operator here, indicating that either of the options will be acceptable. Since the parentheses surround s|a, the pattern will only match Economics or Economica; it will not match other strings like just a or just Economic.

re.search('Economic(s|a)','Economica')

<regex.Match object; span=(0, 9), match='Economica'>

re.search('Economic(s|a)','Economics')

<regex.Match object; span=(0, 9), match='Economics'>

re.search('Economic(s|a)','a')

re.search('Economic(s|a)','Economic')
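The four calls above can be collected into one small loop. This is a minimal sketch that assumes the standard-library re module has been imported; the names candidates and alternation_results are made up for illustration.

```python
import re

pattern = r"Economic(s|a)"
candidates = ["Economics", "Economica", "a", "Economic"]

# Map each candidate to the matched text, or None when there is no match.
alternation_results = {}
for candidate in candidates:
    match = re.search(pattern, candidate)
    alternation_results[candidate] = match.group() if match else None

for candidate, matched in alternation_results.items():
    print(f"{candidate!r} -> {matched!r}")
```

Only the first two candidates produce a match, because the pattern requires the full prefix Economic followed by either s or a.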

Quantifiers#

There are several ways of telling RegEx how many of a character we want. They are summarized below:

Operation	Description	Syntax Example	Matches	Doesn’t Match
`*`	Matches the preceding character 0 or more times	eric*	eri eric ericc	er ers Eric
`+`	Matches the preceding character 1 or more times	eric+	eric ericc	eri er Eric
`?`	Matches the preceding character 0 or 1 times	eric?	eri eric	Everything else
`{m}`	Matches the preceding character exactly m times	eric{2}	ericc	Everything else
`{m,n}`	Matches the preceding character between m and n times (no space after the comma)	eric{1,2}	eric ericc	Everything else

Something important to note in the above table is that each of these operators applies only to the single preceding character. So, eric* will fully match eri and ericc, but not eriicc or erii, since only the c may repeat. If we also wanted to match eriicc or erii, we could use character classes.
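One way to check the quantifier table is to test each pattern against a few candidate strings. The sketch below assumes the standard-library re module and uses re.fullmatch() (which only succeeds when the entire string matches) so that the results line up with the table; the variable names are illustrative only.

```python
import re

candidates = ["eri", "eric", "ericc", "ericcc"]
patterns = [r"eric*", r"eric+", r"eric?", r"eric{2}", r"eric{1,2}"]

# For every pattern, keep the candidates that match in their entirety.
quantifier_results = {}
for pattern in patterns:
    quantifier_results[pattern] = [
        s for s in candidates if re.fullmatch(pattern, s)
    ]
    print(pattern, "->", quantifier_results[pattern])
```

re.search() would be more permissive here, since it also succeeds when only part of the string matches.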

Greediness#

RegEx quantifiers are greedy by default, meaning they match as much of the string as they can. For example, given the string ab, the RegEx pattern .* could match a, b, ab, or nothing at all. However, because it wants the longest match possible, it will choose to match ab.

print(re.search(r".*","ab"))

<regex.Match object; span=(0, 2), match='ab'>

If we instead want to match as little as possible when using quantifiers, we can add ? after the quantifier to make it non-greedy. For example, given the string ab, the RegEx pattern .*? matches nothing at all.

print(re.search(r".*?","ab"))

<regex.Match object; span=(0, 0), match=''>

Here’s another example of greedy vs non-greedy matching:

print(re.search(r"a{1,2}","aa")) # greedy matching returns aa
print(re.search(r"a{1,2}?","aa")) # non-greedy matching returns a

<regex.Match object; span=(0, 2), match='aa'>
<regex.Match object; span=(0, 1), match='a'>
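The difference between greedy and non-greedy matching becomes more visible on longer text. The following sketch assumes the standard-library re module and a made-up string containing simple tags; it is meant as an illustration rather than a recommended way to parse HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last '>' in the string, so there is one long match.
greedy_tags = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' it can, so each tag is separate.
lazy_tags = re.findall(r"<.*?>", html)

print(greedy_tags)
print(lazy_tags)
```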

Character Classes#

RegEx character classes help us match any one character from a list of characters. They are often denoted by a list of characters surrounded by square brackets ([]). For example, er[ic]* will match eri, ericc, eriicc and erii.

print(re.search(r"er[ic]*","eri"))
print(re.search(r"er[ic]*","ericc"))
print(re.search(r"er[ic]*","erii"))
print(re.search(r"er[ic]*","eriicc"))

<regex.Match object; span=(0, 3), match='eri'>
<regex.Match object; span=(0, 5), match='ericc'>
<regex.Match object; span=(0, 4), match='erii'>
<regex.Match object; span=(0, 6), match='eriicc'>

You can also use character classes to denote characters you do not wish to match. If you put ^ at the front of a character class, it becomes a character class negation, and matches anything except the characters in the character class. For example, the pattern consonant: [^aeiou] will match the text consonant: followed by any single character that is not a lowercase vowel. Note that this includes non-letters such as $, as shown below.

print(re.search(r"consonant: [^aeiou]","consonant: a"))
print(re.search(r"consonant: [^aeiou]","consonant: b"))
print(re.search(r"consonant: [^aeiou]","consonant: d"))
print(re.search(r"consonant: [^aeiou]","consonant: u"))
print(re.search(r"consonant: [^aeiou]","consonant: $"))

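None
<regex.Match object; span=(0, 12), match='consonant: b'>
<regex.Match object; span=(0, 12), match='consonant: d'>
None
<regex.Match object; span=(0, 12), match='consonant: $'>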

Character Class	Description	Matches	Doesn’t Match
`[a-z]`	All lowercase letters	a m z	A 1 $
`[A-Z]`	All uppercase letters	A M Z	a 1 $
`[0-9]`	Digits	0 5 9	a A $
`\w`	Word characters (letters + digits + underscore)	a 5 _	$ % -
`\d`	Digits	0 7 9	a A $
`\s`	All whitespace characters, including tabs and new lines	space tab newline	A 1 $
`.`	All characters except newline	space a $	newline

\W, \D, and \S are the complements (match the exact opposite things) of \w, \d, and \s respectively.
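The classes in the table can be tried out on a short sample string. This is a small sketch assuming the standard-library re module; the sample text and variable names are invented for the example.

```python
import re

text = "Room 42, floor 7!"

digit_runs = re.findall(r"\d+", text)        # runs of digits
word_runs = re.findall(r"\w+", text)         # runs of word characters
whitespace = re.findall(r"\s", text)         # individual whitespace characters
non_word = re.findall(r"\W", text)           # complement of \w
lowercase_runs = re.findall(r"[a-z]+", text) # runs of lowercase letters
uppercase = re.findall(r"[A-Z]", text)       # individual uppercase letters

print(digit_runs, word_runs, whitespace, non_word, lowercase_runs, uppercase)
```

Note how [a-z]+ returns oom rather than Room, since the uppercase R falls outside the class.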

Positions#

When discussing character classes, we mentioned how ^ can be used to negate a character class. However, ^ actually has two uses. If ^ appears outside square brackets (rather than immediately after an opening [), it instead represents the position at the beginning of a string. For example, the pattern ^a will find a match in all but not in ball.

print(re.search(r"^a","all"))
print(re.search(r"^a","ball"))

<regex.Match object; span=(0, 1), match='a'>
None

Similarly, $ denotes the end of a string. For example, the pattern s$ will contain a match in prawns but not in prawn.

print(re.search(r"s$","prawns"))
print(re.search(r"s$","prawn"))

<regex.Match object; span=(5, 6), match='s'>
None
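Using ^ and $ together is a common way to require that the whole string fits a pattern, rather than just some part of it. The sketch below assumes the standard-library re module; the sample strings and the names anchored_results and unanchored_results are hypothetical.

```python
import re

unanchored = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
anchored = r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$"

samples = ["123-45-6789", "id 123-45-6789", "123-45-67890"]

# Without anchors, the pattern may match anywhere inside the string.
unanchored_results = [bool(re.search(unanchored, s)) for s in samples]

# With ^ and $, the pattern must cover the string from start to end.
anchored_results = [bool(re.search(anchored, s)) for s in samples]

print(unanchored_results)
print(anchored_results)
```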

Escape#

The escape character \ is used to take special characters and turn them into literal ones. For example, the character . will match anything except a newline, but \. will only match . and nothing else.

print(re.search(r".","%"))
print(re.search(r"\.","."))
print(re.search(r"\.","%"))

<regex.Match object; span=(0, 1), match='%'>
<regex.Match object; span=(0, 1), match='.'>
None
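Escaping matters most when the text you are searching for contains special characters itself. Here is a short sketch, assuming the standard-library re module and an invented price string, that contrasts an unescaped and an escaped dot. It also shows re.escape(), which adds the backslashes for you.

```python
import re

price_text = "Total: $3.50 (was $4x50)"

# Unescaped '.' matches any character, so '$4x50' is picked up too.
loose_prices = re.findall(r"\$\d.\d+", price_text)

# Escaped '\.' only matches a literal dot.
strict_prices = re.findall(r"\$\d\.\d+", price_text)

# re.escape() backslash-escapes the special characters in a literal string.
escaped_literal = re.escape("3.50")

print(loose_prices)
print(strict_prices)
print(escaped_literal)
```

The dollar sign is escaped in both patterns because $ is otherwise the end-of-string position.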

Using Regex in Python#

Regex Library Functions#

Python's regex library has several functions that are helpful for working with Python strings. Until now, we’ve been working with re.search(). re.search() takes in a RegEx pattern and a Python string, and returns the first match of the pattern within the string (or None if there is no match). It returns the match as a regex.Match object. regex.Match objects have four helpful methods:

.start(): Returns the index where the match begins
.end(): Returns the index just past the last matched character
.span(): Returns the (start, end) indices of the match as a tuple
.group(): Returns the matched text itself.

All of these are demonstrated below.

samp_match = re.search(r"[^a].*","abcd") # Don't match the a, but match everything after the a
samp_match

<regex.Match object; span=(1, 4), match='bcd'>

samp_match.start()

1

samp_match.end()

4

samp_match.span()

(1, 4)

samp_match.group()

'bcd'
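Because re.search() returns None when nothing matches, calling these methods directly on the result can raise an AttributeError. A minimal sketch of a guard, assuming the standard-library re module (describe_match is a hypothetical helper, not part of the library):

```python
import re

def describe_match(pattern, text):
    """Return the match details as a dict, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return {
        "start": match.start(),
        "end": match.end(),
        "span": match.span(),
        "group": match.group(),
    }

info = describe_match(r"[^a].*", "abcd")
missing = describe_match(r"\d", "abcd")

print(info)
print(missing)

# Slicing the original string with start and end recovers the matched text.
sliced = "abcd"[info["start"]:info["end"]]
print(sliced)
```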

The other Python RegEx functions we will discuss return either lists or strings. They are described below.

Function	Description	Syntax Example	Returns	Example Usage	Example Output
`re.search()`	Searches for the first occurrence of a pattern anywhere in the string	`re.search(pattern, string)`	Match object or `None`	`re.search(r'\d+', 'There are 25 apples and 30 oranges')`	Match object: <regex.Match object; span=(10, 12), match='25'>
`re.findall()`	Finds all occurrences of the pattern in the string	`re.findall(pattern, string)`	List of matches or empty list	`re.findall(r'\d+', 'There are 25 apples and 30 oranges')`	List of matches: `['25', '30']`
`re.sub()`	Substitutes occurrences of a pattern with a replacement string	`re.sub(pattern, replacement, string)`	Modified string	`re.sub(r'\d+', 'X', 'There are 25 apples and 30 oranges')`	Modified string: `'There are X apples and X oranges'`
`re.split()`	Splits a string based on a specified pattern	`re.split(pattern, string)`	List of substrings	`re.split(r'\d+', 'There are 25 apples and 30 oranges')`	List of substrings: `['There are ', ' apples and ', ' oranges']`

As always, this list is not comprehensive and you are encouraged to read the documentation for more details.
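The four functions in the table can be run side by side on the same sentence. This sketch assumes the standard-library re module; the last call additionally illustrates that re.sub() accepts a function as the replacement, which goes slightly beyond the table.

```python
import re

sentence = "There are 25 apples and 30 oranges"

first_number = re.search(r"\d+", sentence).group()  # first match only
all_numbers = re.findall(r"\d+", sentence)          # every match, as a list
masked = re.sub(r"\d+", "X", sentence)              # replace every match
pieces = re.split(r"\d+", sentence)                 # split on every match

# The replacement can also be a function that receives each Match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), sentence)

print(first_number, all_numbers, masked, pieces, doubled, sep="\n")
```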

Capturing Groups#

Using re.findall() allows us to easily extract information from our string. Capturing groups can help us extract information further.

For example, let us say we want to extract the numbers from the statement There are 25 apples and 30 oranges. We can design a pattern (such as [a-zA-Z\s]*\d+[a-zA-Z\s]*) which matches each number along with the text around it, but then how do we extract just the numbers?

re.findall(r'[a-zA-Z\s]*\d+[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')

['There are 25 apples and ', '30 oranges']

Well, if we surround a portion of the pattern with parentheses, we can capture that portion and extract only that portion into our final list.

re.findall(r'[a-zA-Z\s]*(\d+)[a-zA-Z\s]*', 'There are 25 apples and 30 oranges')

['25', '30']

This is tricky because earlier we introduced parentheses as a way of specifying which letters we want to choose between.

re.search('Economic(s|a)','Economica')

<regex.Match object; span=(0, 9), match='Economica'>

This brings us to a key point: parentheses always group part of a pattern together, and they also capture whatever that part matched. The difference lies in what each function returns. re.search() returns a Match object describing the whole match, so the parentheses only affect which letters are chosen between. re.findall(), however, returns only the captured portions whenever the pattern contains capturing groups. This is something to be careful of; where parentheses are not necessary, you can usually rework the pattern to avoid them. For example, you could transform the pattern Economic(s|a) to Economic[sa] and then use re.findall() to accomplish the same result as the re.search() call above.

re.findall('Economic[sa]','Economica')

['Economica']

text = "My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  

['123-45-6789', '321-45-6789']
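To make the effect of parentheses on re.findall() concrete, the sketch below compares a capturing group, a non-capturing group written as (?:...), and a character class. It assumes the standard-library re module; the sample strings are made up for the example.

```python
import re

text = "Economics and Economica"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"Economic(s|a)", text)

# Non-capturing group (?:...): groups without capturing, so findall
# returns the whole match.
non_capturing = re.findall(r"Economic(?:s|a)", text)

# Character class: no parentheses needed at all.
char_class = re.findall(r"Economic[sa]", text)

# With several capturing groups, findall returns one tuple per match.
ssn_parts = re.findall(
    r"([0-9]{3})-([0-9]{2})-([0-9]{4})", "123-45-6789 or 321-45-6789"
)

print(captured, non_capturing, char_class, ssn_parts, sep="\n")
```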

Integration with `Pandas` String Methods#

RegEx patterns work very well with pandas string methods. For example, the RegEx pattern ssn_pattern below matches all strings that follow the pattern of an American social security number (namely: 3 digits, dash, 2 digits, dash, 4 digits). Let us imagine you’ve surveyed a bunch of (American) people asking them their social security number. You’ve stored all their responses in the DataFrame ssn_data below. You can use str.findall() to extract all the potential social security numbers.

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# SSN Pattern: 3 digits, dash, 2 digits, dash, 4 digits

import pandas as pd

data = {
    "SSN": ["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210", "111-11-1111", 
            "I refuse to provide", "456-78-9012", "abc-12-3456", "Nice try, FBI agent","777-77-7777", 
    "seventy-seven", "987-65-4321 please or 876-54-3210", "222-22-2222", "classified information", 
            "135-79-2468", "xyz-98-7654", "I only give my SSN to aliens","555-55-5555", "fifty-five", 
            "345-67-8901 not sure or 789-01-2345", "333-33-3333", "I'm a robot", "678-90-1234", 
            "aaa-11-2222", "Ask my cat for the SSN", "444-44-4444", "forty-four", 
            "234-56-7890 maybe or 876-54-3210", "444-44-4444", "top secret", "789-01-2345", 
            "bbb-22-3333", "I only speak in emojis 🤖", "666-66-6666", "sixty-six", 
            "432-10-9876 not applicable or 765-43-2109", "555-55-5555", "I'm a ghost", "012-34-5678", 
            "ccc-33-4444", "SSN? What's that?", "111-11-1111", "eleven", 
            "543-21-0987 avoid or 890-12-3456", "666-66-6666", "I'm on the run", "345-67-8901", 
            "ddd-44-5555", "My SSN is a secret code", "999-99-9999", "ninety-nine", 
            "654-32-1098 dodge or 210-98-7654", "777-77-7777", "I'm a spy", "567-89-0123", 
            "eee-55-6666", "I'll give you my SSN if you beat me in a dance-off", "123-45-6789", 
            "twelve", "765-43-2109 sidestep or 098-76-5432", "888-88-8888", "I'm an alien", 
            "678-90-1234", "fff-66-7777", "SSN? I speak in riddles", "234-56-7890", "twenty-four", 
            "876-54-3210 escape or 765-43-2109", "999-99-9999", "I'm a time traveler", "789-01-2345", 
            "ggg-77-8888", "My SSN is a password", "345-67-8901", "thirty-four", 
            "987-65-4321 run or 876-54-3210", "000-00-0000", "I'm a superhero", 
            "012-34-5678", "hhh-88-9999", "You'll never guess my SSN"]
}
ssn_data = pd.DataFrame(data)
ssn_data

	SSN
0	000-00-0000
1	error
2	876-54-3210 invalid or 210-54-3210
3	111-11-1111
4	I refuse to provide
...	...
75	000-00-0000
76	I'm a superhero
77	012-34-5678
78	hhh-88-9999
79	You'll never guess my SSN

80 rows × 1 columns

Applying the findall function:

ssn_data["SSN"].str.findall(ssn_pattern)

0                  [000-00-0000]
1                             []
2     [876-54-3210, 210-54-3210]
3                  [111-11-1111]
4                             []
                 ...
75                 [000-00-0000]
76                            []
77                 [012-34-5678]
78                            []
79                            []
Name: SSN, Length: 80, dtype: object

If you were just interested in the digits of the social security numbers, you could use capturing groups combined with .str.extract(), as shown below.

capturing_ssn_pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
ssn_data["SSN"].str.extract(capturing_ssn_pattern) 
# Note: this just extracts the first matching pattern

	0	1	2
0	000	00	0000
1	NaN	NaN	NaN
2	876	54	3210
3	111	11	1111
4	NaN	NaN	NaN
...	...	...	...
75	000	00	0000
76	NaN	NaN	NaN
77	012	34	5678
78	NaN	NaN	NaN
79	NaN	NaN	NaN

80 rows × 3 columns
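To see roughly what these two pandas methods do for each row, the same logic can be sketched with plain re over a Python list. This is only an approximation under stated assumptions: it uses the standard-library re module, takes the first three responses from the data above, and returns None where pandas would show NaN. The helper first_groups is hypothetical.

```python
import re

ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
capturing_ssn_pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

responses = ["000-00-0000", "error", "876-54-3210 invalid or 210-54-3210"]

# Roughly what .str.findall() does: every match in every row.
all_matches = [re.findall(ssn_pattern, response) for response in responses]

def first_groups(text):
    """Roughly what .str.extract() does: groups of the first match only."""
    match = re.search(capturing_ssn_pattern, text)
    return match.groups() if match else (None, None, None)

first_match_parts = [first_groups(response) for response in responses]

print(all_matches)
print(first_match_parts)
```

The third response contains two numbers: the findall-style result keeps both, while the extract-style result keeps only the first.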