
Helpful Pandas Methods

import pandas as pd
import numpy as np
import datetime

This subchapter covers a number of simple but extremely useful pandas methods that you are likely to use often.

To start, let’s load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across US states. It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

df = pd.read_csv('data/beer_tax.csv')
df

Summary Statistics

To start, we will note that pandas is designed to work extremely well with NumPy. Most (if not all) NumPy functions can be applied on pandas DataFrames and Series. A few examples are shown below.

np.sum(df['year'])
3393246
np.mean(df[['year','c_beer','beer_tax']], axis = 0)

# What does axis = 0 represent? 
# Hint: read the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64
df[['year','c_beer','beer_tax']].mean(axis = 0) # np.mean() and DataFrame.mean() give the same result
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64

describe

.describe() helps us easily compute several statistics from a DataFrame.

df.describe()

Finally, there are a whole host of intuitive DataFrame methods that calculate statistics such as the average, standard deviation, median, percentiles, etc. While we cannot possibly describe them all here, we strongly encourage you to google around to learn more about them. Here is one link to get you started.
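As a quick sketch of a few of these methods, here is a small toy DataFrame (standing in for the beer dataset) with some of the common statistic methods applied:

```python
import pandas as pd

# A small toy DataFrame standing in for the beer dataset
toy = pd.DataFrame({'year': [1970, 1971, 1972, 1973],
                    'c_beer': [100, 200, 300, 400]})

toy['c_beer'].mean()          # average: 250.0
toy['c_beer'].median()        # middle value: 250.0
toy['c_beer'].std()           # sample standard deviation
toy['c_beer'].quantile(0.25)  # 25th percentile: 175.0
```

Each of these also works on a whole DataFrame at once (e.g. `toy.mean(numeric_only=True)`), returning one statistic per column.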

Sorting, Counting and Uniqueness

Here are some of the methods used to deal with sorting, counting and uniqueness problems. Again, this list is not meant to be comprehensive, and you are encouraged to google around for other methods you may find useful.

Sorting

As the name implies, .sort_values() can be used to sort DataFrames and Series! If sorting a DataFrame, you must specify which column to sort by, but there is naturally no such restriction for Series. Examples are shown below; note the ascending parameter.

df.sort_values('population', ascending = False)
df['population'].sort_values(ascending = True)
34        304
35        316
36        324
37        331
38        341
        ...  
165     33499
166     34000
167     34533
168     35002
169     35484
Name: population, Length: 1708, dtype: int64
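.sort_values() can also sort by several columns at once: pass a list of column names, and (optionally) a list of booleans to ascending to set a direction per column. A small sketch on a toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'st_name': ['AL', 'AK', 'AL', 'AK'],
                    'year': [1971, 1970, 1970, 1971]})

# Sort by state first, then by year within each state
toy.sort_values(['st_name', 'year'])

# Per-column sort directions: state ascending, year descending
toy.sort_values(['st_name', 'year'], ascending=[True, False])
```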

Counting

Often used for Series (although it can be used for DataFrames as well), the .value_counts() method counts the number of times a value (for Series) or row (for DataFrames) appears, and returns the values sorted from most common to least. For example,

df['st_name'].value_counts()
AL    34
VA    34
AK    34
MT    34
NE    34
NV    34
NH    34
NJ    34
NM    34
NY    34
NC    34
ND    34
OH    34
OK    34
OR    34
PA    34
RI    34
SC    34
SD    34
TN    34
TX    34
MS    34
MI    34
VT    34
MA    34
AZ    34
AR    34
CA    34
CO    34
CT    34
DE    34
DC    34
FL    34
GA    34
WI    34
ID    34
IL    34
IN    34
WV    34
WA    34
KY    34
LA    34
ME    34
MD    34
UT    34
MO    32
MN    32
KS    32
IA    32
WY    32
HI    18
Name: st_name, dtype: int64
df.value_counts()
st_name  year  c_beer  beer_tax   btax_dollars  population  salestax
AK       1970  5372    13.735660  0.5625        304         0.0         1
NY       1985  369214  1.362627   0.1238        17792       4.0         1
         1983  394026  1.376949   0.1158        17687       4.0         1
         1982  400138  1.227274   0.1000        17590       4.0         1
         1981  395584  1.302881   0.1000        17568       4.0         1
                                                                      ..
KY       1978  64414   3.296839   0.1815        3610        5.0         1
         1977  70755   3.547095   0.1815        3574        5.0         1
         1976  59481   3.777749   0.1815        3529        5.0         1
         1975  60230   3.995426   0.1815        3468        5.0         1
WY       2003  15535   0.289643   0.0450        501         4.0         1
Length: 1708, dtype: int64
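A useful variation: passing normalize=True to .value_counts() returns proportions instead of raw counts. A small sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['CA', 'CA', 'NY', 'TX'])

s.value_counts()                # raw counts, most common first
s.value_counts(normalize=True)  # proportions instead of counts; CA -> 0.5
```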

Unique Values and Duplicates

If we’re interested in just obtaining the unique values from a Series, we can use the .unique() method.

df['st_name'].unique()
array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

Alternatively, if we want to flag all the duplicated values in a Series or duplicated rows in a DataFrame, we can use .duplicated().

df['st_name'].duplicated()
0       False
1        True
2        True
3        True
4        True
        ...  
1703     True
1704     True
1705     True
1706     True
1707     True
Name: st_name, Length: 1708, dtype: bool
df.duplicated() # Our dataframe has no duplicate rows!
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool
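By default, .duplicated() marks every occurrence after the first. Its keep parameter controls this, as sketched here on a toy Series:

```python
import pandas as pd

s = pd.Series(['AL', 'AK', 'AL'])

s.duplicated()              # keep='first' (default): [False, False, True]
s.duplicated(keep='last')   # keep the last copy:     [True, False, False]
s.duplicated(keep=False)    # mark every copy:        [True, False, True]
```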

If we want to drop all the duplicated values, we can use .drop_duplicates(). This keeps only the first occurrence of each value (for a Series) or each row (for a DataFrame).

df['st_name'].drop_duplicates()
0       AL
34      AK
68      AZ
102     AR
136     CA
170     CO
204     CT
238     DE
272     DC
306     FL
340     GA
374     HI
392     ID
426     IL
460     IN
494     IA
526     KS
558     KY
592     LA
626     ME
660     MD
694     MA
728     MI
762     MN
794     MS
828     MO
860     MT
894     NE
928     NV
962     NH
996     NJ
1030    NM
1064    NY
1098    NC
1132    ND
1166    OH
1200    OK
1234    OR
1268    PA
1302    RI
1336    SC
1370    SD
1404    TN
1438    TX
1472    UT
1506    VT
1540    VA
1574    WA
1608    WV
1642    WI
1676    WY
Name: st_name, dtype: object
df.drop_duplicates() # We don't drop anything as we don't have duplicates

Missing Values

Missing values are often stored as NaN values in our dataset. While our dataset doesn’t have any missing values, we can still use .isna() to check for them.

df.isna()
df.isna().any(axis = 0) # See which columns have NaN values
# Hint: read the documentation for df.any(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html
st_name         False
year            False
c_beer          False
beer_tax        False
btax_dollars    False
population      False
salestax        False
dtype: bool
df.isna().any(axis = 1) # See which rows have NaN values
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool
print(df.isna().any(axis = 0).any()) # These can be used to check if there are NaN's anywhere in the data
print(df.isna().any(axis = 1).any()) 
False
False

.isna() also works for Series.

df['st_name'].isna()
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].isna().any()
False
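If missing values do turn up, two common companions to .isna() are .dropna(), which removes them, and .fillna(), which replaces them. A small sketch on a toy Series containing a NaN:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

s.dropna()   # drop the missing entry, leaving 1.0 and 3.0
s.fillna(0)  # or replace it with a fill value: 1.0, 0.0, 3.0
```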

String Methods

String methods are a great tool for working with string data. They only work on Series. To use them, you add a .str after your Series to indicate that you want to use a string method, followed by the method you want to use. Some of the various string methods are given below, but again this list is not comprehensive.

len

Gives the length of each string.

df['st_name'].str.len()
0       2
1       2
2       2
3       2
4       2
       ..
1703    2
1704    2
1705    2
1706    2
1707    2
Name: st_name, Length: 1708, dtype: int64

String slicing

Like list slicing, but for strings. Review the Python prereqs if you’re unsure.

df['st_name'].str[0]
0       A
1       A
2       A
3       A
4       A
       ..
1703    W
1704    W
1705    W
1706    W
1707    W
Name: st_name, Length: 1708, dtype: object

contains

Returns whether each string contains the given substring.

df['st_name'].str.contains('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

replace

Replaces occurrences of a substring with another substring.

df['st_name'].str.replace('A','a')
0       aL
1       aL
2       aL
3       aL
4       aL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object

upper and lower

Converts the strings in a Series to uppercase or lowercase.

df['st_name'].str.upper()
0       AL
1       AL
2       AL
3       AL
4       AL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object
df['st_name'].str.lower()
0       al
1       al
2       al
3       al
4       al
        ..
1703    wy
1704    wy
1705    wy
1706    wy
1707    wy
Name: st_name, Length: 1708, dtype: object

startswith and endswith

Returns whether each string starts or ends with the given substring.

df['st_name'].str.startswith('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].str.endswith('A')
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

split

Splits each string on the given delimiter, returning a list of the resulting substrings. If the delimiter doesn’t appear in a string, no split is performed. This is best understood with an example.

samp_series = df.loc[[68, 102, 1574, 592, 1302, 1642],'st_name']
samp_series
68      AZ
102     AR
1574    WA
592     LA
1302    RI
1642    WI
Name: st_name, dtype: object
samp_series.str.split('A')
68      [, Z]
102     [, R]
1574    [W, ]
592     [L, ]
1302     [RI]
1642     [WI]
Name: st_name, dtype: object

As you can see above, all the strings have been split on the letter A. Strings which don’t have the letter A weren’t split.

If we want to return the output as a DataFrame, we can set the expand parameter to be True, as shown below.

samp_series.str.split('A', expand = True)

DateTime Objects

DateTime objects are incredibly helpful for working with dates. They let us easily perform calculations such as adding or subtracting days, converting timezones, and sorting dates. We can convert a Series to DateTime objects using pd.to_datetime(), as done below.

pd.to_datetime(df['year'], format='%Y')
0      1970-01-01
1      1971-01-01
2      1972-01-01
3      1973-01-01
4      1974-01-01
          ...    
1703   1999-01-01
1704   2000-01-01
1705   2001-01-01
1706   2002-01-01
1707   2003-01-01
Name: year, Length: 1708, dtype: datetime64[ns]

Notice how we specify the format of our dates to ensure the conversion happens correctly. We don’t always need to do this (especially if we have the year, month and day in a well-known format), but it is good practice to ensure the conversion happens correctly.
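Once a Series holds datetimes, the .dt accessor pulls individual components out of every entry at once, much like .str does for strings. A small sketch on a toy Series:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['1970-01-01', '2003-06-05']))

# The .dt accessor extracts components from a datetime Series
dates.dt.year        # 1970, 2003
dates.dt.month       # 1, 6
dates.dt.day_name()  # the weekday name of each date
```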

Let us convert the year column in the DataFrame to a DateTime object and perform some calculations.

df['year'] = pd.to_datetime(df['year'], format='%Y')
df

Notice how the dates all default to January 1st. Let’s say we know that the measurements were actually taken 4 days after the given date for each year. We can implement this in our DataFrame.

df['year'] = df['year'] + pd.DateOffset(days=4)
df

Alternatively, if we instead know that the measurement was always taken on June 5th, we can specify that as well. Feel free to ignore the PerformanceWarning for now, but you can read more here if you’re interested.

df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
df
/var/folders/td/dtym05z13119cjklqb5tyd040000gn/T/ipykernel_89001/3513634716.py:1: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex.
  df['year'] = df['year'] + pd.DateOffset(month=6, day=5)

Note the difference between the day and the days parameter; days adds to the current day while day replaces the current day. Similar differences exist for month vs months, year vs years, etc.
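The difference is easy to see on a single Timestamp, sketched below:

```python
import pandas as pd

ts = pd.Timestamp('1970-01-01')

ts + pd.DateOffset(days=4)  # shifts forward:    Timestamp('1970-01-05')
ts + pd.DateOffset(day=4)   # replaces the day:  Timestamp('1970-01-04')
```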

Now, let’s look at how a datapoint is stored.

df.loc[0,'year']
Timestamp('1970-06-05 00:00:00')

At index 20, the year is 1990. Let’s subtract the dates.

df.loc[20,'year'] - df.loc[0,'year']
Timedelta('7305 days 00:00:00')

Interestingly, we can also subtract the years directly.

df.loc[20,'year'].year - df.loc[0,'year'].year
20

Let’s also try subtracting the months.

df.loc[20,'year'].month - df.loc[0,'year'].month
0

The months are subtracted directly. As both of these months were set to June, there is no difference. Instead, to see the number of months between two dates, we must subtract the years, multiply that difference by 12, and then add the difference in months. This calculation is done below.

(df.loc[20,'year'].year - df.loc[0,'year'].year) * 12 + (df.loc[20,'year'].month - df.loc[0,'year'].month)
240
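The same year-times-12-plus-months arithmetic can be delegated to pandas by converting the Timestamps to monthly Periods and subtracting those; a sketch with the same two dates:

```python
import pandas as pd

start = pd.Timestamp('1970-06-05')
end = pd.Timestamp('1990-06-05')

# Subtracting monthly Periods yields a month offset; .n gives the count
(end.to_period('M') - start.to_period('M')).n  # 240
```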

Finally, using DateTime objects also helps us sort dates easily!

df.sort_values('year')
References
  1. Chetty, R., Looney, A., & Kroft, K. (2009). Salience and Taxation: Theory and Evidence. American Economic Review, 99(4), 1145–1177. doi:10.1257/aer.99.4.1145