import pandas as pd
import numpy as np
import datetime
Helpful Pandas Methods#
This subchapter covers a number of simple but extremely useful pandas methods that you are likely to use often.
To start, let's load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across US states (all 50 states plus the District of Columbia). It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:
st_name: the state abbreviation
year: the year the data was recorded
c_beer: the quantity of beer consumed, in thousands of gallons
beer_tax: the ad valorem tax, as a percentage
btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer
population: the population of the state, in thousands
salestax: the sales tax percentage
df = pd.read_csv('data/beer_tax.csv')
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Summary Statistics#
To start, note that pandas is designed to work extremely well with NumPy: most (if not all) NumPy functions can be applied to pandas DataFrames and Series. A few examples are shown below.
np.sum(df['year'])
3393246
np.mean(df[['year','c_beer','beer_tax']], axis = 0)
# What does axis = 0 represent?
# Hint: read the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
year 1986.677986
c_beer 108031.059719
beer_tax 6.525340
dtype: float64
df[['year','c_beer','beer_tax']].mean(axis = 0) # np.mean() and DataFrame.mean() give the same result
year 1986.677986
c_beer 108031.059719
beer_tax 6.525340
dtype: float64
describe#
.describe() helps us easily compute several summary statistics from a DataFrame at once.
df.describe()
 | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|
count | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 |
mean | 1986.677986 | 108031.059719 | 6.525340 | 0.524762 | 4837.031030 | 4.288511 |
std | 9.761205 | 119280.769185 | 8.147708 | 0.523300 | 5275.527054 | 1.909266 |
min | 1970.000000 | 5372.000000 | 0.289643 | 0.045000 | 304.000000 | 0.000000 |
25% | 1978.000000 | 27886.750000 | 2.301018 | 0.188700 | 1219.000000 | 3.312250 |
50% | 1987.000000 | 68306.000000 | 3.879157 | 0.360000 | 3303.000000 | 4.462500 |
75% | 1995.000000 | 128838.500000 | 7.042739 | 0.603875 | 5703.250000 | 5.106250 |
max | 2003.000000 | 691050.000000 | 72.341130 | 2.407500 | 35484.000000 | 9.000000 |
Finally, there is a whole host of intuitive DataFrame methods that calculate statistics such as the average, standard deviation, median, and percentiles. While we cannot possibly describe them all here, we strongly encourage you to google around to learn more about them. Here is one link to get you started.
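For instance, individual statistics can be computed directly with their own methods. A small sketch (the column choice here is just illustrative):
df['c_beer'].median() # median beer consumption
df['c_beer'].std() # sample standard deviation
df['c_beer'].quantile(0.75) # 75th percentile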
Sorting, Counting and Uniqueness#
Here are some of the methods used to deal with sorting, counting and uniqueness problems. Again, this list is not meant to be comprehensive, and you are encouraged to google around for other methods you may find useful.
Sorting#
As the name implies, .sort_values() can be used to sort DataFrames and Series! If sorting a DataFrame, you need to specify which column to sort by; there is naturally no such restriction when sorting a Series. Examples are shown below; note the ascending parameter.
df.sort_values('population', ascending = False)
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
169 | CA | 2003 | 663750 | 2.896433 | 0.4500 | 35484 | 7.25 |
168 | CA | 2002 | 659475 | 2.962444 | 0.4500 | 35002 | 7.25 |
167 | CA | 2001 | 652950 | 3.009280 | 0.4500 | 34533 | 7.00 |
166 | CA | 2000 | 637080 | 3.094911 | 0.4500 | 34000 | 7.25 |
165 | CA | 1999 | 638017 | 3.198941 | 0.4500 | 33499 | 7.25 |
... | ... | ... | ... | ... | ... | ... | ... |
38 | AK | 1974 | 7598 | 10.810215 | 0.5625 | 341 | 0.00 |
37 | AK | 1973 | 6453 | 12.003234 | 0.5625 | 331 | 0.00 |
36 | AK | 1972 | 6038 | 12.749847 | 0.5625 | 324 | 0.00 |
35 | AK | 1971 | 6336 | 13.159102 | 0.5625 | 316 | 0.00 |
34 | AK | 1970 | 5372 | 13.735660 | 0.5625 | 304 | 0.00 |
1708 rows × 7 columns
df['population'].sort_values(ascending = True)
34 304
35 316
36 324
37 331
38 341
...
165 33499
166 34000
167 34533
168 35002
169 35484
Name: population, Length: 1708, dtype: int64
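You can also sort by several columns at once by passing a list of column names; rows are ordered by the first column, with ties broken by the next. The ascending parameter accepts a matching list. A minimal sketch:
df.sort_values(['st_name', 'year'], ascending=[True, False]) # A-Z by state; within each state, newest year first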
Counting#
Often used for Series (although it works for DataFrames as well), the .value_counts() method counts the number of times each value (for a Series) or row (for a DataFrame) appears, and returns the counts sorted from most common to least. For example,
df['st_name'].value_counts()
AL 34
VA 34
AK 34
MT 34
NE 34
NV 34
NH 34
NJ 34
NM 34
NY 34
NC 34
ND 34
OH 34
OK 34
OR 34
PA 34
RI 34
SC 34
SD 34
TN 34
TX 34
MS 34
MI 34
VT 34
MA 34
AZ 34
AR 34
CA 34
CO 34
CT 34
DE 34
DC 34
FL 34
GA 34
WI 34
ID 34
IL 34
IN 34
WV 34
WA 34
KY 34
LA 34
ME 34
MD 34
UT 34
MO 32
MN 32
KS 32
IA 32
WY 32
HI 18
Name: st_name, dtype: int64
df.value_counts()
st_name year c_beer beer_tax btax_dollars population salestax
AK 1970 5372 13.735660 0.5625 304 0.0 1
NY 1985 369214 1.362627 0.1238 17792 4.0 1
1983 394026 1.376949 0.1158 17687 4.0 1
1982 400138 1.227274 0.1000 17590 4.0 1
1981 395584 1.302881 0.1000 17568 4.0 1
..
KY 1978 64414 3.296839 0.1815 3610 5.0 1
1977 70755 3.547095 0.1815 3574 5.0 1
1976 59481 3.777749 0.1815 3529 5.0 1
1975 60230 3.995426 0.1815 3468 5.0 1
WY 2003 15535 0.289643 0.0450 501 4.0 1
Length: 1708, dtype: int64
Unique Values and Duplicates#
If we’re interested in just obtaining the unique values from a Series, we can use the .unique() method.
df['st_name'].unique()
array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
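Relatedly, if you only need the number of distinct values rather than the values themselves, .nunique() returns the count directly:
df['st_name'].nunique() # 51 here: the 50 states plus DC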
Alternatively, if we want to flag the duplicated values in a Series or duplicated rows in a DataFrame, we can use .duplicated(), which returns a boolean mask marking each entry that is a repeat of an earlier one.
df['st_name'].duplicated()
0 False
1 True
2 True
3 True
4 True
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: st_name, Length: 1708, dtype: bool
df.duplicated() # Our dataframe has no duplicate rows!
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Length: 1708, dtype: bool
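By default, .duplicated() flags every occurrence after the first. The keep parameter controls this behavior; a quick sketch:
df['st_name'].duplicated(keep='last') # flag all but the last occurrence of each state
df['st_name'].duplicated(keep=False) # flag every row whose state appears more than once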
If we want to drop all the duplicated values, we can use .drop_duplicates(). For a Series, this returns the values with duplicates removed (keeping the first occurrence of each); for a DataFrame, it does the same with rows.
df['st_name'].drop_duplicates()
0 AL
34 AK
68 AZ
102 AR
136 CA
170 CO
204 CT
238 DE
272 DC
306 FL
340 GA
374 HI
392 ID
426 IL
460 IN
494 IA
526 KS
558 KY
592 LA
626 ME
660 MD
694 MA
728 MI
762 MN
794 MS
828 MO
860 MT
894 NE
928 NV
962 NH
996 NJ
1030 NM
1064 NY
1098 NC
1132 ND
1166 OH
1200 OK
1234 OR
1268 PA
1302 RI
1336 SC
1370 SD
1404 TN
1438 TX
1472 UT
1506 VT
1540 VA
1574 WA
1608 WV
1642 WI
1676 WY
Name: st_name, dtype: object
df.drop_duplicates() # We don't drop anything as we don't have duplicates
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Missing Values#
Missing values are often stored as NaN values in a dataset. While our dataset doesn’t have any missing values, we can use .isna() to check for them.
df.isna()
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | False | False | False | False | False | False | False |
1704 | False | False | False | False | False | False | False |
1705 | False | False | False | False | False | False | False |
1706 | False | False | False | False | False | False | False |
1707 | False | False | False | False | False | False | False |
1708 rows × 7 columns
df.isna().any(axis = 0) # See which columns have NaN values
# Hint: read the documentation for df.any(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html
st_name False
year False
c_beer False
beer_tax False
btax_dollars False
population False
salestax False
dtype: bool
df.isna().any(axis = 1) # See which rows have NaN values
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Length: 1708, dtype: bool
print(df.isna().any(axis = 0).any()) # These can be used to check whether there are NaNs anywhere in the data
print(df.isna().any(axis = 1).any())
False
False
.isna() also works for Series.
df['st_name'].isna()
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].isna().any()
False
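Although this dataset is complete, real-world data rarely is. Once you’ve located NaNs, two common follow-ups are .dropna() and .fillna(). A minimal sketch on a small, hypothetical Series:
s = pd.Series([1.0, np.nan, 3.0])
s.dropna() # drop the missing entry
s.fillna(0.0) # or replace it with 0.0 instead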
String Methods#
String methods are a great tool for working with string data. They only work on Series: to use them, append .str to your Series to indicate that you want to use a string method, followed by the method itself. Some of the various string methods are given below, but again this list is not comprehensive.
len#
Gives the length of each string.
df['st_name'].str.len()
0 2
1 2
2 2
3 2
4 2
..
1703 2
1704 2
1705 2
1706 2
1707 2
Name: st_name, Length: 1708, dtype: int64
String slicing#
Like list slicing, but for strings. Review the Python prereqs if you’re unsure.
df['st_name'].str[0]
0 A
1 A
2 A
3 A
4 A
..
1703 W
1704 W
1705 W
1706 W
1707 W
Name: st_name, Length: 1708, dtype: object
contains#
Returns whether each string contains the given substring.
df['st_name'].str.contains('A')
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
replace#
Replaces a given substring with another substring.
df['st_name'].str.replace('A','a')
0 aL
1 aL
2 aL
3 aL
4 aL
..
1703 WY
1704 WY
1705 WY
1706 WY
1707 WY
Name: st_name, Length: 1708, dtype: object
upper and lower#
Converts the strings in a Series to uppercase or lowercase.
df['st_name'].str.upper()
0 AL
1 AL
2 AL
3 AL
4 AL
..
1703 WY
1704 WY
1705 WY
1706 WY
1707 WY
Name: st_name, Length: 1708, dtype: object
df['st_name'].str.lower()
0 al
1 al
2 al
3 al
4 al
..
1703 wy
1704 wy
1705 wy
1706 wy
1707 wy
Name: st_name, Length: 1708, dtype: object
startswith and endswith#
Returns whether each string starts or ends with the given substring.
df['st_name'].str.startswith('A')
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].str.endswith('A')
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
split#
Splits each string on the given delimiter, returning a list of the resulting substrings. If the delimiter doesn’t appear in a string, no split is performed. This is best understood by looking at an example.
samp_series = df.loc[[68, 102, 1574, 592, 1302, 1642],'st_name']
samp_series
68 AZ
102 AR
1574 WA
592 LA
1302 RI
1642 WI
Name: st_name, dtype: object
samp_series.str.split('A')
68 [, Z]
102 [, R]
1574 [W, ]
592 [L, ]
1302 [RI]
1642 [WI]
Name: st_name, dtype: object
As you can see above, all the strings have been split on the letter A. Strings that don’t contain the letter A weren’t split.
If we want to return the output as a DataFrame, we can set the expand parameter to True, as shown below.
samp_series.str.split('A', expand = True)
 | 0 | 1 |
---|---|---|
68 |  | Z |
102 |  | R |
1574 | W |  |
592 | L |  |
1302 | RI | None |
1642 | WI | None |
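Since most string methods return boolean Series, they pair naturally with boolean indexing to filter rows. For example (a quick sketch on the full DataFrame):
df[df['st_name'].str.startswith('N')] # keep only rows whose state abbreviation starts with 'N'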
DateTime Objects#
DateTime objects are incredibly helpful for working with dates. They make it easy to perform calculations like adding or subtracting days and offsets, converting timezones, and sorting dates. We can convert a Series to datetime values using pd.to_datetime(), as done below.
pd.to_datetime(df['year'], format='%Y')
0 1970-01-01
1 1971-01-01
2 1972-01-01
3 1973-01-01
4 1974-01-01
...
1703 1999-01-01
1704 2000-01-01
1705 2001-01-01
1706 2002-01-01
1707 2003-01-01
Name: year, Length: 1708, dtype: datetime64[ns]
Notice how we specify the format of our dates to ensure the conversion happens correctly. We don’t always need to do this (especially if we have the year, month and day in a well-known format), but it is good practice to be explicit.
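The format string uses the standard strftime codes. For instance, to parse dates written as month/day/year (a hypothetical example, not from this dataset):
pd.to_datetime(pd.Series(['06/05/1970', '12/31/2003']), format='%m/%d/%Y')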
Let us convert the year column in the DataFrame to datetime and perform some calculations.
df['year'] = pd.to_datetime(df['year'], format='%Y')
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-01-01 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-01-01 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-01-01 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-01-01 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-01-01 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-01-01 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-01-01 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-01-01 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-01-01 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-01-01 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Notice how the dates all default to January 1st. Let’s say we know that the measurements were actually taken 4 days after the given date for each year. We can implement this in our DataFrame.
df['year'] = df['year'] + pd.DateOffset(days=4)
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-01-05 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-01-05 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-01-05 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-01-05 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-01-05 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-01-05 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-01-05 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-01-05 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-01-05 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-01-05 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Alternatively, if we instead know that the measurement was always taken on June 5th, we can specify that as well. Feel free to ignore the PerformanceWarning for now, but you can read more here if you’re interested.
df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
df
/var/folders/td/dtym05z13119cjklqb5tyd040000gn/T/ipykernel_89001/3513634716.py:1: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex.
df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-06-05 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-06-05 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-06-05 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-06-05 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-06-05 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-06-05 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-06-05 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-06-05 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-06-05 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-06-05 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Note the difference between the day and the days parameters: days adds to the current day, while day replaces it. Similar differences exist for month vs months, year vs years, etc.
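To make the distinction concrete, here is each offset applied to a single Timestamp (the results are noted in comments):
ts = pd.Timestamp('2000-01-10')
ts + pd.DateOffset(days=4) # adds 4 days: Timestamp('2000-01-14 00:00:00')
ts + pd.DateOffset(day=4) # replaces the day: Timestamp('2000-01-04 00:00:00')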
Now, let’s look at how a datapoint is stored.
df.loc[0,'year']
Timestamp('1970-06-05 00:00:00')
At index 20, the year is 1990. Let’s subtract the dates.
df.loc[20,'year'] - df.loc[0,'year']
Timedelta('7305 days 00:00:00')
Interestingly, we can also subtract the years directly.
df.loc[20,'year'].year - df.loc[0,'year'].year
20
Let’s also try subtracting the months.
df.loc[20,'year'].month - df.loc[0,'year'].month
0
The months are subtracted directly, and since both dates were set to June, the difference is zero. To find the number of months between two dates, we instead subtract the years, multiply by 12, and add the difference in months. The calculation is done below.
(df.loc[20,'year'].year - df.loc[0,'year'].year) * 12 + (df.loc[20,'year'].month - df.loc[0,'year'].month)
240
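The same arithmetic works element-wise on the whole column via the .dt accessor, which exposes the year, month, and other components of a datetime Series. A sketch:
(df['year'].dt.year - df.loc[0, 'year'].year) * 12 + (df['year'].dt.month - df.loc[0, 'year'].month) # months since the first row, computed for every row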
Finally, using DateTime objects also helps us sort dates easily!
df.sort_values('year')
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-06-05 | 33098 | 72.341130 | 2.3700 | 3450 | 4.00 |
170 | CO | 1970-06-05 | 42145 | 4.120698 | 0.1350 | 2224 | 3.00 |
592 | LA | 1970-06-05 | 68578 | 25.456755 | 0.8340 | 3650 | 2.00 |
1302 | RI | 1970-06-05 | 20360 | 5.686564 | 0.1863 | 951 | 5.00 |
860 | MT | 1970-06-05 | 17179 | 6.648059 | 0.2178 | 697 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... |
135 | AR | 2003-06-05 | 52045 | 3.507258 | 0.5449 | 2726 | 5.13 |
893 | MT | 2003-06-05 | 26100 | 2.008837 | 0.3121 | 918 | 0.00 |
1471 | TX | 2003-06-05 | 558837 | 2.867468 | 0.4455 | 22119 | 6.25 |
1675 | WI | 2003-06-05 | 151000 | 0.934582 | 0.1452 | 5472 | 5.00 |
1707 | WY | 2003-06-05 | 15535 | 0.289643 | 0.0450 | 501 | 4.00 |
1708 rows × 7 columns