Joins

Contents

import pandas as pd

Joins#

In this subchapter, we will go over how to combine different DataFrames in pandas.

To start, let’s load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across 50 continental states in the US. It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

df = pd.read_csv('data/beer_tax.csv')
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

However, as our purpose is to discuss combining different DataFrames, let us now also load in two additional datasets. The first dataset has the countrywide CPI data for the US, sourced from the Bureau of Labor Statistics. The second dataset has statewide CPI data for some states, sourced from The Slope of the Phillips Curve: Evidence from U.S. States by Jonathon Hazell, Juan Herreño, Emi Nakamura, Jón Steinsson (QJE 2022, link).

us_cpi = pd.read_csv('data/us_cpi.csv')
us_cpi.head()
year US CPI
0 1967 35.980357
1 1968 38.061600
2 1969 39.962510
3 1970 42.184851
4 1971 44.389292
state_cpi = pd.read_csv('data/state_cpi.csv')
state_cpi.head()
year state State CPI
0 1989 AL 4.002484
1 1990 AL 3.549080
2 1991 AL 3.598064
3 1992 AL 1.077842
4 1993 AL 2.445113

To join two DataFrames, they must have atleast 1 column that shares the same data, so we can match the different rows appropriately. These columns are often called as foreign keys that map from one table to another. Both the US-wide and statewide CPI tables share the year column with the beer_tax DataFrame, and the statewide CPI table also shares the state column with the beer_tax DataFrame.

First, let’s try joining the US Inflation data with the beer tax DataFrame. We can use pd.merge() to accomplish this. A simple join is done below.

pd.merge(left = df, right = us_inf, left_on = 'year', right_on = 'year')
st_name year c_beer beer_tax btax_dollars population salestax US Inflation
0 AL 1970 33098 72.341130 2.370000 3450 4.0 42.184851
1 AK 1970 5372 13.735660 0.562500 304 0.0 42.184851
2 AZ 1970 38604 5.494264 0.180000 1795 3.0 42.184851
3 AR 1970 22378 16.632357 0.544900 1930 3.0 42.184851
4 CA 1970 363645 2.747132 0.090000 20023 5.0 42.184851
... ... ... ... ... ... ... ... ...
1703 VA 2003 151706 4.093625 0.636000 7386 4.5 254.390503
1704 WA 2003 116550 3.769560 0.585652 6131 6.5 254.390503
1705 WV 2003 41400 2.569457 0.399200 1810 6.0 254.390503
1706 WI 2003 151000 0.934582 0.145200 5472 5.0 254.390503
1707 WY 2003 15535 0.289643 0.045000 501 4.0 254.390503

1708 rows × 8 columns

In the pd.merge() function;

  • the left parameter represents one of the DataFrames to combine, this DataFrame is referred to as the ‘left’ DataFrame.

  • the right parameter represents the other DataFrame to combine, this DataFrame is referred to as the ‘right’ DataFrame.

  • the left_on parameter represents the column in the left DataFrame which contains the shared data.

  • the right_on parameter represents the column in the right DataFrame which contains the shared data.

The left_on and right_on parameters can also take in multiple columns, which comes in handy when we do the merge with statewide inflation data.

mult_col_merge = pd.merge(left = df, right = state_inf, left_on = ['st_name','year'], right_on = ['state','year'])
mult_col_merge
st_name year c_beer beer_tax btax_dollars population salestax state State CPI
0 AL 1989 79990 22.635775 2.3700 4030 4.0 AL 4.002484
1 AL 1990 80899 21.475412 2.3700 4050 4.0 AL 3.549080
2 AL 1991 72921 20.608194 2.3700 4099 4.0 AL 3.598064
3 AL 1992 75192 20.005960 2.3700 4154 4.0 AL 1.077842
4 AL 1993 85380 19.424470 2.3700 4214 4.0 AL 2.445113
... ... ... ... ... ... ... ... ... ...
662 WI 1999 146162 1.032192 0.1452 5333 5.0 WI 2.412967
663 WI 2000 147700 0.998625 0.1452 5374 5.0 WI 3.154635
664 WI 2001 149332 0.970995 0.1452 5405 5.0 WI 2.186323
665 WI 2002 153076 0.955882 0.1452 5440 5.0 WI 1.825251
666 WI 2003 151000 0.934582 0.1452 5472 5.0 WI 3.145692

667 rows × 9 columns

Notice how the output includes both the st_name and state columns. However, these columns are always equal to each other.

(mult_col_merge['st_name'] != mult_col_merge['state']).any()
# We're checking to see if there are any values where the two columns are different
False

While the merges above were useful, they ignored a crucial part of doing merges: the how parameter.

Types of Joins#

It is very important to understand the four types of join: inner, left, right and outer.

Remember, you are merging two tables on a set of columns. It is possible that one table has a certain value for those shared columns that the other table doesn’t. For example, we have US CPI data from 1967-2023, although our beer dataset only goes from 1970-2003. So, the US CPI DataFrame has several rows which have no match in the beer tax dataframe. The differences between the types of join come from how we deal with these missing values.

  1. An inner join only returns rows where their is a shared value in both tables, it doesn’t allow any NaN’s to be output in the shared columns. The default join is the inner join, so it is equivalent to the merges we did before.

  2. A left join keeps all of the values from the shared column in the left table, even if they do not exist in the right table. For example, the beer tax table contains data on plenty of state/year combinations that the statewide CPI dataset doesn’t contain. Let’s do a left join on these two datasets. To specify which join type we would like to use, we only need to include a how parameter in our pd.merge() call.

pd.merge(left = df, right = state_inf, left_on = ['st_name','year'], right_on = ['state','year'], how = 'left')
st_name year c_beer beer_tax btax_dollars population salestax state State CPI
0 AL 1970 33098 72.341130 2.370 3450 4.0 NaN NaN
1 AL 1971 37598 69.304600 2.370 3497 4.0 NaN NaN
2 AL 1972 42719 67.149190 2.370 3539 4.0 NaN NaN
3 AL 1973 46203 63.217026 2.370 3580 4.0 NaN NaN
4 AL 1974 49769 56.933796 2.370 3627 4.0 NaN NaN
... ... ... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0 NaN NaN
1704 WY 2000 12595 0.309491 0.045 494 4.0 NaN NaN
1705 WY 2001 12808 0.300928 0.045 494 4.0 NaN NaN
1706 WY 2002 13191 0.296244 0.045 499 4.0 NaN NaN
1707 WY 2003 15535 0.289643 0.045 501 4.0 NaN NaN

1708 rows × 9 columns

Notice how there are so many NaN values where we have beer data but no state-wide CPI data! As a matter of fact, we’re missing statewide inflation data for over 60% of the state-year combinations in the beer tax dataset.

pd.merge(left = df, right = state_inf, left_on = ['st_name','year'], right_on = ['state','year'], how = 'left')\
['State CPI'].isna().mean()
0.6094847775175644
  1. Similar to a left join, a right join keeps all of the values from the shared column in the right table, even if they do not exist in the left table. For example, let us do a right join with the US inflation dataset, which has more data than the beer tax dataset.

pd.merge(left = df, right = us_inf, left_on = 'year', right_on = 'year', how = 'right')
st_name year c_beer beer_tax btax_dollars population salestax US Inflation
0 NaN 1967 NaN NaN NaN NaN NaN 35.980357
1 NaN 1968 NaN NaN NaN NaN NaN 38.061600
2 NaN 1969 NaN NaN NaN NaN NaN 39.962510
3 AL 1970 33098.0 72.34113 2.3700 3450.0 4.0 42.184851
4 AK 1970 5372.0 13.73566 0.5625 304.0 0.0 42.184851
... ... ... ... ... ... ... ... ...
1726 NaN 2019 NaN NaN NaN NaN NaN 358.843059
1727 NaN 2020 NaN NaN NaN NaN NaN 367.804604
1728 NaN 2021 NaN NaN NaN NaN NaN 375.806302
1729 NaN 2022 NaN NaN NaN NaN NaN 388.034036
1730 NaN 2023 NaN NaN NaN NaN NaN 403.685228

1731 rows × 8 columns

As expected, we have US inflation data for several years before and after the end of the beer tax dataset, resulting in NaN values for those years!

  1. Finally, an outer join (or full join) keeps all the values from both the tables, regardless of whether or not they exist in the other table. In the example shown below, the beer tax dataset has information on plenty of state-year combinations not present in the statewide dataset, but the statewide dataset has data upto 2017 for some states, while the beer tax dataset ends at 2003 for all states.

pd.merge(left = df, right = state_inf, left_on = ['st_name','year'], right_on = ['state','year'], how = 'outer')
st_name year c_beer beer_tax btax_dollars population salestax state State CPI
0 AL 1970 33098.0 72.341130 2.37 3450.0 4.0 NaN NaN
1 AL 1971 37598.0 69.304600 2.37 3497.0 4.0 NaN NaN
2 AL 1972 42719.0 67.149190 2.37 3539.0 4.0 NaN NaN
3 AL 1973 46203.0 63.217026 2.37 3580.0 4.0 NaN NaN
4 AL 1974 49769.0 56.933796 2.37 3627.0 4.0 NaN NaN
... ... ... ... ... ... ... ... ... ...
2211 NaN 2013 NaN NaN NaN NaN NaN WI 1.336107
2212 NaN 2014 NaN NaN NaN NaN NaN WI 1.655367
2213 NaN 2015 NaN NaN NaN NaN NaN WI 0.277752
2214 NaN 2016 NaN NaN NaN NaN NaN WI 0.111245
2215 NaN 2017 NaN NaN NaN NaN NaN WI 2.100405

2216 rows × 9 columns

Finally, here are a couple of interesting visualizations of the different types of joins.