import pandas as pd
import numpy as np

GroupBys and Pivot Tables#

This subchapter goes over groupbys and pivot tables, two incredibly useful pandas tools.

To start, let’s load in the same dataset as in the first subchapter of the Pandas chapter. As a reminder, this dataset has beer sales across the 50 US states plus Washington, DC. It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

df = pd.read_csv('data/beer_tax.csv')
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

GroupBys#

Let’s say we’re interested in how the average tax differs by state. Right now, we have a fair amount of data for every state, so it’d be great if we could combine the data for each state somehow. .groupby() helps us do exactly that.

When we group by a column ‘x’, pandas splits the data into groups, one for each unique value in the column. Each group’s subframe contains all the rows corresponding to that unique value from the original DataFrame. The code below prints the first 5 rows of the subframes corresponding to the first 6 states in our dataset.

# Iterating over a groupby yields (group name, subframe) pairs
i = 0
for name, frame in df.groupby('st_name'):
    if i >= 6:
        break  # stop once the first 6 states have been printed
    print(f'The first 5 rows from the subframe for {name} are: ')
    print(frame.head())
    print('________________________________________________________________________')
    i += 1
The first 5 rows from the subframe for AK are: 
   st_name  year  c_beer   beer_tax  btax_dollars  population  salestax
34      AK  1970    5372  13.735660        0.5625         304       0.0
35      AK  1971    6336  13.159102        0.5625         316       0.0
36      AK  1972    6038  12.749847        0.5625         324       0.0
37      AK  1973    6453  12.003234        0.5625         331       0.0
38      AK  1974    7598  10.810215        0.5625         341       0.0
________________________________________________________________________
The first 5 rows from the subframe for AL are: 
  st_name  year  c_beer   beer_tax  btax_dollars  population  salestax
0      AL  1970   33098  72.341130          2.37        3450       4.0
1      AL  1971   37598  69.304600          2.37        3497       4.0
2      AL  1972   42719  67.149190          2.37        3539       4.0
3      AL  1973   46203  63.217026          2.37        3580       4.0
4      AL  1974   49769  56.933796          2.37        3627       4.0
________________________________________________________________________
The first 5 rows from the subframe for AR are: 
    st_name  year  c_beer   beer_tax  btax_dollars  population  salestax
102      AR  1970   22378  16.632357        0.5449        1930       3.0
103      AR  1971   25020  15.934210        0.5449        1972       3.0
104      AR  1972   25614  15.438649        0.5449        2019       3.0
105      AR  1973   27946  14.534582        0.5449        2059       3.0
106      AR  1974   30915  13.089970        0.5449        2101       3.0
________________________________________________________________________
The first 5 rows from the subframe for AZ are: 
   st_name  year  c_beer  beer_tax  btax_dollars  population  salestax
68      AZ  1970   38604  5.494264          0.18        1795       3.0
69      AZ  1971   41837  5.263641          0.18        1896       3.0
70      AZ  1972   47949  5.099939          0.18        2008       3.0
71      AZ  1973   53380  4.801294          0.18        2124       3.0
72      AZ  1974   58188  4.324086          0.18        2223       3.0
________________________________________________________________________
The first 5 rows from the subframe for CA are: 
    st_name  year  c_beer  beer_tax  btax_dollars  population  salestax
136      CA  1970  363645  2.747132          0.09       20023     5.000
137      CA  1971  380397  2.631820          0.09       20346     5.000
138      CA  1972  401928  2.549970          0.09       20585     5.000
139      CA  1973  417463  2.400647          0.09       20869     5.167
140      CA  1974  464237  2.162043          0.09       21174     5.250
________________________________________________________________________
The first 5 rows from the subframe for CO are: 
    st_name  year  c_beer  beer_tax  btax_dollars  population  salestax
170      CO  1970   42145  4.120698         0.135        2224       3.0
171      CO  1971   45359  3.947731         0.135        2304       3.0
172      CO  1972   50444  3.824955         0.135        2405       3.0
173      CO  1973   55332  3.600970         0.135        2496       3.0
174      CO  1974   60162  3.243065         0.135        2541       3.0
________________________________________________________________________
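By the way, if you only want the subframe for a single group, there’s no need to loop at all: groupby objects have a .get_group() method that returns it directly. A minimal sketch:

df.groupby('st_name').get_group('CA').head()  # the same subframe printed above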

Aggregate#

Now, we wish to collapse each group into a single row, and return a DataFrame that has one such row per group. We can do that with the help of an aggregation function, which tells pandas how to combine the rows. For example, the aggregation function .mean() tells pandas to take the mean of the rows in each group. We do this below.

df.groupby('st_name').mean()
year c_beer beer_tax btax_dollars population salestax
st_name
AK 1986.5 12131.117647 7.208826 0.745259 503.529412 0.000000
AL 1986.5 72660.411765 31.817266 2.370000 4022.294118 4.000000
AR 1986.5 41951.705882 7.315287 0.544900 2359.441176 3.813235
AZ 1986.5 93827.470588 3.282313 0.283235 3490.000000 4.494118
CA 1986.5 585206.794118 2.189079 0.222353 27467.588235 6.320794
CO 1986.5 81713.058824 2.189892 0.171397 3282.941176 3.013235
CT 1986.5 59092.058824 3.457670 0.307591 3228.500000 6.617647
DC 1986.5 16505.735294 2.320071 0.180018 633.676471 6.823529
DE 1986.5 16022.147059 2.557550 0.226297 659.970588 0.000000
FL 1986.5 299864.882353 15.981612 1.290588 11903.176471 5.073529
GA 1986.5 125192.500000 30.609017 2.280000 6346.941176 3.411765
HI 1994.5 30017.666667 14.083035 1.994400 1168.833333 4.000000
IA 1987.5 67063.187500 4.402103 0.376525 2868.093750 3.968750
ID 1986.5 22516.970588 4.530940 0.337500 1025.823529 4.117647
IL 1986.5 271348.970588 2.182229 0.167700 11664.470588 5.595588
IN 1986.5 113699.176471 3.078238 0.241494 5619.705882 4.382353
KS 1987.5 49179.375000 4.742910 0.393400 2477.718750 6.625000
KY 1986.5 69095.970588 2.436639 0.181500 3712.764706 5.382353
LA 1986.5 101652.470588 11.196456 0.834000 4223.088235 3.529412
MA 1986.5 132109.176471 3.035118 0.232885 5973.147059 4.588235
MD 1986.5 98233.852941 2.543661 0.196544 4630.058824 4.764706
ME 1986.5 25878.147059 8.673873 0.694303 1174.176471 5.235294
MI 1986.5 211427.117647 6.139256 0.457300 9380.529412 4.558824
MN 1987.5 98194.843750 3.776908 0.312781 4369.312500 6.984375
MO 1987.5 120355.656250 1.673511 0.135000 5147.312500 3.796875
MS 1986.5 54309.264706 12.940641 0.964768 2587.617647 5.897059
MT 1986.5 23113.264706 3.621642 0.287794 815.294118 0.000000
NC 1986.5 126038.911765 16.105477 1.199318 6554.441176 3.367647
ND 1986.5 15923.823529 4.833003 0.360000 647.588235 5.411765
NE 1986.5 39524.617647 4.398752 0.390994 1603.352941 3.888471
NH 1986.5 33263.970588 6.042833 0.536862 1025.823529 0.000000
NJ 1986.5 151812.058824 1.564956 0.149721 7746.117647 5.617647
NM 1986.5 38953.970588 4.713016 0.472835 1467.323529 4.397059
NV 1986.5 39440.941176 2.162150 0.175865 1170.235294 5.112132
NY 1986.5 353700.823529 2.300439 0.221374 18200.617647 3.941176
OH 1986.5 249413.294118 4.913885 0.371435 10950.147059 4.617647
OK 1986.5 57819.441176 10.633824 0.831315 3131.647059 3.235294
OR 1986.5 62440.235294 2.045688 0.169976 2810.235294 0.000000
PA 1986.5 285304.941176 2.416501 0.180000 11993.882353 6.000000
RI 1986.5 23093.794118 2.739664 0.217518 991.558824 6.176471
SC 1986.5 75692.176471 23.198414 1.728000 3387.911765 4.558824
SD 1986.5 15654.676471 8.112361 0.608353 708.176471 4.029412
TN 1986.5 97898.352941 3.531423 0.271576 4866.147059 5.014706
TX 1986.5 444450.558824 5.335410 0.413312 16405.764706 5.003676
UT 1986.5 22289.352941 6.028931 0.575079 1688.323529 4.834559
VA 1986.5 128329.147059 7.745603 0.605824 5949.764706 4.176471
VT 1986.5 13341.676471 7.750310 0.584868 540.764706 3.970588
WA 1986.5 99727.617647 2.896268 0.291253 4664.941176 5.814706
WI 1986.5 147105.176471 1.949311 0.145200 4891.000000 4.617647
WV 1986.5 35606.941176 5.359263 0.399200 1847.058824 4.735294
WY 1987.5 12327.375000 0.557837 0.045000 462.343750 3.250000

You can visualize the entire process (although using only a subset of the data) with Pandas Tutor. In general, we encourage you to play around with Pandas Tutor to visualize how your code works!

There are a lot of different aggregation functions you can use! We’ve listed a few common ones below, but the list is by no means exhaustive.

Built-in aggregation functions:

  • .mean()

  • .median()

  • .sum()

  • .count()

    • Note: It may appear as if .value_counts() and .count() return the same data. However, while .value_counts() returns a Series sorted from most common to least, the aggregation function .count() returns a DataFrame ordered by the index (see the sketch after this list).

  • .max()

  • .min()

  • .std()

  • .var()
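To make the note on .value_counts() versus .count() concrete, here is a minimal sketch of the difference using our dataset:

# Series of row counts per state, sorted from most common to least common
df['st_name'].value_counts()

# DataFrame of per-column counts per state, ordered by the index (AK, AL, ...)
df.groupby('st_name').count()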

You can also use other functions, such as those defined by NumPy. Examples include:

  • .agg(np.mean)

  • .agg(np.prod)

  • .agg(np.cumsum)

    • returns the cumulative sum within each group. Note in the output below that this produces one row per original row (all 1708 of them) rather than one row per group, so it acts more like a transformation than a true aggregation

df.groupby('st_name').agg(np.cumsum)
year c_beer beer_tax btax_dollars population salestax
0 1970 33098 72.341130 2.370 3450 4.0
1 3941 70696 141.645730 4.740 6947 8.0
2 5913 113415 208.794920 7.110 10486 12.0
3 7886 159618 272.011946 9.480 14066 16.0
4 9860 209387 328.945742 11.850 17693 20.0
... ... ... ... ... ... ...
1703 55594 340347 16.654479 1.260 12807 88.0
1704 57594 352942 16.963970 1.305 13301 92.0
1705 59595 365750 17.264899 1.350 13795 96.0
1706 61597 378941 17.561143 1.395 14294 100.0
1707 63600 394476 17.850786 1.440 14795 104.0

1708 rows × 6 columns
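As an aside, pandas also defines cumulative sums directly on groupby objects, so a sketch of an equivalent call is:

df.groupby('st_name').cumsum()  # same result as .agg(np.cumsum) above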

Finally, if you like, you can also define your own aggregation function! An example is given below, where we define last_10_vs_first_10 to return the average value over the last 10 years minus the average value over the first 10 years. Remember, your aggregation function must be able to collapse a column of data into a single value.

def last_10_vs_first_10(obj):
    # Mean of the last 10 rows minus the mean of the first 10 rows, per column
    return obj.iloc[-10:].mean() - obj.iloc[:10].mean()

df.groupby('st_name').agg(last_10_vs_first_10)
year c_beer beer_tax btax_dollars population salestax
st_name
AK 24.0 6799.2 -5.255919 0.387000 262.1 0.0000
AL 24.0 41128.7 -38.866784 0.000000 740.4 0.0000
AR 24.0 20500.6 -8.936081 0.000000 514.8 1.7150
AZ 24.0 69478.1 -1.662230 0.180000 2714.2 1.6800
CA 24.0 173227.7 1.103402 0.360000 11827.6 1.6833
CO 24.0 40179.1 -2.181436 0.029250 1542.9 -0.0300
CT 24.0 2883.6 -1.191767 0.262600 311.3 -0.2500
DC 24.0 -2372.4 -2.397173 0.039200 -136.6 2.2000
DE 24.0 6749.2 -0.899509 0.206800 187.7 0.0000
FL 24.0 194335.7 -13.366342 0.420000 7388.0 2.0000
GA 24.0 88146.5 -37.390833 0.000000 2928.6 1.0000
HI 8.0 -1523.8 -2.805578 0.069790 90.0 0.0000
IA 22.0 4125.4 -3.340308 0.112500 10.3 2.0000
ID 24.0 6536.0 -5.534827 0.000000 438.7 2.0000
IL 24.0 33124.9 -2.352431 0.034680 1011.3 1.4500
IN 24.0 25558.0 -3.001028 0.051340 651.0 1.9000
KS 22.0 7374.4 -4.442635 0.037120 346.3 4.0000
KY 24.0 17057.1 -2.976507 0.000000 546.1 1.0000
LA 24.0 35240.5 -13.677173 0.000000 556.6 1.1000
MA 24.0 1161.4 -3.315096 0.022490 537.7 1.4000
MD 24.0 5272.9 -2.726227 0.020250 1139.4 0.8000
ME 24.0 3304.3 -7.612615 0.225000 202.7 0.6000
MI 24.0 -4159.0 -7.499486 0.000000 767.3 1.9000
MN 22.0 18407.3 -3.508841 0.043600 862.3 4.5000
MO 22.0 27237.9 -1.777010 0.000000 681.2 1.1775
MS 24.0 29677.5 -15.748429 0.000000 422.0 2.0000
MT 24.0 3531.8 -3.334571 0.070710 150.7 0.0000
NC 24.0 81130.0 -19.694798 -0.002319 2354.6 1.0500
ND 24.0 2483.5 -5.903816 0.000000 4.6 3.2000
NE 24.0 5797.6 -1.511346 0.292690 159.3 2.3250
NH 24.0 11805.3 -2.105625 0.374620 390.7 0.0000
NJ 24.0 2076.7 0.169546 0.195100 1005.8 1.2000
NM 24.0 19793.1 2.314072 0.735750 642.7 1.0250
NV 24.0 40810.7 -1.730302 0.067500 1266.5 3.3500
NY 24.0 -43471.5 0.155826 0.247060 804.5 0.2000
OH 24.0 43336.6 -5.628816 0.038880 554.4 1.0000
OK 24.0 22580.2 -10.607037 0.181300 642.7 2.5000
OR 24.0 22796.4 -1.436705 0.063660 1035.0 0.0000
PA 24.0 12249.7 -2.951908 0.000000 376.7 0.0000
RI 24.0 -195.2 -2.530756 0.073200 84.7 1.7000
SC 24.0 50838.6 -28.338316 0.000000 1076.5 1.0000
SD 24.0 6363.1 -9.672514 0.018150 68.3 0.0000
TN 24.0 44875.0 -3.770087 0.038987 1326.5 2.4000
TX 24.0 224717.2 -5.557496 0.074200 7861.6 2.4000
UT 24.0 10895.3 0.418456 0.573400 943.3 0.3125
VA 24.0 50298.4 -7.734889 0.102600 1964.4 0.7000
VT 24.0 2383.1 -8.982539 0.033800 126.3 2.0000
WA 24.0 33204.6 1.306892 0.435141 2147.6 1.6200
WI 24.0 9254.4 -2.381206 0.000000 754.6 1.0000
WV 24.0 11053.0 -6.546676 0.000000 -29.4 3.0000
WY 22.0 369.9 -0.592337 0.000000 80.0 0.8000

As a general tip, whenever you’re trying to calculate differences across a categorical variable, consider whether a groupby could be helpful. Then, determine which column you would group by (normally the categorical column you’re interested in). Finally, determine which aggregation function would be most helpful.

Moreover, if we only want to see the results of the .groupby() on a single column, it is common practice to select that column before applying the aggregation function, so pandas doesn’t waste time aggregating columns we don’t need. For example, if we just want to see the median beer consumption for each state, we could do df.groupby('st_name')['c_beer'].median().

df.groupby('st_name')['c_beer'].median()
st_name
AK     13666.5
AL     74930.5
AR     44731.0
AZ     98964.5
CA    620989.5
CO     83391.0
CT     58610.5
DC     16862.5
DE     16678.0
FL    327415.0
GA    131193.0
HI     29817.0
IA     66664.0
ID     23272.5
IL    280701.5
IN    119674.5
KS     49094.0
KY     71587.5
LA    107196.0
MA    131876.5
MD     98215.5
ME     26079.0
MI    210840.0
MN     98980.5
MO    122805.0
MS     56514.0
MT     23274.0
NC    127415.0
ND     15935.5
NE     39395.5
NH     35645.5
NJ    150161.5
NM     41557.5
NV     36316.0
NY    358784.0
OH    257559.5
OK     60139.0
OR     63020.0
PA    285790.5
RI     22784.5
SC     78973.0
SD     15490.0
TN    101125.0
TX    472811.5
UT     22402.5
VA    138821.0
VT     13601.0
WA    101889.0
WI    147526.5
WV     37934.5
WY     12071.0
Name: c_beer, dtype: float64

Selecting the relevant columns like this is especially important if your DataFrame contains other string columns and you’re attempting to apply an aggregation function that doesn’t work with strings (for example, it makes no sense to try to take the mean of strings). If you try to use an aggregation function on a type of data it cannot work with, your code will error.

df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns
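In our DataFrame the only string column, st_name, happens to be the grouping key itself, so .mean() works above; when that isn’t the case, here is a sketch of two ways to stay safe (the numeric_only keyword assumes a recent pandas version):

# Option 1: select just the numeric columns you care about before aggregating
df.groupby('st_name')[['c_beer', 'population']].mean()

# Option 2: tell pandas to skip non-numeric columns entirely
df.groupby('st_name').mean(numeric_only=True)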

Filter#

Aggregation functions collapse each group into a single row. Filter functions instead return a True/False for each group, and keep only the rows from the groups for which the function returned True. For example, let’s say we want to keep all the states where the beer_tax was at least 50% at some point. We could accomplish that using the following line of code.

df.groupby('st_name').filter(lambda x: (x['beer_tax'] >= 50).any())
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1365 SC 1999 101782 12.283935 1.728 3975 5.0
1366 SC 2000 104116 11.884457 1.728 4023 5.0
1367 SC 2001 105525 11.555637 1.728 4060 5.0
1368 SC 2002 108000 11.375785 1.728 4104 5.0
1369 SC 2003 103058 11.122302 1.728 4147 5.0

102 rows × 7 columns

In the above code, the lambda defines a Python function that takes in one group’s subframe, named x. The function extracts the beer_tax Series from that subframe, compares it to 50, and returns a Boolean Series indicating whether each value is at least 50. Then, .any() outputs whether any (in other words: one or more) of those comparisons are True. If yes, the lambda function outputs True (meaning the filter keeps the rows from that group), and if no, the lambda function outputs False (meaning the filter removes the rows from that group). As you can see from the code below, only Alabama, Georgia, and South Carolina have ever had a beer_tax of at least 50%.

df.groupby('st_name').filter(lambda x: (x['beer_tax'] >= 50).any())['st_name'].unique()
array(['AL', 'GA', 'SC'], dtype=object)
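As an aside, you can get the same result without .filter() by using .transform(), which broadcasts a per-group statistic back onto every row, and then masking. A sketch:

# True for every row whose state has ever had a beer_tax of at least 50
mask = df.groupby('st_name')['beer_tax'].transform('max') >= 50
df[mask]['st_name'].unique()  # array(['AL', 'GA', 'SC'], dtype=object)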

Multiple Columns and Aggregations#

Finally, we can also group by multiple columns. To demonstrate this, let us first create another categorical variable by adding a decade column representing which decade the data is from.

df['decade'] = df['year'] // 10 * 10  # floor each year to the start of its decade, e.g. 1974 // 10 * 10 = 1970
df
st_name year c_beer beer_tax btax_dollars population salestax decade
0 AL 1970 33098 72.341130 2.370 3450 4.0 1970
1 AL 1971 37598 69.304600 2.370 3497 4.0 1970
2 AL 1972 42719 67.149190 2.370 3539 4.0 1970
3 AL 1973 46203 63.217026 2.370 3580 4.0 1970
4 AL 1974 49769 56.933796 2.370 3627 4.0 1970
... ... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0 1990
1704 WY 2000 12595 0.309491 0.045 494 4.0 2000
1705 WY 2001 12808 0.300928 0.045 494 4.0 2000
1706 WY 2002 13191 0.296244 0.045 499 4.0 2000
1707 WY 2003 15535 0.289643 0.045 501 4.0 2000

1708 rows × 8 columns

Now, since decade is also a categorical variable, we can group by both the state and the decade by passing a list of column names.

df.groupby(['st_name','decade']).mean()
year c_beer beer_tax btax_dollars population salestax
st_name decade
AK 1970 1974.5 7715.900 10.603964 0.56250 360.40 0.0
1980 1984.5 13321.300 6.391171 0.70688 498.00 0.0
1990 1994.5 14393.600 5.005872 0.78750 598.50 0.0
2000 2001.5 14537.500 6.272505 1.19250 637.75 0.0
AL 1970 1974.5 50665.600 55.847542 2.37000 3658.50 4.0
... ... ... ... ... ... ... ...
WV 2000 2001.5 40984.500 2.653142 0.39920 1806.00 6.0
WY 1970 1975.5 11349.625 0.989311 0.04500 393.25 3.0
1980 1984.5 13308.000 0.516437 0.04500 488.40 3.0
1990 1994.5 11647.000 0.357562 0.04500 477.70 3.4
2000 2001.5 13532.250 0.299077 0.04500 497.00 4.0

203 rows × 6 columns

Additionally, we can also aggregate using multiple aggregation functions at once by passing a list to .agg().

df.groupby(['st_name', 'decade']).agg(['min', 'max'])
year c_beer beer_tax btax_dollars population salestax
min max min max min max min max min max min max
st_name decade
AK 1970 1970 1979 5372 9848 7.340821 13.735660 0.5625 0.5625 304 405 0.0 0.0
1980 1980 1989 10350 14711 5.522732 7.181146 0.5625 0.7875 405 547 0.0 0.0
1990 1990 1999 13014 15419 4.478518 5.708654 0.7875 0.7875 553 625 0.0 0.0
2000 2000 2003 14256 15008 4.147421 12.396732 0.7875 2.4075 628 649 0.0 0.0
AL 1970 1970 1979 33098 63999 38.661655 72.341130 2.3700 2.3700 3450 3866 4.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
WV 2000 2000 2003 39513 42075 2.569457 2.745529 0.3992 0.3992 1802 1810 6.0 6.0
WY 1970 1972 1979 7950 14246 0.734082 1.274985 0.0450 0.0450 347 454 3.0 3.0
1980 1980 1989 10936 15730 0.429793 0.646776 0.0450 0.0450 458 510 3.0 3.0
1990 1990 1999 11253 12423 0.319894 0.407761 0.0450 0.0450 454 492 3.0 4.0
2000 2000 2003 12595 15535 0.289643 0.309491 0.0450 0.0450 494 501 4.0 4.0

203 rows × 12 columns
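If the two-level column labels above feel unwieldy, “named aggregation” (available since pandas 0.25) lets you choose the output column names yourself. A minimal sketch:

# Each keyword argument names an output column: (input column, aggregation)
df.groupby(['st_name', 'decade']).agg(
    min_beer=('c_beer', 'min'),
    max_beer=('c_beer', 'max'),
)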

Pivot Tables#

While we could group by multiple columns above, the resulting DataFrame was honestly a bit messy. If you’re interested in the relationship between two categorical variables and how they affect a third, numerical variable, a pivot table is often best. For example, let us use a pivot table to visualize the average beer consumption across the decades for each state.

pd.pivot_table(data=df, index='st_name', columns='decade', 
               values='c_beer', aggfunc=np.mean).head(12)
decade 1970 1980 1990 2000
st_name
AK 7715.9 13321.3 14393.6 14537.50
AL 50665.6 73146.3 85399.2 94585.75
AR 30060.5 43013.7 48867.0 51736.50
AZ 57785.9 91379.6 115175.2 136681.75
CA 460041.2 630967.9 637368.5 653313.75
CO 60350.5 83809.1 90452.1 108031.75
CT 55019.5 63606.1 58895.2 58480.50
DC 17258.4 17290.4 15789.8 14452.25
DE 12163.0 16415.8 17936.4 19900.25
FL 186083.9 309504.6 364002.4 399874.25
GA 80081.0 120273.0 153751.9 178871.50
HI NaN 30242.5 30353.2 28954.00

As you can see, in the pd.pivot_table() function:

  • The data parameter tells the function which DataFrame to get the data from.

  • The index parameter says which categorical variable to use to form the index.

  • The columns parameter says which categorical variable to use to form the columns.

  • The values parameter says which numerical variable to aggregate.

  • The aggfunc parameter says which function to use to perform the aggregation.

  • Optionally, you can set the fill_value parameter to the value you want to replace the NaNs with. For example, as we don’t have data on Hawaii in the 1970s, you can see that datapoint is currently a NaN value. A sketch is given below.
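Here is a sketch that fills the Hawaii NaN with 0 (whether 0 is a sensible fill depends on your analysis):

pd.pivot_table(data=df, index='st_name', columns='decade',
               values='c_beer', aggfunc=np.mean, fill_value=0).head(12)

As an aside, this pivot table is equivalent to df.groupby(['st_name', 'decade'])['c_beer'].mean().unstack().fillna(0), which is a nice way to see how pivot tables and groupbys relate.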