{ "cells": [ { "cell_type": "code", "execution_count": 2, "id": "c50a3092-6573-4990-8d06-2ae6bd0748ff", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "8258d89f-b23d-41d8-a46e-5b8a6d968407", "metadata": {}, "source": [ "# GroupBys and Pivot Tables" ] }, { "cell_type": "markdown", "id": "ece57ddf-1af5-4dbf-9dbb-adbd9c55893f", "metadata": {}, "source": [ "This subchapter goes over groupby's and pivot table's, two incredibly useful `pandas` methods.\n", "\n", "To start, let's load in the same dataset as the first subchapter in the `Pandas` chapter. As a reminder, this dataset has beer sales across 50 continental states in the US. It is sourced from [_Salience and Taxation: Theory and Evidence_](https://www.aeaweb.org/articles?id=10.1257/aer.99.4.1145) by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:\n", "- `st_name`: the state abbreviation\n", "- `year`: the year the data was recorded\n", "- `c_beer`: the quantity of beer consumed, in thousands of gallons\n", "- `beer_tax`: the ad valorem tax, as a percentage\n", "- `btax_dollars`: the excise tax, represented in dollars per case (24 cans) of beer \n", "- `population`: the population of the state, in thousands\n", "- `salestax`: the sales tax percentage" ] }, { "cell_type": "code", "execution_count": 3, "id": "6ee4a46a-00e3-422a-bd95-84a6011fa873", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
0AL19703309872.3411302.37034504.0
1AL19713759869.3046002.37034974.0
2AL19724271967.1491902.37035394.0
3AL19734620363.2170262.37035804.0
4AL19744976956.9337962.37036274.0
........................
1703WY1999124230.3198940.0454924.0
1704WY2000125950.3094910.0454944.0
1705WY2001128080.3009280.0454944.0
1706WY2002131910.2962440.0454994.0
1707WY2003155350.2896430.0455014.0
\n", "

1708 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.370 3450 4.0\n", "1 AL 1971 37598 69.304600 2.370 3497 4.0\n", "2 AL 1972 42719 67.149190 2.370 3539 4.0\n", "3 AL 1973 46203 63.217026 2.370 3580 4.0\n", "4 AL 1974 49769 56.933796 2.370 3627 4.0\n", "... ... ... ... ... ... ... ...\n", "1703 WY 1999 12423 0.319894 0.045 492 4.0\n", "1704 WY 2000 12595 0.309491 0.045 494 4.0\n", "1705 WY 2001 12808 0.300928 0.045 494 4.0\n", "1706 WY 2002 13191 0.296244 0.045 499 4.0\n", "1707 WY 2003 15535 0.289643 0.045 501 4.0\n", "\n", "[1708 rows x 7 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/beer_tax.csv')\n", "df" ] }, { "cell_type": "markdown", "id": "2c81424c-e09d-433f-a57c-233c8ad1daa1", "metadata": {}, "source": [ "## GroupBys" ] }, { "cell_type": "markdown", "id": "8b1d8d65-c48b-49be-9841-99e3bbdfe13a", "metadata": {}, "source": [ "Let's say we're interested in how the average tax differs by state. Right now, we have a fair amount of data for every state, so it'd great if we could combine the data for each state somehow. `.groupby()` helps us do exactly that.\n", "\n", "When we groupby on a column 'x', we create `groupby objects` for each unique value in the column. Each groupby object contains all the data corresponding to that unique value from the original DataFrame. The code below visualizes the first 5 rows from the groupby objects corresponding to the first 5 states in our dataset." ] }, { "cell_type": "code", "execution_count": 4, "id": "b0e0a166-ad9f-4c01-8992-b3f78c8973c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first 5 rows from the subframe for AK are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "34 AK 1970 5372 13.735660 0.5625 304 0.0\n", "35 AK 1971 6336 13.159102 0.5625 316 0.0\n", "36 AK 1972 6038 12.749847 0.5625 324 0.0\n", "37 AK 1973 6453 12.003234 0.5625 331 0.0\n", "38 AK 1974 7598 10.810215 0.5625 341 0.0\n", "________________________________________________________________________\n", "The first 5 rows from the subframe for AL are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.37 3450 4.0\n", "1 AL 1971 37598 69.304600 2.37 3497 4.0\n", "2 AL 1972 42719 67.149190 2.37 3539 4.0\n", "3 AL 1973 46203 63.217026 2.37 3580 4.0\n", "4 AL 1974 49769 56.933796 2.37 3627 4.0\n", "________________________________________________________________________\n", "The first 5 rows from the subframe for AR are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "102 AR 1970 22378 16.632357 0.5449 1930 3.0\n", "103 AR 1971 25020 15.934210 0.5449 1972 3.0\n", "104 AR 1972 25614 15.438649 0.5449 2019 3.0\n", "105 AR 1973 27946 14.534582 0.5449 2059 3.0\n", "106 AR 1974 30915 13.089970 0.5449 2101 3.0\n", "________________________________________________________________________\n", "The first 5 rows from the subframe for AZ are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "68 AZ 1970 38604 5.494264 0.18 1795 3.0\n", "69 AZ 1971 41837 5.263641 0.18 1896 3.0\n", "70 AZ 1972 47949 5.099939 0.18 2008 3.0\n", "71 AZ 1973 53380 4.801294 0.18 2124 3.0\n", "72 AZ 1974 58188 4.324086 0.18 2223 3.0\n", "________________________________________________________________________\n", "The first 5 rows from the subframe for CA are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "136 CA 1970 363645 2.747132 0.09 20023 5.000\n", "137 CA 1971 380397 2.631820 0.09 20346 5.000\n", "138 CA 1972 401928 2.549970 0.09 20585 5.000\n", "139 CA 1973 417463 2.400647 0.09 20869 5.167\n", "140 CA 1974 464237 2.162043 0.09 21174 5.250\n", "________________________________________________________________________\n", "The first 5 rows from the subframe for CO are: \n", " st_name year c_beer beer_tax btax_dollars population salestax\n", "170 CO 1970 42145 4.120698 0.135 2224 3.0\n", "171 CO 1971 45359 3.947731 0.135 2304 3.0\n", "172 CO 1972 50444 3.824955 0.135 2405 3.0\n", "173 CO 1973 55332 3.600970 0.135 2496 3.0\n", "174 CO 1974 60162 3.243065 0.135 2541 3.0\n", "________________________________________________________________________\n" ] } ], "source": [ "i = 0\n", "for name, frame in df.groupby('st_name'):\n", " if i > 5:\n", " continue\n", " print(f'The first 5 rows from the subframe for {name} are: ')\n", " print(frame.head())\n", " print('________________________________________________________________________')\n", " i += 1" ] }, { "cell_type": "markdown", "id": "074e3b66-9be6-465f-87cb-d4aae6e4415f", "metadata": {}, "source": [ "### Aggregate" ] }, { "cell_type": "markdown", "id": "b295c5fe-a133-4757-9793-86f62f080abf", "metadata": {}, "source": [ "Now, we wish to collapse all the groupby objects into a single row, and return a DataFrame that has all the single rows across all groupby objects. We can do that with the help of an aggregation function which tells `pandas` how to combine the rows. For example, the aggregation function `.mean()` tells `pandas` to take the mean of the rows in the groupby object. We do this below.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "4a31e48e-4c58-486b-b277-5956bdbe6556", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearc_beerbeer_taxbtax_dollarspopulationsalestax
st_name
AK1986.512131.1176477.2088260.745259503.5294120.000000
AL1986.572660.41176531.8172662.3700004022.2941184.000000
AR1986.541951.7058827.3152870.5449002359.4411763.813235
AZ1986.593827.4705883.2823130.2832353490.0000004.494118
CA1986.5585206.7941182.1890790.22235327467.5882356.320794
CO1986.581713.0588242.1898920.1713973282.9411763.013235
CT1986.559092.0588243.4576700.3075913228.5000006.617647
DC1986.516505.7352942.3200710.180018633.6764716.823529
DE1986.516022.1470592.5575500.226297659.9705880.000000
FL1986.5299864.88235315.9816121.29058811903.1764715.073529
GA1986.5125192.50000030.6090172.2800006346.9411763.411765
HI1994.530017.66666714.0830351.9944001168.8333334.000000
IA1987.567063.1875004.4021030.3765252868.0937503.968750
ID1986.522516.9705884.5309400.3375001025.8235294.117647
IL1986.5271348.9705882.1822290.16770011664.4705885.595588
IN1986.5113699.1764713.0782380.2414945619.7058824.382353
KS1987.549179.3750004.7429100.3934002477.7187506.625000
KY1986.569095.9705882.4366390.1815003712.7647065.382353
LA1986.5101652.47058811.1964560.8340004223.0882353.529412
MA1986.5132109.1764713.0351180.2328855973.1470594.588235
MD1986.598233.8529412.5436610.1965444630.0588244.764706
ME1986.525878.1470598.6738730.6943031174.1764715.235294
MI1986.5211427.1176476.1392560.4573009380.5294124.558824
MN1987.598194.8437503.7769080.3127814369.3125006.984375
MO1987.5120355.6562501.6735110.1350005147.3125003.796875
MS1986.554309.26470612.9406410.9647682587.6176475.897059
MT1986.523113.2647063.6216420.287794815.2941180.000000
NC1986.5126038.91176516.1054771.1993186554.4411763.367647
ND1986.515923.8235294.8330030.360000647.5882355.411765
NE1986.539524.6176474.3987520.3909941603.3529413.888471
NH1986.533263.9705886.0428330.5368621025.8235290.000000
NJ1986.5151812.0588241.5649560.1497217746.1176475.617647
NM1986.538953.9705884.7130160.4728351467.3235294.397059
NV1986.539440.9411762.1621500.1758651170.2352945.112132
NY1986.5353700.8235292.3004390.22137418200.6176473.941176
OH1986.5249413.2941184.9138850.37143510950.1470594.617647
OK1986.557819.44117610.6338240.8313153131.6470593.235294
OR1986.562440.2352942.0456880.1699762810.2352940.000000
PA1986.5285304.9411762.4165010.18000011993.8823536.000000
RI1986.523093.7941182.7396640.217518991.5588246.176471
SC1986.575692.17647123.1984141.7280003387.9117654.558824
SD1986.515654.6764718.1123610.608353708.1764714.029412
TN1986.597898.3529413.5314230.2715764866.1470595.014706
TX1986.5444450.5588245.3354100.41331216405.7647065.003676
UT1986.522289.3529416.0289310.5750791688.3235294.834559
VA1986.5128329.1470597.7456030.6058245949.7647064.176471
VT1986.513341.6764717.7503100.584868540.7647063.970588
WA1986.599727.6176472.8962680.2912534664.9411765.814706
WI1986.5147105.1764711.9493110.1452004891.0000004.617647
WV1986.535606.9411765.3592630.3992001847.0588244.735294
WY1987.512327.3750000.5578370.045000462.3437503.250000
\n", "
" ], "text/plain": [ " year c_beer beer_tax btax_dollars population \\\n", "st_name \n", "AK 1986.5 12131.117647 7.208826 0.745259 503.529412 \n", "AL 1986.5 72660.411765 31.817266 2.370000 4022.294118 \n", "AR 1986.5 41951.705882 7.315287 0.544900 2359.441176 \n", "AZ 1986.5 93827.470588 3.282313 0.283235 3490.000000 \n", "CA 1986.5 585206.794118 2.189079 0.222353 27467.588235 \n", "CO 1986.5 81713.058824 2.189892 0.171397 3282.941176 \n", "CT 1986.5 59092.058824 3.457670 0.307591 3228.500000 \n", "DC 1986.5 16505.735294 2.320071 0.180018 633.676471 \n", "DE 1986.5 16022.147059 2.557550 0.226297 659.970588 \n", "FL 1986.5 299864.882353 15.981612 1.290588 11903.176471 \n", "GA 1986.5 125192.500000 30.609017 2.280000 6346.941176 \n", "HI 1994.5 30017.666667 14.083035 1.994400 1168.833333 \n", "IA 1987.5 67063.187500 4.402103 0.376525 2868.093750 \n", "ID 1986.5 22516.970588 4.530940 0.337500 1025.823529 \n", "IL 1986.5 271348.970588 2.182229 0.167700 11664.470588 \n", "IN 1986.5 113699.176471 3.078238 0.241494 5619.705882 \n", "KS 1987.5 49179.375000 4.742910 0.393400 2477.718750 \n", "KY 1986.5 69095.970588 2.436639 0.181500 3712.764706 \n", "LA 1986.5 101652.470588 11.196456 0.834000 4223.088235 \n", "MA 1986.5 132109.176471 3.035118 0.232885 5973.147059 \n", "MD 1986.5 98233.852941 2.543661 0.196544 4630.058824 \n", "ME 1986.5 25878.147059 8.673873 0.694303 1174.176471 \n", "MI 1986.5 211427.117647 6.139256 0.457300 9380.529412 \n", "MN 1987.5 98194.843750 3.776908 0.312781 4369.312500 \n", "MO 1987.5 120355.656250 1.673511 0.135000 5147.312500 \n", "MS 1986.5 54309.264706 12.940641 0.964768 2587.617647 \n", "MT 1986.5 23113.264706 3.621642 0.287794 815.294118 \n", "NC 1986.5 126038.911765 16.105477 1.199318 6554.441176 \n", "ND 1986.5 15923.823529 4.833003 0.360000 647.588235 \n", "NE 1986.5 39524.617647 4.398752 0.390994 1603.352941 \n", "NH 1986.5 33263.970588 6.042833 0.536862 1025.823529 \n", "NJ 1986.5 151812.058824 1.564956 0.149721 7746.117647 \n", "NM 1986.5 38953.970588 4.713016 0.472835 1467.323529 \n", "NV 1986.5 39440.941176 2.162150 0.175865 1170.235294 \n", "NY 1986.5 353700.823529 2.300439 0.221374 18200.617647 \n", "OH 1986.5 249413.294118 4.913885 0.371435 10950.147059 \n", "OK 1986.5 57819.441176 10.633824 0.831315 3131.647059 \n", "OR 1986.5 62440.235294 2.045688 0.169976 2810.235294 \n", "PA 1986.5 285304.941176 2.416501 0.180000 11993.882353 \n", "RI 1986.5 23093.794118 2.739664 0.217518 991.558824 \n", "SC 1986.5 75692.176471 23.198414 1.728000 3387.911765 \n", "SD 1986.5 15654.676471 8.112361 0.608353 708.176471 \n", "TN 1986.5 97898.352941 3.531423 0.271576 4866.147059 \n", "TX 1986.5 444450.558824 5.335410 0.413312 16405.764706 \n", "UT 1986.5 22289.352941 6.028931 0.575079 1688.323529 \n", "VA 1986.5 128329.147059 7.745603 0.605824 5949.764706 \n", "VT 1986.5 13341.676471 7.750310 0.584868 540.764706 \n", "WA 1986.5 99727.617647 2.896268 0.291253 4664.941176 \n", "WI 1986.5 147105.176471 1.949311 0.145200 4891.000000 \n", "WV 1986.5 35606.941176 5.359263 0.399200 1847.058824 \n", "WY 1987.5 12327.375000 0.557837 0.045000 462.343750 \n", "\n", " salestax \n", "st_name \n", "AK 0.000000 \n", "AL 4.000000 \n", "AR 3.813235 \n", "AZ 4.494118 \n", "CA 6.320794 \n", "CO 3.013235 \n", "CT 6.617647 \n", "DC 6.823529 \n", "DE 0.000000 \n", "FL 5.073529 \n", "GA 3.411765 \n", "HI 4.000000 \n", "IA 3.968750 \n", "ID 4.117647 \n", "IL 5.595588 \n", "IN 4.382353 \n", "KS 6.625000 \n", "KY 5.382353 \n", "LA 3.529412 \n", "MA 4.588235 \n", "MD 4.764706 \n", "ME 5.235294 \n", "MI 4.558824 \n", "MN 6.984375 \n", "MO 3.796875 \n", "MS 5.897059 \n", "MT 0.000000 \n", "NC 3.367647 \n", "ND 5.411765 \n", "NE 3.888471 \n", "NH 0.000000 \n", "NJ 5.617647 \n", "NM 4.397059 \n", "NV 5.112132 \n", "NY 3.941176 \n", "OH 4.617647 \n", "OK 3.235294 \n", "OR 0.000000 \n", "PA 6.000000 \n", "RI 6.176471 \n", "SC 4.558824 \n", "SD 4.029412 \n", "TN 5.014706 \n", "TX 5.003676 \n", "UT 4.834559 \n", "VA 4.176471 \n", "VT 3.970588 \n", "WA 5.814706 \n", "WI 4.617647 \n", "WV 4.735294 \n", "WY 3.250000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('st_name').mean()" ] }, { "cell_type": "markdown", "id": "e3d33a1b-8543-46c9-8ec2-b12e9f55ece8", "metadata": {}, "source": [ "You can visualize the entire process (although using only a subset of the data) [here](https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0A%0Acsv%20%3D%20'''%0A,st_name,year,c_beer,beer_tax,btax_dollars,population,salestax%0A0,AL,1970,33098,72.34113,2.37,3450,4.0%0A1,AL,1971,37598,69.3046,2.37,3497,4.0%0A2,AL,1972,42719,67.14919,2.37,3539,4.0%0A34,AK,1970,5372,13.73566,0.5625,304,0.0%0A35,AK,1971,6336,13.159101500000002,0.5625,316,0.0%0A36,AK,1972,6038,12.749847,0.5625,324,0.0%0A68,AZ,1970,38604,5.494264,0.18,1795,3.0%0A69,AZ,1971,41837,5.263641000000001,0.18,1896,3.0%0A70,AZ,1972,47949,5.0999393,0.18,2008,3.0%0A136,CA,1970,363645,2.747132,0.09,20023,5.0%0A137,CA,1971,380397,2.6318203999999996,0.09,20346,5.0%0A138,CA,1972,401928,2.5499697,0.09,20585,5.0%0A'''%0A%0Adf%20%3D%20pd.read_csv%28io.StringIO%28csv%29%29%0Adf.groupby%28'st_name'%29.mean%28%29&d=2024-01-04&lang=py&v=v1). In general, we encourage you to play around with [Pandas Tutor](https://pandastutor.com/) to visualize how your code works!\n", "\n", "There are a lot of different `agg` functions you can use! We've given a few below, but won't be surprised if there are more!\n", "\n", "Built-in aggregation functions:\n", "- `.mean()`\n", "- `.median()`\n", "- `.sum()`\n", "- `.count()` \n", " - Note: It may appear as if `.value_counts()` and `.count()` return the same data. However, while `.value_counts()` returns a Series sorted from most common to least, the aggregation function `.count()` returns a DataFrame with the same ordering as the index.\n", "- `.max()`\n", "- `.min()`\n", "- `.std()`\n", "- `.var()`\n", "\n", "You can also use other functions, such as those defined by `NumPy`. Examples include:\n", "- `.agg(np.mean)`\n", "- `.agg(np.prod)`\n", "- `.agg(np.cumsum)`\n", " - returns the cumulative sum; read more [here](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html#numpy.cumsum)\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "bcb04d1c-c837-4d8f-9b7f-152d00288574", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearc_beerbeer_taxbtax_dollarspopulationsalestax
019703309872.3411302.37034504.0
1394170696141.6457304.74069478.0
25913113415208.7949207.1101048612.0
37886159618272.0119469.4801406616.0
49860209387328.94574211.8501769320.0
.....................
17035559434034716.6544791.2601280788.0
17045759435294216.9639701.3051330192.0
17055959536575017.2648991.3501379596.0
17066159737894117.5611431.39514294100.0
17076360039447617.8507861.44014795104.0
\n", "

1708 rows × 6 columns

\n", "
" ], "text/plain": [ " year c_beer beer_tax btax_dollars population salestax\n", "0 1970 33098 72.341130 2.370 3450 4.0\n", "1 3941 70696 141.645730 4.740 6947 8.0\n", "2 5913 113415 208.794920 7.110 10486 12.0\n", "3 7886 159618 272.011946 9.480 14066 16.0\n", "4 9860 209387 328.945742 11.850 17693 20.0\n", "... ... ... ... ... ... ...\n", "1703 55594 340347 16.654479 1.260 12807 88.0\n", "1704 57594 352942 16.963970 1.305 13301 92.0\n", "1705 59595 365750 17.264899 1.350 13795 96.0\n", "1706 61597 378941 17.561143 1.395 14294 100.0\n", "1707 63600 394476 17.850786 1.440 14795 104.0\n", "\n", "[1708 rows x 6 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('st_name').agg(np.cumsum)" ] }, { "cell_type": "markdown", "id": "19f0c4de-0c8c-47b5-805e-940f946cb6f4", "metadata": {}, "source": [ "Finally, if you like, you can also define your own aggregation function! An example is given below, where we define `last_10_vs_first_10` to return the average value in the last 10 years minus the average value in the first 10 years. Remember, you aggregation function must be able to aggregate columns of data into a single value." ] }, { "cell_type": "code", "execution_count": 7, "id": "5767b3d1-f1f2-4a4a-b6f5-6a8190f69deb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearc_beerbeer_taxbtax_dollarspopulationsalestax
st_name
AK24.06799.2-5.2559190.387000262.10.0000
AL24.041128.7-38.8667840.000000740.40.0000
AR24.020500.6-8.9360810.000000514.81.7150
AZ24.069478.1-1.6622300.1800002714.21.6800
CA24.0173227.71.1034020.36000011827.61.6833
CO24.040179.1-2.1814360.0292501542.9-0.0300
CT24.02883.6-1.1917670.262600311.3-0.2500
DC24.0-2372.4-2.3971730.039200-136.62.2000
DE24.06749.2-0.8995090.206800187.70.0000
FL24.0194335.7-13.3663420.4200007388.02.0000
GA24.088146.5-37.3908330.0000002928.61.0000
HI8.0-1523.8-2.8055780.06979090.00.0000
IA22.04125.4-3.3403080.11250010.32.0000
ID24.06536.0-5.5348270.000000438.72.0000
IL24.033124.9-2.3524310.0346801011.31.4500
IN24.025558.0-3.0010280.051340651.01.9000
KS22.07374.4-4.4426350.037120346.34.0000
KY24.017057.1-2.9765070.000000546.11.0000
LA24.035240.5-13.6771730.000000556.61.1000
MA24.01161.4-3.3150960.022490537.71.4000
MD24.05272.9-2.7262270.0202501139.40.8000
ME24.03304.3-7.6126150.225000202.70.6000
MI24.0-4159.0-7.4994860.000000767.31.9000
MN22.018407.3-3.5088410.043600862.34.5000
MO22.027237.9-1.7770100.000000681.21.1775
MS24.029677.5-15.7484290.000000422.02.0000
MT24.03531.8-3.3345710.070710150.70.0000
NC24.081130.0-19.694798-0.0023192354.61.0500
ND24.02483.5-5.9038160.0000004.63.2000
NE24.05797.6-1.5113460.292690159.32.3250
NH24.011805.3-2.1056250.374620390.70.0000
NJ24.02076.70.1695460.1951001005.81.2000
NM24.019793.12.3140720.735750642.71.0250
NV24.040810.7-1.7303020.0675001266.53.3500
NY24.0-43471.50.1558260.247060804.50.2000
OH24.043336.6-5.6288160.038880554.41.0000
OK24.022580.2-10.6070370.181300642.72.5000
OR24.022796.4-1.4367050.0636601035.00.0000
PA24.012249.7-2.9519080.000000376.70.0000
RI24.0-195.2-2.5307560.07320084.71.7000
SC24.050838.6-28.3383160.0000001076.51.0000
SD24.06363.1-9.6725140.01815068.30.0000
TN24.044875.0-3.7700870.0389871326.52.4000
TX24.0224717.2-5.5574960.0742007861.62.4000
UT24.010895.30.4184560.573400943.30.3125
VA24.050298.4-7.7348890.1026001964.40.7000
VT24.02383.1-8.9825390.033800126.32.0000
WA24.033204.61.3068920.4351412147.61.6200
WI24.09254.4-2.3812060.000000754.61.0000
WV24.011053.0-6.5466760.000000-29.43.0000
WY22.0369.9-0.5923370.00000080.00.8000
\n", "
" ], "text/plain": [ " year c_beer beer_tax btax_dollars population salestax\n", "st_name \n", "AK 24.0 6799.2 -5.255919 0.387000 262.1 0.0000\n", "AL 24.0 41128.7 -38.866784 0.000000 740.4 0.0000\n", "AR 24.0 20500.6 -8.936081 0.000000 514.8 1.7150\n", "AZ 24.0 69478.1 -1.662230 0.180000 2714.2 1.6800\n", "CA 24.0 173227.7 1.103402 0.360000 11827.6 1.6833\n", "CO 24.0 40179.1 -2.181436 0.029250 1542.9 -0.0300\n", "CT 24.0 2883.6 -1.191767 0.262600 311.3 -0.2500\n", "DC 24.0 -2372.4 -2.397173 0.039200 -136.6 2.2000\n", "DE 24.0 6749.2 -0.899509 0.206800 187.7 0.0000\n", "FL 24.0 194335.7 -13.366342 0.420000 7388.0 2.0000\n", "GA 24.0 88146.5 -37.390833 0.000000 2928.6 1.0000\n", "HI 8.0 -1523.8 -2.805578 0.069790 90.0 0.0000\n", "IA 22.0 4125.4 -3.340308 0.112500 10.3 2.0000\n", "ID 24.0 6536.0 -5.534827 0.000000 438.7 2.0000\n", "IL 24.0 33124.9 -2.352431 0.034680 1011.3 1.4500\n", "IN 24.0 25558.0 -3.001028 0.051340 651.0 1.9000\n", "KS 22.0 7374.4 -4.442635 0.037120 346.3 4.0000\n", "KY 24.0 17057.1 -2.976507 0.000000 546.1 1.0000\n", "LA 24.0 35240.5 -13.677173 0.000000 556.6 1.1000\n", "MA 24.0 1161.4 -3.315096 0.022490 537.7 1.4000\n", "MD 24.0 5272.9 -2.726227 0.020250 1139.4 0.8000\n", "ME 24.0 3304.3 -7.612615 0.225000 202.7 0.6000\n", "MI 24.0 -4159.0 -7.499486 0.000000 767.3 1.9000\n", "MN 22.0 18407.3 -3.508841 0.043600 862.3 4.5000\n", "MO 22.0 27237.9 -1.777010 0.000000 681.2 1.1775\n", "MS 24.0 29677.5 -15.748429 0.000000 422.0 2.0000\n", "MT 24.0 3531.8 -3.334571 0.070710 150.7 0.0000\n", "NC 24.0 81130.0 -19.694798 -0.002319 2354.6 1.0500\n", "ND 24.0 2483.5 -5.903816 0.000000 4.6 3.2000\n", "NE 24.0 5797.6 -1.511346 0.292690 159.3 2.3250\n", "NH 24.0 11805.3 -2.105625 0.374620 390.7 0.0000\n", "NJ 24.0 2076.7 0.169546 0.195100 1005.8 1.2000\n", "NM 24.0 19793.1 2.314072 0.735750 642.7 1.0250\n", "NV 24.0 40810.7 -1.730302 0.067500 1266.5 3.3500\n", "NY 24.0 -43471.5 0.155826 0.247060 804.5 0.2000\n", "OH 24.0 43336.6 -5.628816 0.038880 554.4 1.0000\n", "OK 24.0 22580.2 -10.607037 0.181300 642.7 2.5000\n", "OR 24.0 22796.4 -1.436705 0.063660 1035.0 0.0000\n", "PA 24.0 12249.7 -2.951908 0.000000 376.7 0.0000\n", "RI 24.0 -195.2 -2.530756 0.073200 84.7 1.7000\n", "SC 24.0 50838.6 -28.338316 0.000000 1076.5 1.0000\n", "SD 24.0 6363.1 -9.672514 0.018150 68.3 0.0000\n", "TN 24.0 44875.0 -3.770087 0.038987 1326.5 2.4000\n", "TX 24.0 224717.2 -5.557496 0.074200 7861.6 2.4000\n", "UT 24.0 10895.3 0.418456 0.573400 943.3 0.3125\n", "VA 24.0 50298.4 -7.734889 0.102600 1964.4 0.7000\n", "VT 24.0 2383.1 -8.982539 0.033800 126.3 2.0000\n", "WA 24.0 33204.6 1.306892 0.435141 2147.6 1.6200\n", "WI 24.0 9254.4 -2.381206 0.000000 754.6 1.0000\n", "WV 24.0 11053.0 -6.546676 0.000000 -29.4 3.0000\n", "WY 22.0 369.9 -0.592337 0.000000 80.0 0.8000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def last_10_vs_first_10(obj):\n", " return obj.iloc[-10:].mean() - obj.iloc[:10].mean()\n", "\n", "df.groupby('st_name').agg(last_10_vs_first_10)" ] }, { "cell_type": "markdown", "id": "56762743-91d0-4004-81a9-4611b590ec08", "metadata": {}, "source": [ "As a general tip, whenever you're trying to calculate differences across a categorical variable, consider whether groupby's could be helpful. Then, determine which column you would groupby on (normally the categorical column you're interested in). Finally, determine which aggregation function may be most helpful.\n", "\n", "Moreover, if we only want to see the results of the `.groupby()` on a single column, it is common practice to select that column before the aggregation function to minimize computing speed. For example, if we just want to see the median beer consumption for each state, we could do `df.groupby('st_name')['c_beer'].median()`." ] }, { "cell_type": "code", "execution_count": 8, "id": "547cd9be-bdd5-4daa-9292-7943c27c23ad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "st_name\n", "AK 13666.5\n", "AL 74930.5\n", "AR 44731.0\n", "AZ 98964.5\n", "CA 620989.5\n", "CO 83391.0\n", "CT 58610.5\n", "DC 16862.5\n", "DE 16678.0\n", "FL 327415.0\n", "GA 131193.0\n", "HI 29817.0\n", "IA 66664.0\n", "ID 23272.5\n", "IL 280701.5\n", "IN 119674.5\n", "KS 49094.0\n", "KY 71587.5\n", "LA 107196.0\n", "MA 131876.5\n", "MD 98215.5\n", "ME 26079.0\n", "MI 210840.0\n", "MN 98980.5\n", "MO 122805.0\n", "MS 56514.0\n", "MT 23274.0\n", "NC 127415.0\n", "ND 15935.5\n", "NE 39395.5\n", "NH 35645.5\n", "NJ 150161.5\n", "NM 41557.5\n", "NV 36316.0\n", "NY 358784.0\n", "OH 257559.5\n", "OK 60139.0\n", "OR 63020.0\n", "PA 285790.5\n", "RI 22784.5\n", "SC 78973.0\n", "SD 15490.0\n", "TN 101125.0\n", "TX 472811.5\n", "UT 22402.5\n", "VA 138821.0\n", "VT 13601.0\n", "WA 101889.0\n", "WI 147526.5\n", "WV 37934.5\n", "WY 12071.0\n", "Name: c_beer, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('st_name')['c_beer'].median()" ] }, { "cell_type": "markdown", "id": "5853ff6d-b94b-40fe-8cf8-eaf5469cfb8a", "metadata": {}, "source": [ "Selecting the relevant columns like this is especially important if you may have other strings in your DataFrame and you're attempting to do an aggregation function which doesn't work with strings (for example, it makes no sense to try to take the mean of strings). If you try to use an aggregation function on a type of data it cannot work with, your code will error.\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "73cec168-4f12-4527-9e7d-a416ff86207b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
0AL19703309872.3411302.37034504.0
1AL19713759869.3046002.37034974.0
2AL19724271967.1491902.37035394.0
3AL19734620363.2170262.37035804.0
4AL19744976956.9337962.37036274.0
........................
1703WY1999124230.3198940.0454924.0
1704WY2000125950.3094910.0454944.0
1705WY2001128080.3009280.0454944.0
1706WY2002131910.2962440.0454994.0
1707WY2003155350.2896430.0455014.0
\n", "

1708 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.370 3450 4.0\n", "1 AL 1971 37598 69.304600 2.370 3497 4.0\n", "2 AL 1972 42719 67.149190 2.370 3539 4.0\n", "3 AL 1973 46203 63.217026 2.370 3580 4.0\n", "4 AL 1974 49769 56.933796 2.370 3627 4.0\n", "... ... ... ... ... ... ... ...\n", "1703 WY 1999 12423 0.319894 0.045 492 4.0\n", "1704 WY 2000 12595 0.309491 0.045 494 4.0\n", "1705 WY 2001 12808 0.300928 0.045 494 4.0\n", "1706 WY 2002 13191 0.296244 0.045 499 4.0\n", "1707 WY 2003 15535 0.289643 0.045 501 4.0\n", "\n", "[1708 rows x 7 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "f734c515-ef1f-44fb-b6d5-8f59b400224d", "metadata": {}, "source": [ "### Filter\n", "\n", "Aggregate functions collapse all the groupby objects into a single row. Filter functions instead return a `True/False` for each groupby object and only keep the rows for whom the groupby object returned true. For example, let's say we want to keep all the states where the `beer_tax` was larger than 50% at least once. We could accomplish that using the following line of code." ] }, { "cell_type": "code", "execution_count": 17, "id": "7f088748-7678-4c32-81ea-053cee8dd882", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
0AL19703309872.3411302.37034504.0
1AL19713759869.3046002.37034974.0
2AL19724271967.1491902.37035394.0
3AL19734620363.2170262.37035804.0
4AL19744976956.9337962.37036274.0
........................
1365SC199910178212.2839351.72839755.0
1366SC200010411611.8844571.72840235.0
1367SC200110552511.5556371.72840605.0
1368SC200210800011.3757851.72841045.0
1369SC200310305811.1223021.72841475.0
\n", "

102 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.370 3450 4.0\n", "1 AL 1971 37598 69.304600 2.370 3497 4.0\n", "2 AL 1972 42719 67.149190 2.370 3539 4.0\n", "3 AL 1973 46203 63.217026 2.370 3580 4.0\n", "4 AL 1974 49769 56.933796 2.370 3627 4.0\n", "... ... ... ... ... ... ... ...\n", "1365 SC 1999 101782 12.283935 1.728 3975 5.0\n", "1366 SC 2000 104116 11.884457 1.728 4023 5.0\n", "1367 SC 2001 105525 11.555637 1.728 4060 5.0\n", "1368 SC 2002 108000 11.375785 1.728 4104 5.0\n", "1369 SC 2003 103058 11.122302 1.728 4147 5.0\n", "\n", "[102 rows x 7 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('st_name').filter(lambda x: (x['beer_tax'] >= 50).any())" ] }, { "cell_type": "markdown", "id": "3b15cd57-808b-4ce3-8c4b-4e37f5797e7d", "metadata": {}, "source": [ "In the above code, the [`lambda` function](https://www.freecodecamp.org/news/python-lambda-function-explained/#:~:text=They're%20commonly%20referred%20to,to%20use%20the%20function%20once.) defines a Python function that takes in a groupby object named `x`. The function then extracts the `beer_tax` Series from that groupby object, compares it to 50 and returns an array of `True/Falses` indicating whether or not the original value was greater than 50. Then, `.any()` outputs whether any (in other words: one or more) of those comparisons are true. If yes, the lambda function outputs true (meaning the filter keeps the rows from that groupby object) and if no, the lambda function outputs false (meaning the filter removes the rows from that groupby object). As you can see from the code below, only Alabama, Georgia and South Carolina have ever had `beer_tax` larger than 50%." ] }, { "cell_type": "code", "execution_count": 18, "id": "48d225ec-bd63-4f23-841f-1d91d3c7f873", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['AL', 'GA', 'SC'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('st_name').filter(lambda x: (x['beer_tax'] >= 50).any())['st_name'].unique()" ] }, { "cell_type": "markdown", "id": "21d1cd0c-82b7-470c-b177-bbefa853140a", "metadata": {}, "source": [ "### Multiple Columns and Aggregations\n", "\n", "Finally, we can also groupby multiple columns. To demonstrate this, let us first create another categorical variable by adding in a `decade` column representing which decade the data is from." ] }, { "cell_type": "code", "execution_count": 64, "id": "91de7d04-c1de-4fc3-b63f-a3c1feceb939", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestaxdecade
0AL19703309872.3411302.37034504.01970
1AL19713759869.3046002.37034974.01970
2AL19724271967.1491902.37035394.01970
3AL19734620363.2170262.37035804.01970
4AL19744976956.9337962.37036274.01970
...........................
1703WY1999124230.3198940.0454924.01990
1704WY2000125950.3094910.0454944.02000
1705WY2001128080.3009280.0454944.02000
1706WY2002131910.2962440.0454994.02000
1707WY2003155350.2896430.0455014.02000
\n", "

1708 rows × 8 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax \\\n", "0 AL 1970 33098 72.341130 2.370 3450 4.0 \n", "1 AL 1971 37598 69.304600 2.370 3497 4.0 \n", "2 AL 1972 42719 67.149190 2.370 3539 4.0 \n", "3 AL 1973 46203 63.217026 2.370 3580 4.0 \n", "4 AL 1974 49769 56.933796 2.370 3627 4.0 \n", "... ... ... ... ... ... ... ... \n", "1703 WY 1999 12423 0.319894 0.045 492 4.0 \n", "1704 WY 2000 12595 0.309491 0.045 494 4.0 \n", "1705 WY 2001 12808 0.300928 0.045 494 4.0 \n", "1706 WY 2002 13191 0.296244 0.045 499 4.0 \n", "1707 WY 2003 15535 0.289643 0.045 501 4.0 \n", "\n", " decade \n", "0 1970 \n", "1 1970 \n", "2 1970 \n", "3 1970 \n", "4 1970 \n", "... ... \n", "1703 1990 \n", "1704 2000 \n", "1705 2000 \n", "1706 2000 \n", "1707 2000 \n", "\n", "[1708 rows x 8 columns]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['decade'] = df['year'] // 10 * 10\n", "df" ] }, { "cell_type": "markdown", "id": "08db23b6-c9a8-46fe-8815-ec317c334fbb", "metadata": {}, "source": [ "Now, since `decade` is also a categorical variable, we can groupby both the state and the decade." ] }, { "cell_type": "code", "execution_count": 69, "id": "94c3b95a-985f-43da-b103-7a6a96882b26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearc_beerbeer_taxbtax_dollarspopulationsalestax
st_namedecade
AK19701974.57715.90010.6039640.56250360.400.0
19801984.513321.3006.3911710.70688498.000.0
19901994.514393.6005.0058720.78750598.500.0
20002001.514537.5006.2725051.19250637.750.0
AL19701974.550665.60055.8475422.370003658.504.0
........................
WV20002001.540984.5002.6531420.399201806.006.0
WY19701975.511349.6250.9893110.04500393.253.0
19801984.513308.0000.5164370.04500488.403.0
19901994.511647.0000.3575620.04500477.703.4
20002001.513532.2500.2990770.04500497.004.0
\n", "

203 rows × 6 columns

\n", "
" ], "text/plain": [ " year c_beer beer_tax btax_dollars population \\\n", "st_name decade \n", "AK 1970 1974.5 7715.900 10.603964 0.56250 360.40 \n", " 1980 1984.5 13321.300 6.391171 0.70688 498.00 \n", " 1990 1994.5 14393.600 5.005872 0.78750 598.50 \n", " 2000 2001.5 14537.500 6.272505 1.19250 637.75 \n", "AL 1970 1974.5 50665.600 55.847542 2.37000 3658.50 \n", "... ... ... ... ... ... \n", "WV 2000 2001.5 40984.500 2.653142 0.39920 1806.00 \n", "WY 1970 1975.5 11349.625 0.989311 0.04500 393.25 \n", " 1980 1984.5 13308.000 0.516437 0.04500 488.40 \n", " 1990 1994.5 11647.000 0.357562 0.04500 477.70 \n", " 2000 2001.5 13532.250 0.299077 0.04500 497.00 \n", "\n", " salestax \n", "st_name decade \n", "AK 1970 0.0 \n", " 1980 0.0 \n", " 1990 0.0 \n", " 2000 0.0 \n", "AL 1970 4.0 \n", "... ... \n", "WV 2000 6.0 \n", "WY 1970 3.0 \n", " 1980 3.0 \n", " 1990 3.4 \n", " 2000 4.0 \n", "\n", "[203 rows x 6 columns]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['st_name','decade']).mean()" ] }, { "cell_type": "markdown", "id": "9052059c-291c-41ad-9bc4-7815142bda61", "metadata": {}, "source": [ "Additionally, we can also groupby using multiple aggregation functions." ] }, { "cell_type": "code", "execution_count": 70, "id": "31baebd0-57da-4349-a9e2-87e14eff1747", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearc_beerbeer_taxbtax_dollarspopulationsalestax
minmaxminmaxminmaxminmaxminmaxminmax
st_namedecade
AK197019701979537298487.34082113.7356600.56250.56253044050.00.0
19801980198910350147115.5227327.1811460.56250.78754055470.00.0
19901990199913014154194.4785185.7086540.78750.78755536250.00.0
20002000200314256150084.14742112.3967320.78752.40756286490.00.0
AL197019701979330986399938.66165572.3411302.37002.3700345038664.04.0
..........................................
WV20002000200339513420752.5694572.7455290.39920.3992180218106.06.0
WY1970197219797950142460.7340821.2749850.04500.04503474543.03.0
19801980198910936157300.4297930.6467760.04500.04504585103.03.0
19901990199911253124230.3198940.4077610.04500.04504544923.04.0
20002000200312595155350.2896430.3094910.04500.04504945014.04.0
\n", "

203 rows × 12 columns

\n", "
" ], "text/plain": [ " year c_beer beer_tax btax_dollars \\\n", " min max min max min max min \n", "st_name decade \n", "AK 1970 1970 1979 5372 9848 7.340821 13.735660 0.5625 \n", " 1980 1980 1989 10350 14711 5.522732 7.181146 0.5625 \n", " 1990 1990 1999 13014 15419 4.478518 5.708654 0.7875 \n", " 2000 2000 2003 14256 15008 4.147421 12.396732 0.7875 \n", "AL 1970 1970 1979 33098 63999 38.661655 72.341130 2.3700 \n", "... ... ... ... ... ... ... ... \n", "WV 2000 2000 2003 39513 42075 2.569457 2.745529 0.3992 \n", "WY 1970 1972 1979 7950 14246 0.734082 1.274985 0.0450 \n", " 1980 1980 1989 10936 15730 0.429793 0.646776 0.0450 \n", " 1990 1990 1999 11253 12423 0.319894 0.407761 0.0450 \n", " 2000 2000 2003 12595 15535 0.289643 0.309491 0.0450 \n", "\n", " population salestax \n", " max min max min max \n", "st_name decade \n", "AK 1970 0.5625 304 405 0.0 0.0 \n", " 1980 0.7875 405 547 0.0 0.0 \n", " 1990 0.7875 553 625 0.0 0.0 \n", " 2000 2.4075 628 649 0.0 0.0 \n", "AL 1970 2.3700 3450 3866 4.0 4.0 \n", "... ... ... ... ... ... \n", "WV 2000 0.3992 1802 1810 6.0 6.0 \n", "WY 1970 0.0450 347 454 3.0 3.0 \n", " 1980 0.0450 458 510 3.0 3.0 \n", " 1990 0.0450 454 492 3.0 4.0 \n", " 2000 0.0450 494 501 4.0 4.0 \n", "\n", "[203 rows x 12 columns]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['st_name','decade']).agg([min,max])" ] }, { "cell_type": "markdown", "id": "f0a64d2d-ae0f-4316-a6fb-b527f6bed050", "metadata": {}, "source": [ "## Pivot Tables" ] }, { "cell_type": "markdown", "id": "6422daf1-f8f2-4ca8-9960-b594ea434dbe", "metadata": {}, "source": [ "While we could groupby multiple columns above, the resulting dataset was honestly a bit messy. If you're interested in the relationship between two categorical variables and how they affect a third numerical variable, a pivot table is often best. For example, let us use a pivot table to visualize the average beer consumption across the decades for each state." ] }, { "cell_type": "code", "execution_count": 81, "id": "7aa85e33-3561-4491-a687-21f6add342f2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
decade1970198019902000
st_name
AK7715.913321.314393.614537.50
AL50665.673146.385399.294585.75
AR30060.543013.748867.051736.50
AZ57785.991379.6115175.2136681.75
CA460041.2630967.9637368.5653313.75
CO60350.583809.190452.1108031.75
CT55019.563606.158895.258480.50
DC17258.417290.415789.814452.25
DE12163.016415.817936.419900.25
FL186083.9309504.6364002.4399874.25
GA80081.0120273.0153751.9178871.50
HINaN30242.530353.228954.00
\n", "
" ], "text/plain": [ "decade 1970 1980 1990 2000\n", "st_name \n", "AK 7715.9 13321.3 14393.6 14537.50\n", "AL 50665.6 73146.3 85399.2 94585.75\n", "AR 30060.5 43013.7 48867.0 51736.50\n", "AZ 57785.9 91379.6 115175.2 136681.75\n", "CA 460041.2 630967.9 637368.5 653313.75\n", "CO 60350.5 83809.1 90452.1 108031.75\n", "CT 55019.5 63606.1 58895.2 58480.50\n", "DC 17258.4 17290.4 15789.8 14452.25\n", "DE 12163.0 16415.8 17936.4 19900.25\n", "FL 186083.9 309504.6 364002.4 399874.25\n", "GA 80081.0 120273.0 153751.9 178871.50\n", "HI NaN 30242.5 30353.2 28954.00" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(data=df, index='st_name', columns='decade', \n", " values='c_beer', aggfunc=np.mean).head(12)" ] }, { "cell_type": "markdown", "id": "76aa7ddc-c8d6-4649-b263-888ff0c8952e", "metadata": {}, "source": [ "As you can see, in the `pd.pivot_table()` method:\n", "- The `data` parameter tells the function which DataFrame to get the data from.\n", "- The `index` parameter says which categorical variable to use to form the index.\n", "- The `columns` parameter says which categorical variable to use to form the columns.\n", "- The `value` parameter says which numerical variable to aggregate.\n", "- The `aggfunc` parameter says which function to use to perform the aggregation.\n", "- Optionally, you can set the `fill_value` parameter to be the value you want to replace the `NaN`'s with. For example, as we don't have data on Hawaii in the 1970s, you can see that datapoint is currently a `NaN` value.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sklearn-env] *", "language": "python", "name": "conda-env-sklearn-env-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }