{ "cells": [ { "cell_type": "code", "execution_count": 1, "source": [ "import pandas as pd" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Loading in and Looking at the Data" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Throughout this chapter, we will use `pandas` to explore a dataset on beer sales across 50 continental states in the US. \n", "Above, we have already imported in `pandas` as `pd`, a common nickname for `pandas` (similar to how `numpy` is often abbreviated to `np`).\n", "\n", "The dataset we're using is sourced from [_Salience and Taxation: Theory and Evidence_](https://www.aeaweb.org/articles?id=10.1257/aer.99.4.1145) by Chetty, Looney, and Kroft (AER 2009). The original data can be found at the paper's [replication package](https://www.openicpsr.org/openicpsr/project/113312/version/V1/view), and we have modified it for educational use. The dataset includes 7 columns:\n", "- `st_name`: the state abbreviation\n", "- `year`: the year the data was recorded\n", "- `c_beer`: the quantity of beer consumed, in thousands of gallons\n", "- `beer_tax`: the ad valorem tax, as a percentage\n", "- `btax_dollars`: the excise tax, represented in dollars per case (24 cans) of beer \n", "- `population`: the population of the state, in thousands\n", "- `salestax`: the sales tax percentage\n", "\n", "In this subchapter, we load in the dataset and learn about different ways of viewing the data. Let us begin by loading in the dataset. " ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Loading in the dataset\n", "\n", "Right now, we have the dataset stored as a [comma-separated values (csv) file](https://www.lifewire.com/csv-file-2622708), in the location `data/beer_tax.csv`. We can load it in using `panda`'s [`read_csv` method](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). Below, we load it in and set it equal to the variable `df` (short for DataFrame)." ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "df = pd.read_csv('data/beer_tax.csv')\n", "df" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
0AL19703309872.3411302.37034504.0
1AL19713759869.3046002.37034974.0
2AL19724271967.1491902.37035394.0
3AL19734620363.2170262.37035804.0
4AL19744976956.9337962.37036274.0
........................
1703WY1999124230.3198940.0454924.0
1704WY2000125950.3094910.0454944.0
1705WY2001128080.3009280.0454944.0
1706WY2002131910.2962440.0454994.0
1707WY2003155350.2896430.0455014.0
\n", "

1708 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.370 3450 4.0\n", "1 AL 1971 37598 69.304600 2.370 3497 4.0\n", "2 AL 1972 42719 67.149190 2.370 3539 4.0\n", "3 AL 1973 46203 63.217026 2.370 3580 4.0\n", "4 AL 1974 49769 56.933796 2.370 3627 4.0\n", "... ... ... ... ... ... ... ...\n", "1703 WY 1999 12423 0.319894 0.045 492 4.0\n", "1704 WY 2000 12595 0.309491 0.045 494 4.0\n", "1705 WY 2001 12808 0.300928 0.045 494 4.0\n", "1706 WY 2002 13191 0.296244 0.045 499 4.0\n", "1707 WY 2003 15535 0.289643 0.045 501 4.0\n", "\n", "[1708 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 4 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Whenever we use a dataset, we ought to think about where it came from, how it was generated and what it represents. Doing so is absolutely crucial for good data science. Doing so will help us keep our eyes open for potential sources of bias in the data, avoiding unfortunate mistakes. Additionally, if we do not think about where we source our data from, we cannot truly consider the ethical implications of our analysis.\n", "\n", "Three important metrics you ought to consider with datasets are granularity, scope and temporality;\n", "- Granularity: the granularity of a dataset can best be summarized as what each individual row in the dataset represents. A dataset with finer granularity than another dataset has more detailed records. For example, a dataset where each row represents a residential neighborhood has coarser (the opposite of finer) granularity than a dataset where each row represents a house in a residential neighborhood.\n", "- Scope: the scope of the dataset is the subset of the population of interest that is covered by the data. For example, if you're interested in alcohol information for all US states from 1970 until the present, the beer dataset we're using has a scope of just beer data from 1970-2003.\n", "- Temporality: the temporality of the dataset describes the time element of the dataset. We encourage you to consider which time period the data is measuring, when the data was originally measured and when it was most recently updated. For example, our beer dataset spans from 1970-2003 and was last updated presumably sometime shortly before 2009 (when the paper was published), but we would need to read the paper and it's appendix to determine when the data was first measured.\n", "\n", "Finally, real world datasets are often messy. They have inaccuracies and mistakes. Whenever you obtain a dataset, the one of the first steps you ought to do is to see how 'faithful' the data is to the values it claims to describe. To do so, you should check for clearly incorrect values (e.g. a negative age value), possible typos from data entry, signs of data falsification (e.g. incorrect email addresses), unspecified or incorrect units, etc." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## DataFrames, Series and Indices" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "In `pandas`, all tables are stored as data structures called as `DataFrames`. Each DataFrame has an index (the bolded column on the left) and columns (called as `Series`). \n", "\n", "While the index in our DataFrame ranges from 0 to 1707, this does not have to be the case. In an index, indices don't have to be numbers and they don't have to be unique. For example, let us create a DataFrame `df2` with an index that contains neither numbers nor unique values. Instead, the index is simply `a` for the first 854 rows and `b` for the last 854 rows.\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": 34, "source": [ "df2 = df.copy() # Making a copy of the DataFrame df\n", "df2.index = ['a' for _ in range(int(len(df)/2))] + ['b' for _ in range(int(len(df)/2))]\n", "df2 # The index is neither numerical nor unique" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
aAL19703309872.3411302.37034504.0
aAL19713759869.3046002.37034974.0
aAL19724271967.1491902.37035394.0
aAL19734620363.2170262.37035804.0
aAL19744976956.9337962.37036274.0
........................
bWY1999124230.3198940.0454924.0
bWY2000125950.3094910.0454944.0
bWY2001128080.3009280.0454944.0
bWY2002131910.2962440.0454994.0
bWY2003155350.2896430.0455014.0
\n", "

1708 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "a AL 1970 33098 72.341130 2.370 3450 4.0\n", "a AL 1971 37598 69.304600 2.370 3497 4.0\n", "a AL 1972 42719 67.149190 2.370 3539 4.0\n", "a AL 1973 46203 63.217026 2.370 3580 4.0\n", "a AL 1974 49769 56.933796 2.370 3627 4.0\n", ".. ... ... ... ... ... ... ...\n", "b WY 1999 12423 0.319894 0.045 492 4.0\n", "b WY 2000 12595 0.309491 0.045 494 4.0\n", "b WY 2001 12808 0.300928 0.045 494 4.0\n", "b WY 2002 13191 0.296244 0.045 499 4.0\n", "b WY 2003 15535 0.289643 0.045 501 4.0\n", "\n", "[1708 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 34 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "\n", "Each Series is simply a 1D ordering of data, similar to a NumPy array. For example, we pull out the Series of `btax_dollars` in `df` in the cell below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "df['btax_dollars']" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 2.370\n", "1 2.370\n", "2 2.370\n", "3 2.370\n", "4 2.370\n", " ... \n", "1703 0.045\n", "1704 0.045\n", "1705 0.045\n", "1706 0.045\n", "1707 0.045\n", "Name: btax_dollars, Length: 1708, dtype: float64" ] }, "metadata": {}, "execution_count": 5 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "As you can see above, to pull out a singular Series from a DataFrame, we simply need to follow the notation `DataFrame_Name[Series_Name]`.\n", "\n", "However, if we're instead interested in pulling out multiple Series from our DataFrame, we're essentially interested in pulling out another, smaller DataFrame from the first DataFrame (since multiple Series = DataFrame). So, instead of passing in a single Series name into the notation `df[...]`, we instead pass in a list containing the names of the Series we're interested in. An example is below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "df[['st_name','year','btax_dollars']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearbtax_dollars
0AL19702.370
1AL19712.370
2AL19722.370
3AL19732.370
4AL19742.370
............
1703WY19990.045
1704WY20000.045
1705WY20010.045
1706WY20020.045
1707WY20030.045
\n", "

1708 rows × 3 columns

\n", "
" ], "text/plain": [ " st_name year btax_dollars\n", "0 AL 1970 2.370\n", "1 AL 1971 2.370\n", "2 AL 1972 2.370\n", "3 AL 1973 2.370\n", "4 AL 1974 2.370\n", "... ... ... ...\n", "1703 WY 1999 0.045\n", "1704 WY 2000 0.045\n", "1705 WY 2001 0.045\n", "1706 WY 2002 0.045\n", "1707 WY 2003 0.045\n", "\n", "[1708 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 7 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Series Attributes" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "When we look at a Series, we get the index on the left and the corresponding values on the right. You can view the index via the attribute `index` and the values via the attribute `values`. Examples are shown below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "df['btax_dollars'].index \n", "# How can you interpret the index here? Hint: look at the prerequiste chapter on NumPy." ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RangeIndex(start=0, stop=1708, step=1)" ] }, "metadata": {}, "execution_count": 8 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "df['btax_dollars'].values" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([2.37 , 2.37 , 2.37 , ..., 0.045, 0.045, 0.045])" ] }, "metadata": {}, "execution_count": 9 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### DataFrame Attributes" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Three important DataFrame attributes are `index`, `columns`, and `shape`. `index` simply returns the index of the dataset, as shown below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "df.index" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RangeIndex(start=0, stop=1708, step=1)" ] }, "metadata": {}, "execution_count": 10 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "`columns` returns the name of the columns/Series that make up the dataset. Note that each column/Series name must be unique." ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "df.columns" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['st_name', 'year', 'c_beer', 'beer_tax', 'btax_dollars', 'population',\n", " 'salestax'],\n", " dtype='object')" ] }, "metadata": {}, "execution_count": 11 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "`shape` returns the dimensions, or the number of rows and the number of columns (in that order), of the dataset." ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "df.shape" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(1708, 7)" ] }, "metadata": {}, "execution_count": 12 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Moreover, `size` is another attributes that returns the number of rows times the number of columns, or the total number of datapoints, in the dataset." ], "metadata": {} }, { "cell_type": "code", "execution_count": 49, "source": [ "df.size" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "11956" ] }, "metadata": {}, "execution_count": 49 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Finally, `dtypes` returns the data type of each column in the DataFrame." ], "metadata": {} }, { "cell_type": "code", "execution_count": 50, "source": [ "df.dtypes" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "st_name object\n", "year int64\n", "c_beer int64\n", "beer_tax float64\n", "btax_dollars float64\n", "population int64\n", "salestax float64\n", "dtype: object" ] }, "metadata": {}, "execution_count": 50 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Note: the data type for the `st_name` column is `object` even though the column contains strings. Click [here](https://stackoverflow.com/a/21020411) to learn why this happens." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Extracting Data using Labels and Indices" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "There are four primary methods of looking at extracting specific rows/columns from DataFrames. They are :-\n", "- `.head` and `.tail` \n", "- `.loc` \n", "- `.iloc`\n", "- `[]`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 1. `.head` and `.tail`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "If we just want to select the first or last n rows from a dataset, we can use `DataFrame_Name.head(n)` (for the first n rows) or `DataFrame_Name.tail(n)` (for the last n rows)." ], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "df.head(5) # Extract the first 5 rows" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
0AL19703309872.3411302.3734504.0
1AL19713759869.3046002.3734974.0
2AL19724271967.1491902.3735394.0
3AL19734620363.2170262.3735804.0
4AL19744976956.9337962.3736274.0
\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "0 AL 1970 33098 72.341130 2.37 3450 4.0\n", "1 AL 1971 37598 69.304600 2.37 3497 4.0\n", "2 AL 1972 42719 67.149190 2.37 3539 4.0\n", "3 AL 1973 46203 63.217026 2.37 3580 4.0\n", "4 AL 1974 49769 56.933796 2.37 3627 4.0" ] }, "metadata": {}, "execution_count": 13 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 14, "source": [ "df.tail(5) # Extract the last 5 rows" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
1703WY1999124230.3198940.0454924.0
1704WY2000125950.3094910.0454944.0
1705WY2001128080.3009280.0454944.0
1706WY2002131910.2962440.0454994.0
1707WY2003155350.2896430.0455014.0
\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "1703 WY 1999 12423 0.319894 0.045 492 4.0\n", "1704 WY 2000 12595 0.309491 0.045 494 4.0\n", "1705 WY 2001 12808 0.300928 0.045 494 4.0\n", "1706 WY 2002 13191 0.296244 0.045 499 4.0\n", "1707 WY 2003 15535 0.289643 0.045 501 4.0" ] }, "metadata": {}, "execution_count": 14 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 2. `.loc`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "To better illustrate the difference between `.loc` and `iloc`, we will use the DataFrame `df2`, as created earlier. We've reposted it here for ease of access." ], "metadata": {} }, { "cell_type": "code", "execution_count": 35, "source": [ "df2" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
aAL19703309872.3411302.37034504.0
aAL19713759869.3046002.37034974.0
aAL19724271967.1491902.37035394.0
aAL19734620363.2170262.37035804.0
aAL19744976956.9337962.37036274.0
........................
bWY1999124230.3198940.0454924.0
bWY2000125950.3094910.0454944.0
bWY2001128080.3009280.0454944.0
bWY2002131910.2962440.0454994.0
bWY2003155350.2896430.0455014.0
\n", "

1708 rows × 7 columns

\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "a AL 1970 33098 72.341130 2.370 3450 4.0\n", "a AL 1971 37598 69.304600 2.370 3497 4.0\n", "a AL 1972 42719 67.149190 2.370 3539 4.0\n", "a AL 1973 46203 63.217026 2.370 3580 4.0\n", "a AL 1974 49769 56.933796 2.370 3627 4.0\n", ".. ... ... ... ... ... ... ...\n", "b WY 1999 12423 0.319894 0.045 492 4.0\n", "b WY 2000 12595 0.309491 0.045 494 4.0\n", "b WY 2001 12808 0.300928 0.045 494 4.0\n", "b WY 2002 13191 0.296244 0.045 499 4.0\n", "b WY 2003 15535 0.289643 0.045 501 4.0\n", "\n", "[1708 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 35 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "`.loc` helps us get values based on the index/column labels associated with them. For example, let's say we're interested in all the `btax_dollars` values associated with an index label of `a`. We can use `.loc` to get all such values, as shown below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 36, "source": [ "df2.loc['a','btax_dollars'] " ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", " ... \n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "Name: btax_dollars, Length: 854, dtype: float64" ] }, "metadata": {}, "execution_count": 36 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Above, we first passed in the row labels (or index labels) we're interested in, followed by the column labels we're interested in. If we're interested in multiple row/column labels, we must pass in a list of the labels we're interested in. An example is shown below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 38, "source": [ "df2.loc['a',['st_name','year','btax_dollars']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearbtax_dollars
aAL19702.370
aAL19712.370
aAL19722.370
aAL19732.370
aAL19742.370
............
aMO19930.135
aMO19940.135
aMO19950.135
aMO19960.135
aMO19970.135
\n", "

854 rows × 3 columns

\n", "
" ], "text/plain": [ " st_name year btax_dollars\n", "a AL 1970 2.370\n", "a AL 1971 2.370\n", "a AL 1972 2.370\n", "a AL 1973 2.370\n", "a AL 1974 2.370\n", ".. ... ... ...\n", "a MO 1993 0.135\n", "a MO 1994 0.135\n", "a MO 1995 0.135\n", "a MO 1996 0.135\n", "a MO 1997 0.135\n", "\n", "[854 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 38 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "We can also use a slice (`:`) to specify that we're interested in all the labels." ], "metadata": {} }, { "cell_type": "code", "execution_count": 41, "source": [ "df2.loc[:,['st_name','year','btax_dollars']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearbtax_dollars
aAL19702.370
aAL19712.370
aAL19722.370
aAL19732.370
aAL19742.370
............
bWY19990.045
bWY20000.045
bWY20010.045
bWY20020.045
bWY20030.045
\n", "

1708 rows × 3 columns

\n", "
" ], "text/plain": [ " st_name year btax_dollars\n", "a AL 1970 2.370\n", "a AL 1971 2.370\n", "a AL 1972 2.370\n", "a AL 1973 2.370\n", "a AL 1974 2.370\n", ".. ... ... ...\n", "b WY 1999 0.045\n", "b WY 2000 0.045\n", "b WY 2001 0.045\n", "b WY 2002 0.045\n", "b WY 2003 0.045\n", "\n", "[1708 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 41 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 3. `.iloc`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "`.iloc` works similarly to `.loc`, but it looks at the integers associated with each label rather than the label itself. For example, let's extract all the `btax_dollars` values associated with an index label of `a` using `.iloc`." ], "metadata": {} }, { "cell_type": "code", "execution_count": 42, "source": [ "df2.iloc[0:854,4]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", " ... \n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "a 0.135\n", "Name: btax_dollars, Length: 854, dtype: float64" ] }, "metadata": {}, "execution_count": 42 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Here, we extracted the same Series as with `.loc`, but needed to specify the index positions of the rows/columns we were interested in rather than their labels. We passed in `0:854` to indicate that we were interested in the first 854 rows, and `4` to indicate that we were interested in the fifth column.\n", "\n", "Similar to `.loc`, we can pass in single values, lists or slices to the arguments of `.iloc`. However, everything we pass in must be an integer (as we're looking at the index positions). Another example is below." ], "metadata": {} }, { "cell_type": "code", "execution_count": 43, "source": [ "df2.iloc[0:854,[0, 1, 4]]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearbtax_dollars
aAL19702.370
aAL19712.370
aAL19722.370
aAL19732.370
aAL19742.370
............
aMO19930.135
aMO19940.135
aMO19950.135
aMO19960.135
aMO19970.135
\n", "

854 rows × 3 columns

\n", "
" ], "text/plain": [ " st_name year btax_dollars\n", "a AL 1970 2.370\n", "a AL 1971 2.370\n", "a AL 1972 2.370\n", "a AL 1973 2.370\n", "a AL 1974 2.370\n", ".. ... ... ...\n", "a MO 1993 0.135\n", "a MO 1994 0.135\n", "a MO 1995 0.135\n", "a MO 1996 0.135\n", "a MO 1997 0.135\n", "\n", "[854 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 43 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 4. `[]`" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "The `[]` is the hardest to describe, but the easiest to write and for many, the easiest to use. The `[]` simply follows the notation `DataFrame_Name[...]`, where `...` can be a single column label (similar to `.loc`), a list of column labels (similar to `.loc`) or a slice of row index positions (similar to `.iloc`). We've included examples for all three below. Do not worry if this seems confusing at first, feel free to use one of the first three methods instead." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "#### Single Column Label" ], "metadata": {} }, { "cell_type": "code", "execution_count": 46, "source": [ "df2['btax_dollars']" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", "a 2.370\n", " ... \n", "b 0.045\n", "b 0.045\n", "b 0.045\n", "b 0.045\n", "b 0.045\n", "Name: btax_dollars, Length: 1708, dtype: float64" ] }, "metadata": {}, "execution_count": 46 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "#### List of Column Labels" ], "metadata": {} }, { "cell_type": "code", "execution_count": 47, "source": [ "df2[['st_name','year','btax_dollars']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearbtax_dollars
aAL19702.370
aAL19712.370
aAL19722.370
aAL19732.370
aAL19742.370
............
bWY19990.045
bWY20000.045
bWY20010.045
bWY20020.045
bWY20030.045
\n", "

1708 rows × 3 columns

\n", "
" ], "text/plain": [ " st_name year btax_dollars\n", "a AL 1970 2.370\n", "a AL 1971 2.370\n", "a AL 1972 2.370\n", "a AL 1973 2.370\n", "a AL 1974 2.370\n", ".. ... ... ...\n", "b WY 1999 0.045\n", "b WY 2000 0.045\n", "b WY 2001 0.045\n", "b WY 2002 0.045\n", "b WY 2003 0.045\n", "\n", "[1708 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 47 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "#### Slice of Row Index Positions" ], "metadata": {} }, { "cell_type": "code", "execution_count": 48, "source": [ "df2[0:5]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st_nameyearc_beerbeer_taxbtax_dollarspopulationsalestax
aAL19703309872.3411302.3734504.0
aAL19713759869.3046002.3734974.0
aAL19724271967.1491902.3735394.0
aAL19734620363.2170262.3735804.0
aAL19744976956.9337962.3736274.0
\n", "
" ], "text/plain": [ " st_name year c_beer beer_tax btax_dollars population salestax\n", "a AL 1970 33098 72.341130 2.37 3450 4.0\n", "a AL 1971 37598 69.304600 2.37 3497 4.0\n", "a AL 1972 42719 67.149190 2.37 3539 4.0\n", "a AL 1973 46203 63.217026 2.37 3580 4.0\n", "a AL 1974 49769 56.933796 2.37 3627 4.0" ] }, "metadata": {}, "execution_count": 48 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "In the next subchapter, we will talk about conditional selection, which allows us to view all rows that satisfy a certain condition." ], "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sklearn-env]", "language": "python", "name": "conda-env-sklearn-env-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }