import pandas as pd
Loading in and Looking at the Data#
Throughout this chapter, we will use `pandas` to explore a dataset on beer sales across states in the continental US. Above, we have already imported `pandas` as `pd`, a common nickname for `pandas` (similar to how `numpy` is often abbreviated to `np`).
The dataset we’re using is sourced from *Salience and Taxation: Theory and Evidence* by Chetty, Looney, and Kroft (AER 2009). The original data can be found in the paper’s replication package, and we have modified it for educational use. The dataset includes 7 columns:

- `st_name`: the state abbreviation
- `year`: the year the data was recorded
- `c_beer`: the quantity of beer consumed, in thousands of gallons
- `beer_tax`: the ad valorem tax, as a percentage
- `btax_dollars`: the excise tax, in dollars per case (24 cans) of beer
- `population`: the population of the state, in thousands
- `salestax`: the sales tax percentage
In this subchapter, we load in the dataset and learn about different ways of viewing the data. Let us begin by loading in the dataset.
Loading in the dataset#
Right now, we have the dataset stored as a comma-separated values (CSV) file, in the location `data/beer_tax.csv`. We can load it in using pandas' `read_csv` function. Below, we load it in and assign it to the variable `df` (short for DataFrame).
df = pd.read_csv('data/beer_tax.csv')
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Whenever we use a dataset, we ought to think about where it came from, how it was generated, and what it represents. Doing so is crucial for good data science: it helps us keep our eyes open for potential sources of bias in the data and avoid unfortunate mistakes. Moreover, if we do not think about where our data comes from, we cannot truly consider the ethical implications of our analysis.
Three important properties you ought to consider with datasets are granularity, scope, and temporality:

- Granularity: the granularity of a dataset can best be summarized as what each individual row in the dataset represents. A dataset with finer granularity than another dataset has more detailed records. For example, a dataset where each row represents a residential neighborhood has coarser (the opposite of finer) granularity than a dataset where each row represents a single house in a neighborhood.
- Scope: the scope of the dataset is the subset of the population of interest that is covered by the data. For example, if you’re interested in alcohol information for all US states from 1970 until the present, the beer dataset we’re using covers only beer data from 1970-2003.
- Temporality: the temporality of the dataset describes its time element. We encourage you to consider which time period the data is measuring, when the data was originally measured, and when it was most recently updated. For example, our beer dataset spans 1970-2003 and was presumably last updated sometime shortly before 2009 (when the paper was published), but we would need to read the paper and its appendix to determine when the data was first measured.
Finally, real-world datasets are often messy: they contain inaccuracies and mistakes. Whenever you obtain a dataset, one of the first steps you ought to take is to check how ‘faithful’ the data is to the values it claims to describe. To do so, you should look for clearly incorrect values (e.g. a negative age), possible typos from data entry, signs of data falsification (e.g. invalid email addresses), unspecified or incorrect units, and so on.
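Checks like these can be sketched directly in pandas. The snippet below uses a small hand-made frame with a few of the beer dataset's columns (the values are illustrative, not the real data), since the idea applies to any dataset:

```python
import pandas as pd

# A small synthetic frame with a few of the beer dataset's columns
# (illustrative values only, not the real data)
sample = pd.DataFrame({
    'st_name': ['AL', 'AL', 'WY'],
    'year': [1970, 1971, 2003],
    'c_beer': [33098, 37598, 15535],
    'population': [3450, 3497, 501],
})

# Count missing values in each column
missing = sample.isna().sum()

# Flag clearly impossible values, e.g. negative consumption or population
has_negative = ((sample['c_beer'] < 0) | (sample['population'] < 0)).any()

# Check that the recorded years fall in the range the dataset claims to cover
years_ok = sample['year'].between(1970, 2003).all()

print(missing.sum(), has_negative, years_ok)
```

None of these checks proves the data is correct, but a failure on any of them is a strong hint that something went wrong upstream.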
DataFrames, Series and Indices#
In `pandas`, all tables are stored as data structures called DataFrames. Each DataFrame has an index (the bolded column on the left) and columns (each stored as a Series).
While the index in our DataFrame ranges from 0 to 1707, this does not have to be the case: indices don’t have to be numbers, and they don’t have to be unique. For example, let us create a DataFrame `df2` with an index that contains neither numbers nor unique values. Instead, the index is simply `a` for the first 854 rows and `b` for the last 854 rows.
df2 = df.copy() # Make a copy of the DataFrame df
df2.index = ['a'] * (len(df) // 2) + ['b'] * (len(df) // 2)
df2 # The index is neither numerical nor unique
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
b | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
b | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
b | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
b | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
b | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Each Series is simply a 1D ordering of data, similar to a NumPy array. For example, we pull out the `btax_dollars` Series from `df` in the cell below.
df['btax_dollars']
0 2.370
1 2.370
2 2.370
3 2.370
4 2.370
...
1703 0.045
1704 0.045
1705 0.045
1706 0.045
1707 0.045
Name: btax_dollars, Length: 1708, dtype: float64
As you can see above, to pull out a single Series from a DataFrame, we simply follow the notation `DataFrame_Name[Series_Name]`.

However, if we’re instead interested in pulling out multiple Series from our DataFrame, we’re essentially pulling out another, smaller DataFrame from the first (since multiple Series make up a DataFrame). So, instead of passing a single Series name into `df[...]`, we pass in a list containing the names of the Series we’re interested in. An example is below.
df[['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
0 | AL | 1970 | 2.370 |
1 | AL | 1971 | 2.370 |
2 | AL | 1972 | 2.370 |
3 | AL | 1973 | 2.370 |
4 | AL | 1974 | 2.370 |
... | ... | ... | ... |
1703 | WY | 1999 | 0.045 |
1704 | WY | 2000 | 0.045 |
1705 | WY | 2001 | 0.045 |
1706 | WY | 2002 | 0.045 |
1707 | WY | 2003 | 0.045 |
1708 rows × 3 columns
Series Attributes#
When we look at a Series, we get the index on the left and the corresponding values on the right. You can view the index via the attribute `index` and the values via the attribute `values`. Examples are shown below.
df['btax_dollars'].index
# How can you interpret the index here? Hint: look at the prerequisite chapter on NumPy.
RangeIndex(start=0, stop=1708, step=1)
df['btax_dollars'].values
array([2.37 , 2.37 , 2.37 , ..., 0.045, 0.045, 0.045])
DataFrame Attributes#
Three important DataFrame attributes are `index`, `columns`, and `shape`. `index` simply returns the index of the DataFrame, as shown below.
df.index
RangeIndex(start=0, stop=1708, step=1)
`columns` returns the names of the columns/Series that make up the DataFrame. Note that each column/Series name must be unique.
df.columns
Index(['st_name', 'year', 'c_beer', 'beer_tax', 'btax_dollars', 'population',
'salestax'],
dtype='object')
`shape` returns the dimensions of the DataFrame: the number of rows and the number of columns, in that order.
df.shape
(1708, 7)
Moreover, `size` is another attribute that returns the number of rows times the number of columns, i.e. the total number of datapoints in the DataFrame.
df.size
11956
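In other words, `size` is always the product of the two entries of `shape`. A quick sketch with a small made-up frame:

```python
import pandas as pd

# A tiny frame: 3 rows, 2 columns
small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = small.shape   # (3, 2)
print(small.size)          # 6, i.e. rows * cols
```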
Finally, `dtypes` returns the data type of each column in the DataFrame.
df.dtypes
st_name object
year int64
c_beer int64
beer_tax float64
btax_dollars float64
population int64
salestax float64
dtype: object
Note: the data type for the `st_name` column is `object` even though the column contains strings. This is because `pandas` stores strings in columns of the generic `object` dtype, which can hold arbitrary Python objects.
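As a brief illustration, a string column stored with the generic `object` dtype simply holds ordinary Python strings; pandas also offers a dedicated, opt-in string dtype via `astype`:

```python
import pandas as pd

# Recreate a small string column with the generic object dtype
s = pd.Series(['AL', 'AK', 'WY'], dtype=object)

print(s.dtype)          # the generic dtype that can hold arbitrary Python objects
print(type(s.iloc[0]))  # each entry is an ordinary Python string

# pandas also offers a dedicated (opt-in) string dtype
print(s.astype('string').dtype)
```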
Extracting Data using Labels and Indices#
There are four primary methods of extracting specific rows/columns from DataFrames. They are:

1. `.head` and `.tail`
2. `.loc`
3. `.iloc`
4. `[]`
1. `.head` and `.tail`#
If we just want to select the first or last n rows of a DataFrame, we can use `DataFrame_Name.head(n)` (for the first n rows) or `DataFrame_Name.tail(n)` (for the last n rows).
df.head(5) # Extract the first 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
df.tail(5) # Extract the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
2. `.loc`#
To better illustrate the difference between `.loc` and `.iloc`, we will use the DataFrame `df2` created earlier. We’ve displayed it again here for ease of access.
df2
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
b | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
b | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
b | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
b | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
b | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
`.loc` helps us get values based on the index/column labels associated with them. For example, let’s say we’re interested in all the `btax_dollars` values associated with an index label of `a`. We can use `.loc` to get all such values, as shown below.
df2.loc['a','btax_dollars']
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
a 0.135
a 0.135
a 0.135
a 0.135
a 0.135
Name: btax_dollars, Length: 854, dtype: float64
Above, we first passed in the row (index) labels we’re interested in, followed by the column labels. If we want multiple row/column labels, we must pass in a list of those labels. An example is shown below.
df2.loc['a',['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
a | MO | 1993 | 0.135 |
a | MO | 1994 | 0.135 |
a | MO | 1995 | 0.135 |
a | MO | 1996 | 0.135 |
a | MO | 1997 | 0.135 |
854 rows × 3 columns
We can also use a slice (`:`) to specify that we’re interested in all the labels.
df2.loc[:,['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
b | WY | 1999 | 0.045 |
b | WY | 2000 | 0.045 |
b | WY | 2001 | 0.045 |
b | WY | 2002 | 0.045 |
b | WY | 2003 | 0.045 |
1708 rows × 3 columns
3. `.iloc`#
`.iloc` works similarly to `.loc`, but it uses the integer positions associated with each row/column rather than their labels. For example, let’s extract all the `btax_dollars` values associated with an index label of `a` using `.iloc`.
df2.iloc[0:854,4]
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
a 0.135
a 0.135
a 0.135
a 0.135
a 0.135
Name: btax_dollars, Length: 854, dtype: float64
Here, we extracted the same Series as with `.loc`, but we needed to specify the index positions of the rows/columns we were interested in rather than their labels. We passed in `0:854` to indicate that we wanted the first 854 rows, and `4` to indicate that we wanted the fifth column.
Similar to `.loc`, we can pass in single values, lists, or slices as arguments to `.iloc`. However, everything we pass in must be an integer, since we’re referring to index positions. Another example is below.
df2.iloc[0:854,[0, 1, 4]]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
a | MO | 1993 | 0.135 |
a | MO | 1994 | 0.135 |
a | MO | 1995 | 0.135 |
a | MO | 1996 | 0.135 |
a | MO | 1997 | 0.135 |
854 rows × 3 columns
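One subtlety worth flagging: a slice in `.loc` includes its endpoint label, while a slice in `.iloc` excludes its endpoint position, just like ordinary Python slicing. A small sketch with a made-up frame:

```python
import pandas as pd

# A tiny frame with the default integer index 0..4
small = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

# .loc slices by label and includes both endpoints
print(len(small.loc[0:3]))   # 4 rows: labels 0, 1, 2, 3

# .iloc slices by position and excludes the stop, like Python lists
print(len(small.iloc[0:3]))  # 3 rows: positions 0, 1, 2
```

This is why `df2.iloc[0:854, 4]` above grabbed exactly the first 854 rows, positions 0 through 853.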
4. `[]`#
The `[]` operator is the hardest to describe, but the easiest to write and, for many, the easiest to use. It simply follows the notation `DataFrame_Name[...]`, where `...` can be a single column label (similar to `.loc`), a list of column labels (similar to `.loc`), or a slice of row index positions (similar to `.iloc`). We’ve included examples of all three below. Do not worry if this seems confusing at first; feel free to use one of the first three methods instead.
Single Column Label#
df2['btax_dollars']
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
b 0.045
b 0.045
b 0.045
b 0.045
b 0.045
Name: btax_dollars, Length: 1708, dtype: float64
List of Column Labels#
df2[['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
b | WY | 1999 | 0.045 |
b | WY | 2000 | 0.045 |
b | WY | 2001 | 0.045 |
b | WY | 2002 | 0.045 |
b | WY | 2003 | 0.045 |
1708 rows × 3 columns
Slice of Row Index Positions#
df2[0:5]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
In the next subchapter, we will talk about conditional selection, which allows us to view all rows that satisfy a certain condition.