import pandas as pd

Loading in and Looking at the Data#

Throughout this chapter, we will use pandas to explore a dataset on beer sales across US states. Above, we have already imported pandas as pd, a common nickname for pandas (similar to how numpy is often abbreviated to np).

The dataset we’re using is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009). The original data can be found in the paper’s replication package, and we have modified it for educational use. The dataset includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

In this subchapter, we load the dataset and learn about different ways of viewing the data. Let us begin by loading it in.

Loading in the Dataset#

Right now, we have the dataset stored as a comma-separated values (CSV) file at the location data/beer_tax.csv. We can load it using pandas’ read_csv function. Below, we load it in and assign it to the variable df (short for DataFrame).

df = pd.read_csv('data/beer_tax.csv')
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Whenever we use a dataset, we ought to think about where it came from, how it was generated, and what it represents. Doing so is crucial for good data science: it helps us keep our eyes open for potential sources of bias in the data and avoid unfortunate mistakes. Additionally, if we do not think about where our data comes from, we cannot truly consider the ethical implications of our analysis.

Three important properties you ought to consider for any dataset are granularity, scope, and temporality:

  • Granularity: the granularity of a dataset is best summarized as what each individual row represents. A dataset with finer granularity has more detailed records. For example, a dataset where each row represents a residential neighborhood has coarser (the opposite of finer) granularity than a dataset where each row represents a single house in that neighborhood.

  • Scope: the scope of the dataset is the subset of the population of interest that is covered by the data. For example, if you’re interested in alcohol information for all US states from 1970 until the present, the beer dataset we’re using covers only beer data from 1970-2003.

  • Temporality: the temporality of the dataset describes its time element. We encourage you to consider which time period the data measures, when the data was originally collected, and when it was most recently updated. For example, our beer dataset spans 1970-2003 and was presumably last updated sometime shortly before 2009 (when the paper was published), but we would need to read the paper and its appendix to determine when the data was first measured. (See the quick check after this list.)
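
We can quickly confirm the scope and temporality of our dataset with pandas. The sketch below uses column selection, a technique we introduce later in this subchapter:

# First and last years covered by the data, and the number of distinct states
df['year'].min(), df['year'].max(), df['st_name'].nunique()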

Finally, real-world datasets are often messy. They have inaccuracies and mistakes. Whenever you obtain a dataset, one of the first steps you ought to take is to check how ‘faithful’ the data is to the values it claims to describe. To do so, you should look for clearly incorrect values (e.g. a negative age value), possible typos from data entry, signs of data falsification (e.g. incorrect email addresses), unspecified or incorrect units, and so on. A few such checks are sketched below.
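
Here is a minimal sketch of some faithfulness checks for our beer dataset; the specific checks are illustrative assumptions, and they again use column selection, which we introduce later in this subchapter.

(df['c_beer'] < 0).any() # Beer consumption should never be negative
(df['population'] < 0).any() # Neither should population
df['year'].between(1970, 2003).all() # Years should fall within the stated period
(df['st_name'].str.len() == 2).all() # State abbreviations should be two letters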

DataFrames, Series and Indices#

In pandas, all tables are stored as data structures called DataFrames. Each DataFrame has an index (the bolded column on the left) and columns (called Series).

While the index in our DataFrame ranges from 0 to 1707, this does not have to be the case. Index labels don’t have to be numbers, and they don’t have to be unique. For example, let us create a DataFrame df2 with an index that contains neither numbers nor unique values. Instead, the index is simply a for the first 854 rows and b for the last 854 rows.

df2 = df.copy() # Making a copy of the DataFrame df
df2.index = ['a'] * (len(df) // 2) + ['b'] * (len(df) // 2) # 854 a's followed by 854 b's
df2 # The index is neither numerical nor unique
st_name year c_beer beer_tax btax_dollars population salestax
a AL 1970 33098 72.341130 2.370 3450 4.0
a AL 1971 37598 69.304600 2.370 3497 4.0
a AL 1972 42719 67.149190 2.370 3539 4.0
a AL 1973 46203 63.217026 2.370 3580 4.0
a AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
b WY 1999 12423 0.319894 0.045 492 4.0
b WY 2000 12595 0.309491 0.045 494 4.0
b WY 2001 12808 0.300928 0.045 494 4.0
b WY 2002 13191 0.296244 0.045 499 4.0
b WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Each Series is simply a one-dimensional ordering of data, similar to a NumPy array. For example, in the cell below, we pull out the btax_dollars Series from df.

df['btax_dollars']
0       2.370
1       2.370
2       2.370
3       2.370
4       2.370
        ...  
1703    0.045
1704    0.045
1705    0.045
1706    0.045
1707    0.045
Name: btax_dollars, Length: 1708, dtype: float64

As you can see above, to pull out a singular Series from a DataFrame, we simply need to follow the notation DataFrame_Name[Series_Name].

However, if we’re instead interested in pulling out multiple Series from our DataFrame, we’re essentially pulling out another, smaller DataFrame from the first one (since a collection of Series forms a DataFrame). So, instead of passing a single Series name into the notation df[...], we pass in a list containing the names of the Series we’re interested in. An example is below.

df[['st_name','year','btax_dollars']]
st_name year btax_dollars
0 AL 1970 2.370
1 AL 1971 2.370
2 AL 1972 2.370
3 AL 1973 2.370
4 AL 1974 2.370
... ... ... ...
1703 WY 1999 0.045
1704 WY 2000 0.045
1705 WY 2001 0.045
1706 WY 2002 0.045
1707 WY 2003 0.045

1708 rows × 3 columns

Series Attributes#

When we look at a Series, we get the index on the left and the corresponding values on the right. You can view the index via the attribute index and the values via the attribute values. Examples are shown below.

df['btax_dollars'].index 
# How can you interpret the index here? Hint: look at the prerequisite chapter on NumPy.
RangeIndex(start=0, stop=1708, step=1)
df['btax_dollars'].values
array([2.37 , 2.37 , 2.37 , ..., 0.045, 0.045, 0.045])
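
As a side note, newer versions of pandas (0.24 and later) also provide the method to_numpy, which the pandas documentation recommends over the values attribute because it makes the conversion to a NumPy array explicit:

df['btax_dollars'].to_numpy() # Equivalent to df['btax_dollars'].values here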

DataFrame Attributes#

Three important DataFrame attributes are index, columns, and shape. index simply returns the index of the dataset, as shown below.

df.index
RangeIndex(start=0, stop=1708, step=1)

columns returns the names of the columns/Series that make up the dataset. Column/Series names should be unique; pandas technically allows duplicates, but they make selection ambiguous and are best avoided.

df.columns
Index(['st_name', 'year', 'c_beer', 'beer_tax', 'btax_dollars', 'population',
       'salestax'],
      dtype='object')

shape returns the dimensions, or the number of rows and the number of columns (in that order), of the dataset.

df.shape
(1708, 7)

Another attribute, size, returns the number of rows times the number of columns, or the total number of data points in the dataset.

df.size
11956

Finally, dtypes returns the data type of each column in the DataFrame.

df.dtypes
st_name          object
year              int64
c_beer            int64
beer_tax        float64
btax_dollars    float64
population        int64
salestax        float64
dtype: object

Note: the data type of the st_name column is object even though the column contains strings. This is because pandas stores strings as general Python objects rather than as a fixed-width NumPy type.
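
If you would rather have an explicitly string-typed column, recent versions of pandas (1.0 and later) provide a dedicated string dtype. Converting is optional; a minimal sketch looks like this:

df['st_name'].astype('string') # Convert the object-dtype column to pandas' string dtype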

Extracting Data using Labels and Indices#

There are four primary methods of extracting specific rows/columns from DataFrames. They are:

  • .head and .tail

  • .loc

  • .iloc

  • []

1. .head and .tail#

If we just want to select the first or last n rows from a dataset, we can use DataFrame_Name.head(n) (for the first n rows) or DataFrame_Name.tail(n) (for the last n rows).

df.head(5) # Extract the first 5 rows
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.37 3450 4.0
1 AL 1971 37598 69.304600 2.37 3497 4.0
2 AL 1972 42719 67.149190 2.37 3539 4.0
3 AL 1973 46203 63.217026 2.37 3580 4.0
4 AL 1974 49769 56.933796 2.37 3627 4.0
df.tail(5) # Extract the last 5 rows
st_name year c_beer beer_tax btax_dollars population salestax
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

2. .loc#

To better illustrate the difference between .loc and .iloc, we will use the DataFrame df2 created earlier. We’ve reproduced it here for ease of reference.

df2
st_name year c_beer beer_tax btax_dollars population salestax
a AL 1970 33098 72.341130 2.370 3450 4.0
a AL 1971 37598 69.304600 2.370 3497 4.0
a AL 1972 42719 67.149190 2.370 3539 4.0
a AL 1973 46203 63.217026 2.370 3580 4.0
a AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
b WY 1999 12423 0.319894 0.045 492 4.0
b WY 2000 12595 0.309491 0.045 494 4.0
b WY 2001 12808 0.300928 0.045 494 4.0
b WY 2002 13191 0.296244 0.045 499 4.0
b WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

.loc helps us get values based on the index/column labels associated with them. For example, let’s say we’re interested in all the btax_dollars values associated with an index label of a. We can use .loc to get all such values, as shown below.

df2.loc['a','btax_dollars'] 
a    2.370
a    2.370
a    2.370
a    2.370
a    2.370
     ...  
a    0.135
a    0.135
a    0.135
a    0.135
a    0.135
Name: btax_dollars, Length: 854, dtype: float64

Above, we first passed in the row labels (or index labels) we’re interested in, followed by the column labels we’re interested in. If we’re interested in multiple row/column labels, we must pass in a list of the labels we’re interested in. An example is shown below.

df2.loc['a',['st_name','year','btax_dollars']]
st_name year btax_dollars
a AL 1970 2.370
a AL 1971 2.370
a AL 1972 2.370
a AL 1973 2.370
a AL 1974 2.370
... ... ... ...
a MO 1993 0.135
a MO 1994 0.135
a MO 1995 0.135
a MO 1996 0.135
a MO 1997 0.135

854 rows × 3 columns

We can also use a slice (:) to specify that we’re interested in all the labels.

df2.loc[:,['st_name','year','btax_dollars']]
st_name year btax_dollars
a AL 1970 2.370
a AL 1971 2.370
a AL 1972 2.370
a AL 1973 2.370
a AL 1974 2.370
... ... ... ...
b WY 1999 0.045
b WY 2000 0.045
b WY 2001 0.045
b WY 2002 0.045
b WY 2003 0.045

1708 rows × 3 columns

3. .iloc#

.iloc works similarly to .loc, but it selects by the integer position associated with each row/column rather than by its label. For example, let’s extract all the btax_dollars values associated with an index label of a using .iloc.

df2.iloc[0:854,4]
a    2.370
a    2.370
a    2.370
a    2.370
a    2.370
     ...  
a    0.135
a    0.135
a    0.135
a    0.135
a    0.135
Name: btax_dollars, Length: 854, dtype: float64

Here, we extracted the same Series as with .loc, but needed to specify the index positions of the rows/columns we were interested in rather than their labels. We passed in 0:854 to indicate that we were interested in the first 854 rows, and 4 to indicate that we were interested in the fifth column.

Similar to .loc, we can pass single values, lists, or slices into .iloc. However, everything we pass must be an integer, since we’re selecting by index position. Another example is below.

df2.iloc[0:854,[0, 1, 4]]
st_name year btax_dollars
a AL 1970 2.370
a AL 1971 2.370
a AL 1972 2.370
a AL 1973 2.370
a AL 1974 2.370
... ... ... ...
a MO 1993 0.135
a MO 1994 0.135
a MO 1995 0.135
a MO 1996 0.135
a MO 1997 0.135

854 rows × 3 columns
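
One subtlety worth remembering: slices in .loc are label-based and include both endpoints, whereas slices in .iloc are position-based and exclude the stop (following Python’s usual slicing rules). We can verify this on df, whose index labels happen to coincide with integer positions:

df.loc[0:4].shape # (5, 7): the label slice includes row 4
df.iloc[0:4].shape # (4, 7): the position slice excludes row 4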

4. []#

The [] operator is the hardest to describe, but the easiest to write and, for many, the easiest to use. It simply follows the notation DataFrame_Name[...], where ... can be a single column label (similar to .loc), a list of column labels (similar to .loc), or a slice of row index positions (similar to .iloc). We’ve included examples of all three below. Do not worry if this seems confusing at first; feel free to use one of the first three methods instead.

Single Column Label#

df2['btax_dollars']
a    2.370
a    2.370
a    2.370
a    2.370
a    2.370
     ...  
b    0.045
b    0.045
b    0.045
b    0.045
b    0.045
Name: btax_dollars, Length: 1708, dtype: float64

List of Column Labels#

df2[['st_name','year','btax_dollars']]
st_name year btax_dollars
a AL 1970 2.370
a AL 1971 2.370
a AL 1972 2.370
a AL 1973 2.370
a AL 1974 2.370
... ... ... ...
b WY 1999 0.045
b WY 2000 0.045
b WY 2001 0.045
b WY 2002 0.045
b WY 2003 0.045

1708 rows × 3 columns

Slice of Row Index Positions#

df2[0:5]
st_name year c_beer beer_tax btax_dollars population salestax
a AL 1970 33098 72.341130 2.37 3450 4.0
a AL 1971 37598 69.304600 2.37 3497 4.0
a AL 1972 42719 67.149190 2.37 3539 4.0
a AL 1973 46203 63.217026 2.37 3580 4.0
a AL 1974 49769 56.933796 2.37 3627 4.0

In the next subchapter, we will talk about conditional selection, which allows us to view all rows that satisfy a certain condition.