import pandas as pd
Loading in and Looking at the Data#
Throughout this chapter, we will use `pandas` to explore a dataset on beer sales across states in the continental US. Above, we have already imported `pandas` as `pd`, a common nickname for `pandas` (similar to how `numpy` is often abbreviated to `np`).
The dataset we’re using is sourced from *Salience and Taxation: Theory and Evidence* by Chetty, Looney, and Kroft (AER 2009). The original data can be found in the paper’s replication package, and we have modified it for educational use. The dataset includes 7 columns:

- `st_name`: the state abbreviation
- `year`: the year the data was recorded
- `c_beer`: the quantity of beer consumed, in thousands of gallons
- `beer_tax`: the ad valorem tax, as a percentage
- `btax_dollars`: the excise tax, in dollars per case (24 cans) of beer
- `population`: the population of the state, in thousands
- `salestax`: the sales tax percentage
In this subchapter, we load in the dataset and learn about different ways of viewing the data. Let us begin by loading in the dataset.
Loading in the dataset#
Right now, we have the dataset stored as a comma-separated values (CSV) file, in the location `data/beer_tax.csv`. We can load it in using pandas' `read_csv` function. Below, we load it in and assign it to the variable `df` (short for DataFrame).
df = pd.read_csv('data/beer_tax.csv')
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Whenever we use a dataset, we ought to think about where it came from, how it was generated, and what it represents. Doing so is crucial for good data science: it helps us keep our eyes open for potential sources of bias in the data and avoid unfortunate mistakes. Moreover, if we do not think about where our data comes from, we cannot truly consider the ethical implications of our analysis.
Three important properties you ought to consider with datasets are granularity, scope, and temporality:

- Granularity: the granularity of a dataset can best be summarized as what each individual row in the dataset represents. A dataset with finer granularity than another dataset has more detailed records. For example, a dataset where each row represents a residential neighborhood has coarser (the opposite of finer) granularity than a dataset where each row represents a single house in a neighborhood.
- Scope: the scope of the dataset is the subset of the population of interest that is covered by the data. For example, if you’re interested in alcohol information for all US states from 1970 until the present, the beer dataset we’re using covers only beer data from 1970-2003.
- Temporality: the temporality of the dataset describes its time element. We encourage you to consider which time period the data is measuring, when the data was originally measured, and when it was most recently updated. For example, our beer dataset spans 1970-2003 and was presumably last updated sometime shortly before 2009 (when the paper was published), but we would need to read the paper and its appendix to determine when the data was first measured.
Finally, real-world datasets are often messy: they contain inaccuracies and mistakes. Whenever you obtain a dataset, one of the first steps you ought to take is to check how ‘faithful’ the data is to the values it claims to describe. To do so, you should look for clearly incorrect values (e.g. a negative age), possible typos from data entry, signs of data falsification (e.g. invalid email addresses), unspecified or incorrect units, and so on.
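Checks like these can be sketched directly in pandas. The snippet below uses a small hand-made frame with a few of the beer dataset's columns (the values are illustrative, not the real data), since the idea applies to any dataset:

```python
import pandas as pd

# A small synthetic frame with a few of the beer dataset's columns
# (illustrative values only, not the real data)
sample = pd.DataFrame({
    'st_name': ['AL', 'AL', 'WY'],
    'year': [1970, 1971, 2003],
    'c_beer': [33098, 37598, 15535],
    'population': [3450, 3497, 501],
})

# Count missing values in each column
missing = sample.isna().sum()

# Flag clearly impossible values, e.g. negative consumption or population
has_negative = ((sample['c_beer'] < 0) | (sample['population'] < 0)).any()

# Check that the recorded years fall in the range the dataset claims to cover
years_ok = sample['year'].between(1970, 2003).all()

print(missing.sum(), has_negative, years_ok)
```

None of these checks proves the data is correct, but a failure on any of them is a strong hint that something went wrong upstream.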
DataFrames, Series and Indices#
In `pandas`, all tables are stored as data structures called DataFrames. Each DataFrame has an index (the bolded column on the left) and columns (each stored as a Series).
While the index in our DataFrame ranges from 0 to 1707, this does not have to be the case: indices don’t have to be numbers, and they don’t have to be unique. For example, let us create a DataFrame `df2` with an index that contains neither numbers nor unique values. Instead, the index is simply `a` for the first 854 rows and `b` for the last 854 rows.
df2 = df.copy() # Make a copy of the DataFrame df
df2.index = ['a'] * (len(df) // 2) + ['b'] * (len(df) // 2)
df2 # The index is neither numerical nor unique
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
b | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
b | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
b | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
b | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
b | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Each Series is simply a 1D ordering of data, similar to a NumPy array. For example, we pull out the `btax_dollars` Series from `df` in the cell below.
df['btax_dollars']
0 2.370
1 2.370
2 2.370
3 2.370
4 2.370
...
1703 0.045
1704 0.045
1705 0.045
1706 0.045
1707 0.045
Name: btax_dollars, Length: 1708, dtype: float64
As you can see above, to pull out a single Series from a DataFrame, we simply follow the notation `DataFrame_Name[Series_Name]`.

However, if we’re instead interested in pulling out multiple Series from our DataFrame, we’re essentially pulling out another, smaller DataFrame from the first (since multiple Series make up a DataFrame). So, instead of passing a single Series name into `df[...]`, we pass in a list containing the names of the Series we’re interested in. An example is below.
df[['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
0 | AL | 1970 | 2.370 |
1 | AL | 1971 | 2.370 |
2 | AL | 1972 | 2.370 |
3 | AL | 1973 | 2.370 |
4 | AL | 1974 | 2.370 |
... | ... | ... | ... |
1703 | WY | 1999 | 0.045 |
1704 | WY | 2000 | 0.045 |
1705 | WY | 2001 | 0.045 |
1706 | WY | 2002 | 0.045 |
1707 | WY | 2003 | 0.045 |
1708 rows × 3 columns
Series Attributes#
When we look at a Series, we get the index on the left and the corresponding values on the right. You can view the index via the attribute `index` and the values via the attribute `values`. Examples are shown below.
df['btax_dollars'].index
# How can you interpret the index here? Hint: look at the prerequisite chapter on NumPy.
RangeIndex(start=0, stop=1708, step=1)
df['btax_dollars'].values
array([2.37 , 2.37 , 2.37 , ..., 0.045, 0.045, 0.045])
DataFrame Attributes#
Three important DataFrame attributes are `index`, `columns`, and `shape`. `index` simply returns the index of the DataFrame, as shown below.
df.index
RangeIndex(start=0, stop=1708, step=1)
`columns` returns the names of the columns/Series that make up the DataFrame. Note that each column/Series name must be unique.
df.columns
Index(['st_name', 'year', 'c_beer', 'beer_tax', 'btax_dollars', 'population',
'salestax'],
dtype='object')
`shape` returns the dimensions of the DataFrame: the number of rows and the number of columns, in that order.
df.shape
(1708, 7)
Moreover, `size` is another attribute that returns the number of rows times the number of columns, i.e. the total number of datapoints in the DataFrame.
df.size
11956
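In other words, `size` is always the product of the two entries of `shape`. A quick sketch with a small made-up frame:

```python
import pandas as pd

# A tiny frame: 3 rows, 2 columns
small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = small.shape   # (3, 2)
print(small.size)          # 6, i.e. rows * cols
```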
Finally, `dtypes` returns the data type of each column in the DataFrame.
df.dtypes
st_name object
year int64
c_beer int64
beer_tax float64
btax_dollars float64
population int64
salestax float64
dtype: object
Note: the data type for the `st_name` column is `object` even though the column contains strings. This is because `pandas` stores strings in columns of the generic `object` dtype, which can hold arbitrary Python objects.
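As a brief illustration, a string column stored with the generic `object` dtype simply holds ordinary Python strings; pandas also offers a dedicated, opt-in string dtype via `astype`:

```python
import pandas as pd

# Recreate a small string column with the generic object dtype
s = pd.Series(['AL', 'AK', 'WY'], dtype=object)

print(s.dtype)          # the generic dtype that can hold arbitrary Python objects
print(type(s.iloc[0]))  # each entry is an ordinary Python string

# pandas also offers a dedicated (opt-in) string dtype
print(s.astype('string').dtype)
```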
Extracting Data using Labels and Indices#
There are four primary methods of extracting specific rows/columns from DataFrames. They are:

1. `.head` and `.tail`
2. `.loc`
3. `.iloc`
4. `[]`
1. `.head` and `.tail`#
If we just want to select the first or last n rows of a DataFrame, we can use `DataFrame_Name.head(n)` (for the first n rows) or `DataFrame_Name.tail(n)` (for the last n rows).
df.head(5) # Extract the first 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
df.tail(5) # Extract the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
2. `.loc`#
To better illustrate the difference between `.loc` and `.iloc`, we will use the DataFrame `df2` created earlier. We’ve displayed it again here for ease of access.
df2
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
b | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
b | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
b | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
b | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
b | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
`.loc` helps us get values based on the index/column labels associated with them. For example, let’s say we’re interested in all the `btax_dollars` values associated with an index label of `a`. We can use `.loc` to get all such values, as shown below.
df2.loc['a','btax_dollars']
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
a 0.135
a 0.135
a 0.135
a 0.135
a 0.135
Name: btax_dollars, Length: 854, dtype: float64
Above, we first passed in the row (index) labels we’re interested in, followed by the column labels. If we want multiple row/column labels, we must pass in a list of those labels. An example is shown below.
df2.loc['a',['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
a | MO | 1993 | 0.135 |
a | MO | 1994 | 0.135 |
a | MO | 1995 | 0.135 |
a | MO | 1996 | 0.135 |
a | MO | 1997 | 0.135 |
854 rows × 3 columns
We can also use a slice (`:`) to specify that we’re interested in all the labels.
df2.loc[:,['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
b | WY | 1999 | 0.045 |
b | WY | 2000 | 0.045 |
b | WY | 2001 | 0.045 |
b | WY | 2002 | 0.045 |
b | WY | 2003 | 0.045 |
1708 rows × 3 columns
3. `.iloc`#
`.iloc` works similarly to `.loc`, but it uses the integer positions associated with each row/column rather than their labels. For example, let’s extract all the `btax_dollars` values associated with an index label of `a` using `.iloc`.
df2.iloc[0:854,4]
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
a 0.135
a 0.135
a 0.135
a 0.135
a 0.135
Name: btax_dollars, Length: 854, dtype: float64
Here, we extracted the same Series as with `.loc`, but we needed to specify the index positions of the rows/columns we were interested in rather than their labels. We passed in `0:854` to indicate that we wanted the first 854 rows, and `4` to indicate that we wanted the fifth column.
Similar to `.loc`, we can pass in single values, lists, or slices as arguments to `.iloc`. However, everything we pass in must be an integer, since we’re referring to index positions. Another example is below.
df2.iloc[0:854,[0, 1, 4]]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
a | MO | 1993 | 0.135 |
a | MO | 1994 | 0.135 |
a | MO | 1995 | 0.135 |
a | MO | 1996 | 0.135 |
a | MO | 1997 | 0.135 |
854 rows × 3 columns
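One subtlety worth flagging: a slice in `.loc` includes its endpoint label, while a slice in `.iloc` excludes its endpoint position, just like ordinary Python slicing. A small sketch with a made-up frame:

```python
import pandas as pd

# A tiny frame with the default integer index 0..4
small = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

# .loc slices by label and includes both endpoints
print(len(small.loc[0:3]))   # 4 rows: labels 0, 1, 2, 3

# .iloc slices by position and excludes the stop, like Python lists
print(len(small.iloc[0:3]))  # 3 rows: positions 0, 1, 2
```

This is why `df2.iloc[0:854, 4]` above grabbed exactly the first 854 rows, positions 0 through 853.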
4. `[]`#
The `[]` operator is the hardest to describe, but the easiest to write and, for many, the easiest to use. It simply follows the notation `DataFrame_Name[...]`, where `...` can be a single column label (similar to `.loc`), a list of column labels (similar to `.loc`), or a slice of row index positions (similar to `.iloc`). We’ve included examples of all three below. Do not worry if this seems confusing at first; feel free to use one of the first three methods instead.
Single Column Label#
df2['btax_dollars']
a 2.370
a 2.370
a 2.370
a 2.370
a 2.370
...
b 0.045
b 0.045
b 0.045
b 0.045
b 0.045
Name: btax_dollars, Length: 1708, dtype: float64
List of Column Labels#
df2[['st_name','year','btax_dollars']]
st_name | year | btax_dollars | |
---|---|---|---|
a | AL | 1970 | 2.370 |
a | AL | 1971 | 2.370 |
a | AL | 1972 | 2.370 |
a | AL | 1973 | 2.370 |
a | AL | 1974 | 2.370 |
... | ... | ... | ... |
b | WY | 1999 | 0.045 |
b | WY | 2000 | 0.045 |
b | WY | 2001 | 0.045 |
b | WY | 2002 | 0.045 |
b | WY | 2003 | 0.045 |
1708 rows × 3 columns
Slice of Row Index Positions#
df2[0:5]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
a | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
a | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
a | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
a | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
a | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
In the next subchapter, we will talk about conditional selection, which allows us to view all rows that satisfy a certain condition.