
Helpful Pandas Methods

import pandas as pd
import numpy as np
import datetime

This subchapter covers a number of simple but extremely useful pandas methods that you are likely to use often.

To start, let’s load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across US states. It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

df = pd.read_csv('data/beer_tax.csv')
df

Summary Statistics

To start, we will note that pandas is designed to work extremely well with NumPy. Most (if not all) NumPy functions can be applied on pandas DataFrames and Series. A few examples are shown below.

np.sum(df['year'])
3393246
np.mean(df[['year','c_beer','beer_tax']], axis = 0)

# What does axis = 0 represent? 
# Hint: read the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64
df[['year','c_beer','beer_tax']].mean(axis = 0) # np.mean() and DataFrame.mean() give the same result
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64

describe

.describe() helps us easily compute several statistics from a DataFrame.

df.describe()

Finally, there are a whole host of intuitive DataFrame methods that calculate statistics such as the average, standard deviation, median, percentiles, etc. While we cannot possibly describe them all here, we strongly encourage you to google around to learn more about them. Here is one link to get you started.
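As a quick sketch of a few of these methods, here is a small toy DataFrame (standing in for the beer dataset) with some of the common statistic methods applied:

```python
import pandas as pd

# A small toy DataFrame standing in for the beer dataset
toy = pd.DataFrame({'year': [1970, 1971, 1972, 1973],
                    'c_beer': [100, 200, 300, 400]})

toy['c_beer'].mean()          # average: 250.0
toy['c_beer'].median()        # middle value: 250.0
toy['c_beer'].std()           # sample standard deviation
toy['c_beer'].quantile(0.25)  # 25th percentile: 175.0
```

Each of these also works on a whole DataFrame at once (e.g. `toy.mean(numeric_only=True)`), returning one statistic per column.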

Sorting, Counting and Uniqueness

Here are some of the methods used to deal with sorting, counting and uniqueness problems. Again, this list is not meant to be comprehensive, and you are encouraged to google around for other methods you may find useful.

Sorting

As the name implies, .sort_values() can be used to sort DataFrames and Series! If sorting a DataFrame, you must specify which column to sort by, but there is naturally no such restriction for Series. Examples are shown below; note the ascending parameter.

df.sort_values('population', ascending = False)
df['population'].sort_values(ascending = True)
34        304
35        316
36        324
37        331
38        341
        ...  
165     33499
166     34000
167     34533
168     35002
169     35484
Name: population, Length: 1708, dtype: int64
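.sort_values() can also sort by several columns at once: pass a list of column names, and (optionally) a list of booleans to ascending to set a direction per column. A small sketch on a toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'st_name': ['AL', 'AK', 'AL', 'AK'],
                    'year': [1971, 1970, 1970, 1971]})

# Sort by state first, then by year within each state
toy.sort_values(['st_name', 'year'])

# Per-column sort directions: state ascending, year descending
toy.sort_values(['st_name', 'year'], ascending=[True, False])
```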

Counting

Often used for Series (although it can be used for DataFrames as well), the .value_counts() method counts the number of times a value (for Series) or row (for DataFrames) appears, and returns the values sorted from most common to least. For example,

df['st_name'].value_counts()
AL    34
VA    34
AK    34
MT    34
NE    34
NV    34
NH    34
NJ    34
NM    34
NY    34
NC    34
ND    34
OH    34
OK    34
OR    34
PA    34
RI    34
SC    34
SD    34
TN    34
TX    34
MS    34
MI    34
VT    34
MA    34
AZ    34
AR    34
CA    34
CO    34
CT    34
DE    34
DC    34
FL    34
GA    34
WI    34
ID    34
IL    34
IN    34
WV    34
WA    34
KY    34
LA    34
ME    34
MD    34
UT    34
MO    32
MN    32
KS    32
IA    32
WY    32
HI    18
Name: st_name, dtype: int64
df.value_counts()
st_name  year  c_beer  beer_tax   btax_dollars  population  salestax
AK       1970  5372    13.735660  0.5625        304         0.0         1
NY       1985  369214  1.362627   0.1238        17792       4.0         1
         1983  394026  1.376949   0.1158        17687       4.0         1
         1982  400138  1.227274   0.1000        17590       4.0         1
         1981  395584  1.302881   0.1000        17568       4.0         1
                                                                      ..
KY       1978  64414   3.296839   0.1815        3610        5.0         1
         1977  70755   3.547095   0.1815        3574        5.0         1
         1976  59481   3.777749   0.1815        3529        5.0         1
         1975  60230   3.995426   0.1815        3468        5.0         1
WY       2003  15535   0.289643   0.0450        501         4.0         1
Length: 1708, dtype: int64
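A useful variation: passing normalize=True to .value_counts() returns proportions instead of raw counts. A small sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['CA', 'CA', 'NY', 'TX'])

s.value_counts()                # raw counts, most common first
s.value_counts(normalize=True)  # proportions instead of counts; CA -> 0.5
```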

Unique Values and Duplicates

If we’re interested in just obtaining the unique values from a Series, we can use the .unique() method.

df['st_name'].unique()
array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

Alternatively, if we want to flag all the duplicated values in a Series or duplicated rows in a DataFrame, we can use .duplicated().

df['st_name'].duplicated()
0       False
1        True
2        True
3        True
4        True
        ...  
1703     True
1704     True
1705     True
1706     True
1707     True
Name: st_name, Length: 1708, dtype: bool
df.duplicated() # Our dataframe has no duplicate rows!
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool
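By default, .duplicated() marks every occurrence after the first. Its keep parameter controls this, as sketched here on a toy Series:

```python
import pandas as pd

s = pd.Series(['AL', 'AK', 'AL'])

s.duplicated()              # keep='first' (default): [False, False, True]
s.duplicated(keep='last')   # keep the last copy:     [True, False, False]
s.duplicated(keep=False)    # mark every copy:        [True, False, True]
```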

If we want to drop all the duplicated values, we can use .drop_duplicates(). This keeps only the first occurrence of each value (for a Series) or each row (for a DataFrame).

df['st_name'].drop_duplicates()
0       AL
34      AK
68      AZ
102     AR
136     CA
170     CO
204     CT
238     DE
272     DC
306     FL
340     GA
374     HI
392     ID
426     IL
460     IN
494     IA
526     KS
558     KY
592     LA
626     ME
660     MD
694     MA
728     MI
762     MN
794     MS
828     MO
860     MT
894     NE
928     NV
962     NH
996     NJ
1030    NM
1064    NY
1098    NC
1132    ND
1166    OH
1200    OK
1234    OR
1268    PA
1302    RI
1336    SC
1370    SD
1404    TN
1438    TX
1472    UT
1506    VT
1540    VA
1574    WA
1608    WV
1642    WI
1676    WY
Name: st_name, dtype: object
df.drop_duplicates() # We don't drop anything as we don't have duplicates

Missing Values

Missing values are often stored as NaN values in our dataset. While our dataset doesn’t have any missing values, we can still use .isna() to check for them.

df.isna()
df.isna().any(axis = 0) # See which columns have NaN values
# Hint: read the documentation for df.any(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html
st_name         False
year            False
c_beer          False
beer_tax        False
btax_dollars    False
population      False
salestax        False
dtype: bool
df.isna().any(axis = 1) # See which rows have NaN values
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool
print(df.isna().any(axis = 0).any()) # These can be used to check if there are NaN's anywhere in the data
print(df.isna().any(axis = 1).any()) 
False
False

.isna() also works for Series.

df['st_name'].isna()
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].isna().any()
False
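If missing values do turn up, two common companions to .isna() are .dropna(), which removes them, and .fillna(), which replaces them. A small sketch on a toy Series containing a NaN:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

s.dropna()   # drop the missing entry, leaving 1.0 and 3.0
s.fillna(0)  # or replace it with a fill value: 1.0, 0.0, 3.0
```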

String Methods

String methods are a great tool for working with string data. They only work on Series. To use them, you add a .str after your Series to indicate that you want to use a string method, followed by the method you want to use. Some of the various string methods are given below, but again this list is not comprehensive.

len

Gives the length of each string.

df['st_name'].str.len()
0       2
1       2
2       2
3       2
4       2
       ..
1703    2
1704    2
1705    2
1706    2
1707    2
Name: st_name, Length: 1708, dtype: int64

String slicing

Like list slicing, but for strings. Review the Python prereqs if you’re unsure.

df['st_name'].str[0]
0       A
1       A
2       A
3       A
4       A
       ..
1703    W
1704    W
1705    W
1706    W
1707    W
Name: st_name, Length: 1708, dtype: object

contains

Returns whether each string contains the given substring.

df['st_name'].str.contains('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

replace

Replaces occurrences of a substring with another substring.

df['st_name'].str.replace('A','a')
0       aL
1       aL
2       aL
3       aL
4       aL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object

upper and lower

Converts the strings in a Series to uppercase or lowercase.

df['st_name'].str.upper()
0       AL
1       AL
2       AL
3       AL
4       AL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object
df['st_name'].str.lower()
0       al
1       al
2       al
3       al
4       al
        ..
1703    wy
1704    wy
1705    wy
1706    wy
1707    wy
Name: st_name, Length: 1708, dtype: object

startswith and endswith

Returns whether each string starts or ends with the given substring.

df['st_name'].str.startswith('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].str.endswith('A')
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

split

Splits each string on the given delimiter, returning a list of the resulting substrings. If the delimiter doesn’t appear in a string, no split is performed. This is best understood with an example.

samp_series = df.loc[[68, 102, 1574, 592, 1302, 1642],'st_name']
samp_series
68      AZ
102     AR
1574    WA
592     LA
1302    RI
1642    WI
Name: st_name, dtype: object
samp_series.str.split('A')
68      [, Z]
102     [, R]
1574    [W, ]
592     [L, ]
1302     [RI]
1642     [WI]
Name: st_name, dtype: object

As you can see above, all the strings have been split on the letter A. Strings which don’t have the letter A weren’t split.

If we want to return the output as a DataFrame, we can set the expand parameter to be True, as shown below.

samp_series.str.split('A', expand = True)

DateTime Objects

DateTime objects are incredibly helpful for working with dates. They let us easily perform calculations such as adding or subtracting days, converting timezones, and sorting dates. We can convert a Series to DateTime objects using pd.to_datetime(), as done below.

pd.to_datetime(df['year'], format='%Y')
0      1970-01-01
1      1971-01-01
2      1972-01-01
3      1973-01-01
4      1974-01-01
          ...    
1703   1999-01-01
1704   2000-01-01
1705   2001-01-01
1706   2002-01-01
1707   2003-01-01
Name: year, Length: 1708, dtype: datetime64[ns]

Notice how we specify the format of our dates to ensure the conversion happens correctly. We don’t always need to do this (especially if we have the year, month and day in a well-known format), but it is good practice to ensure the conversion happens correctly.
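Once a Series holds datetimes, the .dt accessor pulls individual components out of every entry at once, much like .str does for strings. A small sketch on a toy Series:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['1970-01-01', '2003-06-05']))

# The .dt accessor extracts components from a datetime Series
dates.dt.year        # 1970, 2003
dates.dt.month       # 1, 6
dates.dt.day_name()  # the weekday name of each date
```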

Let us convert the year column in the DataFrame to a DateTime object and perform some calculations.

df['year'] = pd.to_datetime(df['year'], format='%Y')
df

Notice how the dates all default to January 1st. Let’s say we know that the measurements were actually taken 4 days after the given date for each year. We can implement this in our DataFrame.

df['year'] = df['year'] + pd.DateOffset(days=4)
df

Alternatively, if we instead know that the measurement was always taken on June 5th, we can specify that as well. Feel free to ignore the PerformanceWarning for now, but you can read more here if you’re interested.

df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
df
/var/folders/td/dtym05z13119cjklqb5tyd040000gn/T/ipykernel_89001/3513634716.py:1: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex.
  df['year'] = df['year'] + pd.DateOffset(month=6, day=5)

Note the difference between the day and the days parameter; days adds to the current day while day replaces the current day. Similar differences exist for month vs months, year vs years, etc.
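The difference is easy to see on a single Timestamp, sketched below:

```python
import pandas as pd

ts = pd.Timestamp('1970-01-01')

ts + pd.DateOffset(days=4)  # shifts forward:    Timestamp('1970-01-05')
ts + pd.DateOffset(day=4)   # replaces the day:  Timestamp('1970-01-04')
```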

Now, let’s look at how a datapoint is stored.

df.loc[0,'year']
Timestamp('1970-06-05 00:00:00')

At index 20, the year is 1990. Let’s subtract the dates.

df.loc[20,'year'] - df.loc[0,'year']
Timedelta('7305 days 00:00:00')

Interestingly, we can also subtract the years directly.

df.loc[20,'year'].year - df.loc[0,'year'].year
20

Let’s also try subtracting the months.

df.loc[20,'year'].month - df.loc[0,'year'].month
0

The months are subtracted directly. As both of these months were set to June, there is no difference. Instead, to see the number of months between two dates, we must subtract the years, multiply that difference by 12, and then add the difference in months. This calculation is done below.

(df.loc[20,'year'].year - df.loc[0,'year'].year) * 12 + (df.loc[20,'year'].month - df.loc[0,'year'].month)
240
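The same year-times-12-plus-months arithmetic can be delegated to pandas by converting the Timestamps to monthly Periods and subtracting those; a sketch with the same two dates:

```python
import pandas as pd

start = pd.Timestamp('1970-06-05')
end = pd.Timestamp('1990-06-05')

# Subtracting monthly Periods yields a month offset; .n gives the count
(end.to_period('M') - start.to_period('M')).n  # 240
```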

Finally, using DateTime objects also helps us sort dates easily!

df.sort_values('year')
References
  1. Chetty, R., Looney, A., & Kroft, K. (2009). Salience and Taxation: Theory and Evidence. American Economic Review, 99(4), 1145–1177. doi:10.1257/aer.99.4.1145