import pandas as pd
import numpy as np
import datetime

Helpful Pandas Methods#

This subchapter goes over a number of simple but extremely useful pandas methods that you will likely use often.

To start, let’s load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across the 50 US states and Washington, DC. It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:

  • st_name: the state abbreviation

  • year: the year the data was recorded

  • c_beer: the quantity of beer consumed, in thousands of gallons

  • beer_tax: the ad valorem tax, as a percentage

  • btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer

  • population: the population of the state, in thousands

  • salestax: the sales tax percentage

df = pd.read_csv('data/beer_tax.csv')
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Summary Statistics#

To start, we will note that pandas is designed to work extremely well with NumPy. Most (if not all) NumPy functions can be applied to pandas DataFrames and Series. A few examples are shown below.

np.sum(df['year'])
3393246
np.mean(df[['year','c_beer','beer_tax']], axis = 0)

# What does axis = 0 represent? 
# Hint: read the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64
df[['year','c_beer','beer_tax']].mean(axis = 0) # np.mean() and DataFrame.mean() give the same result
year          1986.677986
c_beer      108031.059719
beer_tax         6.525340
dtype: float64

describe#

.describe() helps us easily compute several statistics from a DataFrame.

df.describe()
year c_beer beer_tax btax_dollars population salestax
count 1708.000000 1708.000000 1708.000000 1708.000000 1708.000000 1708.000000
mean 1986.677986 108031.059719 6.525340 0.524762 4837.031030 4.288511
std 9.761205 119280.769185 8.147708 0.523300 5275.527054 1.909266
min 1970.000000 5372.000000 0.289643 0.045000 304.000000 0.000000
25% 1978.000000 27886.750000 2.301018 0.188700 1219.000000 3.312250
50% 1987.000000 68306.000000 3.879157 0.360000 3303.000000 4.462500
75% 1995.000000 128838.500000 7.042739 0.603875 5703.250000 5.106250
max 2003.000000 691050.000000 72.341130 2.407500 35484.000000 9.000000
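
By default, .describe() only summarizes the numeric columns, which is why st_name is missing from the table above. If you also want a summary of the string columns, the include parameter can help; a minimal sketch:

df.describe(include='object') # count, number of unique values, most common value, and its frequency for string columns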

Finally, there are a whole host of intuitive DataFrame methods that calculate statistics such as the average, standard deviation, median, percentiles, etc. While we cannot possibly describe them all here, we strongly encourage you to google around to learn more about them. The descriptive statistics section of the pandas user guide (https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) is one link to get you started. A few examples are sketched after this paragraph.
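
For instance, here is a sketch of a few of these one-off methods; the values they return match the describe() table above.

df['population'].max()        # 35484
df['beer_tax'].median()       # 3.879157...
df['c_beer'].std()            # 119280.769...
df['salestax'].quantile(0.25) # 3.31225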

Sorting, Counting and Uniqueness#

Here are some of the methods used to deal with sorting, counting and uniqueness problems. Again, this list is not meant to be comprehensive, and you are encouraged to google around for other methods you may find useful.

Sorting#

As the name implies, .sort_values() can be used to sort DataFrames and Series! If sorting a DataFrame, you must specify which column to sort by, but there is naturally no such restriction when sorting a Series. Examples are shown below; note the ascending parameter.

df.sort_values('population', ascending = False)
st_name year c_beer beer_tax btax_dollars population salestax
169 CA 2003 663750 2.896433 0.4500 35484 7.25
168 CA 2002 659475 2.962444 0.4500 35002 7.25
167 CA 2001 652950 3.009280 0.4500 34533 7.00
166 CA 2000 637080 3.094911 0.4500 34000 7.25
165 CA 1999 638017 3.198941 0.4500 33499 7.25
... ... ... ... ... ... ... ...
38 AK 1974 7598 10.810215 0.5625 341 0.00
37 AK 1973 6453 12.003234 0.5625 331 0.00
36 AK 1972 6038 12.749847 0.5625 324 0.00
35 AK 1971 6336 13.159102 0.5625 316 0.00
34 AK 1970 5372 13.735660 0.5625 304 0.00

1708 rows × 7 columns

df['population'].sort_values(ascending = True)
34       304
35       316
36       324
37       331
38       341
       ...  
165    33499
166    34000
167    34533
168    35002
169    35484
Name: population, Length: 1708, dtype: int64
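
You can also sort by several columns at once by passing lists; for instance, this sketch sorts alphabetically by state and then from newest to oldest year within each state:

df.sort_values(['st_name', 'year'], ascending=[True, False])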

Counting#

Often used for Series (although it can be used for DataFrames as well), the .value_counts() method counts the number of times each value (for Series) or row (for DataFrames) appears, and returns the counts sorted from most common to least. For example,

df['st_name'].value_counts()
AL    34
VA    34
AK    34
MT    34
NE    34
NV    34
NH    34
NJ    34
NM    34
NY    34
NC    34
ND    34
OH    34
OK    34
OR    34
PA    34
RI    34
SC    34
SD    34
TN    34
TX    34
MS    34
MI    34
VT    34
MA    34
AZ    34
AR    34
CA    34
CO    34
CT    34
DE    34
DC    34
FL    34
GA    34
WI    34
ID    34
IL    34
IN    34
WV    34
WA    34
KY    34
LA    34
ME    34
MD    34
UT    34
MO    32
MN    32
KS    32
IA    32
WY    32
HI    18
Name: st_name, dtype: int64
df.value_counts()
st_name  year  c_beer  beer_tax   btax_dollars  population  salestax
AK       1970  5372    13.735660  0.5625        304         0.0         1
NY       1985  369214  1.362627   0.1238        17792       4.0         1
         1983  394026  1.376949   0.1158        17687       4.0         1
         1982  400138  1.227274   0.1000        17590       4.0         1
         1981  395584  1.302881   0.1000        17568       4.0         1
                                                                       ..
KY       1978  64414   3.296839   0.1815        3610        5.0         1
         1977  70755   3.547095   0.1815        3574        5.0         1
         1976  59481   3.777749   0.1815        3529        5.0         1
         1975  60230   3.995426   0.1815        3468        5.0         1
WY       2003  15535   0.289643   0.0450        501         4.0         1
Length: 1708, dtype: int64
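
If proportions are more useful than raw counts, .value_counts() also takes a normalize parameter; a quick sketch:

df['st_name'].value_counts(normalize=True) # each state's share of the 1708 rows, summing to 1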

Unique Values and Duplicates#

If we’re interested in just obtaining the unique values from a Series, we can use the .unique() method.

df['st_name'].unique()
array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
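
Relatedly, if we only care about how many unique values there are, .nunique() returns the count directly.

df['st_name'].nunique() # 51: the 50 states plus DC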

Alternatively, if we want to flag the duplicated values in a Series or duplicated rows in a DataFrame, we can use .duplicated(). It returns a boolean Series marking every entry that is a repeat of an earlier one.

df['st_name'].duplicated()
0       False
1        True
2        True
3        True
4        True
        ...  
1703     True
1704     True
1705     True
1706     True
1707     True
Name: st_name, Length: 1708, dtype: bool
df.duplicated() # Our dataframe has no duplicate rows!
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool

If we want to drop all the duplicated values, we can use .drop_duplicates(). For a Series, this returns the values with duplicates removed (keeping the first occurrence of each), and for a DataFrame, it does the same for rows.

df['st_name'].drop_duplicates()
0       AL
34      AK
68      AZ
102     AR
136     CA
170     CO
204     CT
238     DE
272     DC
306     FL
340     GA
374     HI
392     ID
426     IL
460     IN
494     IA
526     KS
558     KY
592     LA
626     ME
660     MD
694     MA
728     MI
762     MN
794     MS
828     MO
860     MT
894     NE
928     NV
962     NH
996     NJ
1030    NM
1064    NY
1098    NC
1132    ND
1166    OH
1200    OK
1234    OR
1268    PA
1302    RI
1336    SC
1370    SD
1404    TN
1438    TX
1472    UT
1506    VT
1540    VA
1574    WA
1608    WV
1642    WI
1676    WY
Name: st_name, dtype: object
df.drop_duplicates() # We don't drop anything as we don't have duplicates
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970 33098 72.341130 2.370 3450 4.0
1 AL 1971 37598 69.304600 2.370 3497 4.0
2 AL 1972 42719 67.149190 2.370 3539 4.0
3 AL 1973 46203 63.217026 2.370 3580 4.0
4 AL 1974 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999 12423 0.319894 0.045 492 4.0
1704 WY 2000 12595 0.309491 0.045 494 4.0
1705 WY 2001 12808 0.300928 0.045 494 4.0
1706 WY 2002 13191 0.296244 0.045 499 4.0
1707 WY 2003 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns
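
.drop_duplicates() also accepts subset and keep parameters to control which columns define a duplicate and which occurrence survives; a sketch:

df.drop_duplicates(subset=['st_name'], keep='last') # one row per state, keeping its last (here, most recent) row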

Missing Values#

Missing values are often stored as NaN values in a dataset. While our dataset doesn’t have any missing values, we can use .isna() to check for them.

df.isna()
st_name year c_beer beer_tax btax_dollars population salestax
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
1703 False False False False False False False
1704 False False False False False False False
1705 False False False False False False False
1706 False False False False False False False
1707 False False False False False False False

1708 rows × 7 columns

df.isna().any(axis = 0) # See which columns have NaN values
# Hint: read the documentation for df.any(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html
st_name         False
year            False
c_beer          False
beer_tax        False
btax_dollars    False
population      False
salestax        False
dtype: bool
df.isna().any(axis = 1) # See which rows have NaN values
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Length: 1708, dtype: bool
print(df.isna().any(axis = 0).any()) # These can be used to check if there are NaNs anywhere in the data
print(df.isna().any(axis = 1).any()) 
False
False

.isna() also works for Series.

df['st_name'].isna()
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].isna().any()
False
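
If our data did contain missing values, two common follow-ups are .dropna() and .fillna(). A minimal sketch (both are no-ops on this particular dataset, since nothing is missing):

df.dropna()              # drop every row that contains at least one NaN
df['salestax'].fillna(0) # replace NaNs in a column with a default value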

String Methods#

String methods are a great tool for working with string data. They only work on Series. To use them, you append .str to your Series to indicate that you want a string method, followed by the method itself. Some of the various string methods are given below, but again this list is not comprehensive.

len#

Gives the length of each string.

df['st_name'].str.len()
0       2
1       2
2       2
3       2
4       2
       ..
1703    2
1704    2
1705    2
1706    2
1707    2
Name: st_name, Length: 1708, dtype: int64

String slicing#

Like list slicing, but for strings. Review the Python prereqs if you’re unsure.

df['st_name'].str[0]
0       A
1       A
2       A
3       A
4       A
       ..
1703    W
1704    W
1705    W
1706    W
1707    W
Name: st_name, Length: 1708, dtype: object

contains#

Returns whether each string contains the given substring.

df['st_name'].str.contains('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

replace#

Replaces all occurrences of one substring with another.

df['st_name'].str.replace('A','a')
0       aL
1       aL
2       aL
3       aL
4       aL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object

upper and lower#

Converts each string in a Series to uppercase or lowercase.

df['st_name'].str.upper()
0       AL
1       AL
2       AL
3       AL
4       AL
        ..
1703    WY
1704    WY
1705    WY
1706    WY
1707    WY
Name: st_name, Length: 1708, dtype: object
df['st_name'].str.lower()
0       al
1       al
2       al
3       al
4       al
        ..
1703    wy
1704    wy
1705    wy
1706    wy
1707    wy
Name: st_name, Length: 1708, dtype: object

startswith and endswith#

Returns whether each string starts or ends with the given substring.

df['st_name'].str.startswith('A')
0        True
1        True
2        True
3        True
4        True
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].str.endswith('A')
0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706    False
1707    False
Name: st_name, Length: 1708, dtype: bool

split#

Splits each string on the given delimiter, returning a list of the resulting substrings; no split is performed if the delimiter doesn’t appear in the string. This is best understood by looking at an example.

samp_series = df.loc[[68, 102, 1574, 592, 1302, 1642],'st_name']
samp_series
68      AZ
102     AR
1574    WA
592     LA
1302    RI
1642    WI
Name: st_name, dtype: object
samp_series.str.split('A')
68      [, Z]
102     [, R]
1574    [W, ]
592     [L, ]
1302     [RI]
1642     [WI]
Name: st_name, dtype: object

As you can see above, the strings have been split on the letter A; note the empty strings that appear where the A was at the start or end of a string. Strings which don’t contain the letter A weren’t split.

If we want to return the output as a DataFrame, we can set the expand parameter to True, as shown below.

samp_series.str.split('A', expand = True)
       0     1
68           Z
102          R
1574   W
592    L
1302  RI  None
1642  WI  None
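
Since each string method returns a new Series, string methods can also be chained; for example, this sketch lowercases the abbreviations and then checks for a substring:

df['st_name'].str.lower().str.contains('a') # True for states whose abbreviation contains an A (in any case)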

DateTime Objects#

DateTime objects are incredibly helpful for working with dates. They let us easily perform calculations like adding or subtracting days and dates, converting timezones, sorting dates, and more. We can convert a Series to DateTime objects by using pd.to_datetime(), as done below.

pd.to_datetime(df['year'], format='%Y')
0      1970-01-01
1      1971-01-01
2      1972-01-01
3      1973-01-01
4      1974-01-01
          ...    
1703   1999-01-01
1704   2000-01-01
1705   2001-01-01
1706   2002-01-01
1707   2003-01-01
Name: year, Length: 1708, dtype: datetime64[ns]

Notice how we specify the format of our dates so that the conversion happens correctly. We don’t always need to do this (especially if we have the year, month and day in a well-known format), but it is good practice.
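
The format string uses the standard strftime codes. As a sketch with hypothetical data, here is how you would parse day-first dates, which pandas might otherwise misread as month-first:

pd.to_datetime(pd.Series(['05/06/2003', '25/12/2003']), format='%d/%m/%Y') # June 5th and December 25th, 2003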

Let us convert the year column in the DataFrame to DateTime objects and perform some calculations.

df['year'] = pd.to_datetime(df['year'], format='%Y')
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970-01-01 33098 72.341130 2.370 3450 4.0
1 AL 1971-01-01 37598 69.304600 2.370 3497 4.0
2 AL 1972-01-01 42719 67.149190 2.370 3539 4.0
3 AL 1973-01-01 46203 63.217026 2.370 3580 4.0
4 AL 1974-01-01 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999-01-01 12423 0.319894 0.045 492 4.0
1704 WY 2000-01-01 12595 0.309491 0.045 494 4.0
1705 WY 2001-01-01 12808 0.300928 0.045 494 4.0
1706 WY 2002-01-01 13191 0.296244 0.045 499 4.0
1707 WY 2003-01-01 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Notice how the dates all default to January 1st. Let’s say we know that the measurements were actually done 4 days after the given date for each year. We can implement this in our DataFrame.

df['year'] = df['year'] + pd.DateOffset(days=4)
df
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970-01-05 33098 72.341130 2.370 3450 4.0
1 AL 1971-01-05 37598 69.304600 2.370 3497 4.0
2 AL 1972-01-05 42719 67.149190 2.370 3539 4.0
3 AL 1973-01-05 46203 63.217026 2.370 3580 4.0
4 AL 1974-01-05 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999-01-05 12423 0.319894 0.045 492 4.0
1704 WY 2000-01-05 12595 0.309491 0.045 494 4.0
1705 WY 2001-01-05 12808 0.300928 0.045 494 4.0
1706 WY 2002-01-05 13191 0.296244 0.045 499 4.0
1707 WY 2003-01-05 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Alternatively, if we instead know that the measurement was always taken on June 5th, we can specify that as well. Feel free to ignore the PerformanceWarning for now, but you can read more in the pandas documentation if you’re interested.

df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
df
/var/folders/td/dtym05z13119cjklqb5tyd040000gn/T/ipykernel_89001/3513634716.py:1: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex.
  df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970-06-05 33098 72.341130 2.370 3450 4.0
1 AL 1971-06-05 37598 69.304600 2.370 3497 4.0
2 AL 1972-06-05 42719 67.149190 2.370 3539 4.0
3 AL 1973-06-05 46203 63.217026 2.370 3580 4.0
4 AL 1974-06-05 49769 56.933796 2.370 3627 4.0
... ... ... ... ... ... ... ...
1703 WY 1999-06-05 12423 0.319894 0.045 492 4.0
1704 WY 2000-06-05 12595 0.309491 0.045 494 4.0
1705 WY 2001-06-05 12808 0.300928 0.045 494 4.0
1706 WY 2002-06-05 13191 0.296244 0.045 499 4.0
1707 WY 2003-06-05 15535 0.289643 0.045 501 4.0

1708 rows × 7 columns

Note the difference between the day and the days parameter; days adds to the current day while day replaces the current day. Similar differences exist for month vs months, year vs years, etc.
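
Here is a quick illustration of the difference on a single Timestamp (a standalone sketch):

pd.Timestamp('2000-06-05') + pd.DateOffset(days=4) # Timestamp('2000-06-09'): adds 4 days
pd.Timestamp('2000-06-05') + pd.DateOffset(day=4)  # Timestamp('2000-06-04'): replaces the day with 4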

Now, let’s look at how a datapoint is stored.

df.loc[0,'year']
Timestamp('1970-06-05 00:00:00')

At index 20, the year is 1990. Let’s subtract the dates.

df.loc[20,'year'] - df.loc[0,'year']
Timedelta('7305 days 00:00:00')

Interestingly, we can also subtract the years directly.

df.loc[20,'year'].year - df.loc[0,'year'].year
20

Let’s also try subtracting the months.

df.loc[20,'year'].month - df.loc[0,'year'].month
0

The months are subtracted directly, and since both months were set to June, there is no difference. To get the total number of months between two dates, we must instead subtract the years, multiply that by 12, and then add the difference in months. The calculation is done below.

(df.loc[20,'year'].year - df.loc[0,'year'].year) * 12 + (df.loc[20,'year'].month - df.loc[0,'year'].month)
240
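
As an alternative sketch, converting the Timestamps to monthly Periods lets pandas do the month arithmetic for us (the exact return type varies a little between pandas versions):

df.loc[20,'year'].to_period('M') - df.loc[0,'year'].to_period('M') # 240 months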

Finally, using DateTime objects also helps us sort dates easily!

df.sort_values('year')
st_name year c_beer beer_tax btax_dollars population salestax
0 AL 1970-06-05 33098 72.341130 2.3700 3450 4.00
170 CO 1970-06-05 42145 4.120698 0.1350 2224 3.00
592 LA 1970-06-05 68578 25.456755 0.8340 3650 2.00
1302 RI 1970-06-05 20360 5.686564 0.1863 951 5.00
860 MT 1970-06-05 17179 6.648059 0.2178 697 0.00
... ... ... ... ... ... ... ...
135 AR 2003-06-05 52045 3.507258 0.5449 2726 5.13
893 MT 2003-06-05 26100 2.008837 0.3121 918 0.00
1471 TX 2003-06-05 558837 2.867468 0.4455 22119 6.25
1675 WI 2003-06-05 151000 0.934582 0.1452 5472 5.00
1707 WY 2003-06-05 15535 0.289643 0.0450 501 4.00

1708 rows × 7 columns
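
Finally, just as .str unlocks string methods on a Series, the .dt accessor unlocks datetime attributes on a datetime Series; a brief sketch:

df['year'].dt.year  # the year of each date, as integers
df['year'].dt.month # the month of each date (6 for every row, after our offset above)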