import pandas as pd
import numpy as np
import datetime
Helpful Pandas Methods#
This subchapter covers a number of simple but extremely useful pandas methods that you are likely to use often.
To start, let's load in the same dataset as the first subchapter in the Pandas chapter. As a reminder, this dataset has beer sales across US states (all 50 states plus the District of Columbia). It is sourced from Salience and Taxation: Theory and Evidence by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:
st_name: the state abbreviation
year: the year the data was recorded
c_beer: the quantity of beer consumed, in thousands of gallons
beer_tax: the ad valorem tax, as a percentage
btax_dollars: the excise tax, represented in dollars per case (24 cans) of beer
population: the population of the state, in thousands
salestax: the sales tax percentage
df = pd.read_csv('data/beer_tax.csv')
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Summary Statistics#
To start, note that pandas is designed to work extremely well with NumPy: most (if not all) NumPy functions can be applied to pandas DataFrames and Series. A few examples are shown below.
np.sum(df['year'])
3393246
np.mean(df[['year','c_beer','beer_tax']], axis = 0)
# What does axis = 0 represent?
# Hint: read the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
year 1986.677986
c_beer 108031.059719
beer_tax 6.525340
dtype: float64
df[['year','c_beer','beer_tax']].mean(axis = 0) # np.mean() and DataFrame.mean() give the same result
year 1986.677986
c_beer 108031.059719
beer_tax 6.525340
dtype: float64
describe#
.describe() helps us easily compute several summary statistics from a DataFrame at once.
df.describe()
 | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|
count | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 | 1708.000000 |
mean | 1986.677986 | 108031.059719 | 6.525340 | 0.524762 | 4837.031030 | 4.288511 |
std | 9.761205 | 119280.769185 | 8.147708 | 0.523300 | 5275.527054 | 1.909266 |
min | 1970.000000 | 5372.000000 | 0.289643 | 0.045000 | 304.000000 | 0.000000 |
25% | 1978.000000 | 27886.750000 | 2.301018 | 0.188700 | 1219.000000 | 3.312250 |
50% | 1987.000000 | 68306.000000 | 3.879157 | 0.360000 | 3303.000000 | 4.462500 |
75% | 1995.000000 | 128838.500000 | 7.042739 | 0.603875 | 5703.250000 | 5.106250 |
max | 2003.000000 | 691050.000000 | 72.341130 | 2.407500 | 35484.000000 | 9.000000 |
Finally, there is a whole host of intuitive DataFrame methods that calculate statistics such as the average, standard deviation, median, and percentiles. While we cannot possibly describe them all here, we strongly encourage you to google around to learn more about them. Here is one link to get you started.
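For instance, individual statistics can be computed directly with their own methods. A small sketch (the column choice here is just illustrative):
df['c_beer'].median() # median beer consumption
df['c_beer'].std() # sample standard deviation
df['c_beer'].quantile(0.75) # 75th percentile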
Sorting, Counting and Uniqueness#
Here are some of the methods used to deal with sorting, counting and uniqueness problems. Again, this list is not meant to be comprehensive, and you are encouraged to google around for other methods you may find useful.
Sorting#
As the name implies, .sort_values() can be used to sort DataFrames and Series! If sorting a DataFrame, you need to specify which column to sort by; there is naturally no such restriction when sorting a Series. Examples are shown below; note the ascending parameter.
df.sort_values('population', ascending = False)
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
169 | CA | 2003 | 663750 | 2.896433 | 0.4500 | 35484 | 7.25 |
168 | CA | 2002 | 659475 | 2.962444 | 0.4500 | 35002 | 7.25 |
167 | CA | 2001 | 652950 | 3.009280 | 0.4500 | 34533 | 7.00 |
166 | CA | 2000 | 637080 | 3.094911 | 0.4500 | 34000 | 7.25 |
165 | CA | 1999 | 638017 | 3.198941 | 0.4500 | 33499 | 7.25 |
... | ... | ... | ... | ... | ... | ... | ... |
38 | AK | 1974 | 7598 | 10.810215 | 0.5625 | 341 | 0.00 |
37 | AK | 1973 | 6453 | 12.003234 | 0.5625 | 331 | 0.00 |
36 | AK | 1972 | 6038 | 12.749847 | 0.5625 | 324 | 0.00 |
35 | AK | 1971 | 6336 | 13.159102 | 0.5625 | 316 | 0.00 |
34 | AK | 1970 | 5372 | 13.735660 | 0.5625 | 304 | 0.00 |
1708 rows × 7 columns
df['population'].sort_values(ascending = True)
34 304
35 316
36 324
37 331
38 341
...
165 33499
166 34000
167 34533
168 35002
169 35484
Name: population, Length: 1708, dtype: int64
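You can also sort by several columns at once by passing a list of column names; rows are ordered by the first column, with ties broken by the next. The ascending parameter accepts a matching list. A minimal sketch:
df.sort_values(['st_name', 'year'], ascending=[True, False]) # A-Z by state; within each state, newest year first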
Counting#
Often used for Series (although it works for DataFrames as well), the .value_counts() method counts the number of times each value (for a Series) or row (for a DataFrame) appears, and returns the counts sorted from most common to least. For example,
df['st_name'].value_counts()
AL 34
VA 34
AK 34
MT 34
NE 34
NV 34
NH 34
NJ 34
NM 34
NY 34
NC 34
ND 34
OH 34
OK 34
OR 34
PA 34
RI 34
SC 34
SD 34
TN 34
TX 34
MS 34
MI 34
VT 34
MA 34
AZ 34
AR 34
CA 34
CO 34
CT 34
DE 34
DC 34
FL 34
GA 34
WI 34
ID 34
IL 34
IN 34
WV 34
WA 34
KY 34
LA 34
ME 34
MD 34
UT 34
MO 32
MN 32
KS 32
IA 32
WY 32
HI 18
Name: st_name, dtype: int64
df.value_counts()
st_name year c_beer beer_tax btax_dollars population salestax
AK 1970 5372 13.735660 0.5625 304 0.0 1
NY 1985 369214 1.362627 0.1238 17792 4.0 1
1983 394026 1.376949 0.1158 17687 4.0 1
1982 400138 1.227274 0.1000 17590 4.0 1
1981 395584 1.302881 0.1000 17568 4.0 1
..
KY 1978 64414 3.296839 0.1815 3610 5.0 1
1977 70755 3.547095 0.1815 3574 5.0 1
1976 59481 3.777749 0.1815 3529 5.0 1
1975 60230 3.995426 0.1815 3468 5.0 1
WY 2003 15535 0.289643 0.0450 501 4.0 1
Length: 1708, dtype: int64
Unique Values and Duplicates#
If we’re interested in just obtaining the unique values from a Series, we can use the .unique() method.
df['st_name'].unique()
array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
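Relatedly, if you only need the number of distinct values rather than the values themselves, .nunique() returns the count directly:
df['st_name'].nunique() # 51 here: the 50 states plus DC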
Alternatively, if we want to flag the duplicated values in a Series or duplicated rows in a DataFrame, we can use .duplicated(), which returns a boolean mask marking each entry that is a repeat of an earlier one.
df['st_name'].duplicated()
0 False
1 True
2 True
3 True
4 True
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: st_name, Length: 1708, dtype: bool
df.duplicated() # Our dataframe has no duplicate rows!
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Length: 1708, dtype: bool
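By default, .duplicated() flags every occurrence after the first. The keep parameter controls this behavior; a quick sketch:
df['st_name'].duplicated(keep='last') # flag all but the last occurrence of each state
df['st_name'].duplicated(keep=False) # flag every row whose state appears more than once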
If we want to drop all the duplicated values, we can use .drop_duplicates(). For a Series, this returns the values with duplicates removed (keeping the first occurrence of each); for a DataFrame, it does the same with rows.
df['st_name'].drop_duplicates()
0 AL
34 AK
68 AZ
102 AR
136 CA
170 CO
204 CT
238 DE
272 DC
306 FL
340 GA
374 HI
392 ID
426 IL
460 IN
494 IA
526 KS
558 KY
592 LA
626 ME
660 MD
694 MA
728 MI
762 MN
794 MS
828 MO
860 MT
894 NE
928 NV
962 NH
996 NJ
1030 NM
1064 NY
1098 NC
1132 ND
1166 OH
1200 OK
1234 OR
1268 PA
1302 RI
1336 SC
1370 SD
1404 TN
1438 TX
1472 UT
1506 VT
1540 VA
1574 WA
1608 WV
1642 WI
1676 WY
Name: st_name, dtype: object
df.drop_duplicates() # We don't drop anything as we don't have duplicates
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Missing Values#
Missing values are often stored as NaN values in a dataset. While our dataset doesn’t have any missing values, we can use .isna() to check for them.
df.isna()
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | False | False | False | False | False | False | False |
1704 | False | False | False | False | False | False | False |
1705 | False | False | False | False | False | False | False |
1706 | False | False | False | False | False | False | False |
1707 | False | False | False | False | False | False | False |
1708 rows × 7 columns
df.isna().any(axis = 0) # See which columns have NaN values
# Hint: read the documentation for df.any(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html
st_name False
year False
c_beer False
beer_tax False
btax_dollars False
population False
salestax False
dtype: bool
df.isna().any(axis = 1) # See which rows have NaN values
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Length: 1708, dtype: bool
print(df.isna().any(axis = 0).any()) # These can be used to check whether there are NaNs anywhere in the data
print(df.isna().any(axis = 1).any())
False
False
.isna() also works for Series.
df['st_name'].isna()
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].isna().any()
False
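Although this dataset is complete, real-world data rarely is. Once you’ve located NaNs, two common follow-ups are .dropna() and .fillna(). A minimal sketch on a small, hypothetical Series:
s = pd.Series([1.0, np.nan, 3.0])
s.dropna() # drop the missing entry
s.fillna(0.0) # or replace it with 0.0 instead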
String Methods#
String methods are a great tool for working with string data. They only work on Series: to use them, append .str to your Series to indicate that you want to use a string method, followed by the method itself. Some of the various string methods are given below, but again this list is not comprehensive.
len#
Gives the length of each string.
df['st_name'].str.len()
0 2
1 2
2 2
3 2
4 2
..
1703 2
1704 2
1705 2
1706 2
1707 2
Name: st_name, Length: 1708, dtype: int64
String slicing#
Like list slicing, but for strings. Review the Python prereqs if you’re unsure.
df['st_name'].str[0]
0 A
1 A
2 A
3 A
4 A
..
1703 W
1704 W
1705 W
1706 W
1707 W
Name: st_name, Length: 1708, dtype: object
contains#
Returns whether each string contains the given substring.
df['st_name'].str.contains('A')
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
replace#
Replaces a given substring with another substring.
df['st_name'].str.replace('A','a')
0 aL
1 aL
2 aL
3 aL
4 aL
..
1703 WY
1704 WY
1705 WY
1706 WY
1707 WY
Name: st_name, Length: 1708, dtype: object
upper and lower#
Converts the strings in a Series to uppercase or lowercase.
df['st_name'].str.upper()
0 AL
1 AL
2 AL
3 AL
4 AL
..
1703 WY
1704 WY
1705 WY
1706 WY
1707 WY
Name: st_name, Length: 1708, dtype: object
df['st_name'].str.lower()
0 al
1 al
2 al
3 al
4 al
..
1703 wy
1704 wy
1705 wy
1706 wy
1707 wy
Name: st_name, Length: 1708, dtype: object
startswith and endswith#
Returns whether each string starts or ends with the given substring.
df['st_name'].str.startswith('A')
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
df['st_name'].str.endswith('A')
0 False
1 False
2 False
3 False
4 False
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
split#
Splits each string on the given delimiter, returning a list of the resulting substrings. If the delimiter doesn’t appear in a string, no split is performed. This is best understood by looking at an example.
samp_series = df.loc[[68, 102, 1574, 592, 1302, 1642],'st_name']
samp_series
68 AZ
102 AR
1574 WA
592 LA
1302 RI
1642 WI
Name: st_name, dtype: object
samp_series.str.split('A')
68 [, Z]
102 [, R]
1574 [W, ]
592 [L, ]
1302 [RI]
1642 [WI]
Name: st_name, dtype: object
As you can see above, all the strings have been split on the letter A. Strings that don’t contain the letter A weren’t split.
If we want to return the output as a DataFrame, we can set the expand parameter to True, as shown below.
samp_series.str.split('A', expand = True)
 | 0 | 1 |
---|---|---|
68 |  | Z |
102 |  | R |
1574 | W |  |
592 | L |  |
1302 | RI | None |
1642 | WI | None |
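Since most string methods return boolean Series, they pair naturally with boolean indexing to filter rows. For example (a quick sketch on the full DataFrame):
df[df['st_name'].str.startswith('N')] # keep only rows whose state abbreviation starts with 'N'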
DateTime Objects#
DateTime objects are incredibly helpful for working with dates. They make it easy to perform calculations like adding or subtracting days and offsets, converting timezones, and sorting dates. We can convert a Series to datetime values using pd.to_datetime(), as done below.
pd.to_datetime(df['year'], format='%Y')
0 1970-01-01
1 1971-01-01
2 1972-01-01
3 1973-01-01
4 1974-01-01
...
1703 1999-01-01
1704 2000-01-01
1705 2001-01-01
1706 2002-01-01
1707 2003-01-01
Name: year, Length: 1708, dtype: datetime64[ns]
Notice how we specify the format of our dates to ensure the conversion happens correctly. We don’t always need to do this (especially if we have the year, month and day in a well-known format), but it is good practice to be explicit.
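The format string uses the standard strftime codes. For instance, to parse dates written as month/day/year (a hypothetical example, not from this dataset):
pd.to_datetime(pd.Series(['06/05/1970', '12/31/2003']), format='%m/%d/%Y')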
Let us convert the year column in the DataFrame to datetime and perform some calculations.
df['year'] = pd.to_datetime(df['year'], format='%Y')
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-01-01 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-01-01 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-01-01 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-01-01 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-01-01 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-01-01 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-01-01 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-01-01 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-01-01 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-01-01 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Notice how the dates all default to January 1st. Let’s say we know that the measurements were actually taken 4 days after the given date for each year. We can implement this in our DataFrame.
df['year'] = df['year'] + pd.DateOffset(days=4)
df
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-01-05 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-01-05 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-01-05 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-01-05 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-01-05 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-01-05 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-01-05 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-01-05 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-01-05 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-01-05 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Alternatively, if we instead know that the measurement was always taken on June 5th, we can specify that as well. Feel free to ignore the PerformanceWarning for now, but you can read more here if you’re interested.
df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
df
/var/folders/td/dtym05z13119cjklqb5tyd040000gn/T/ipykernel_89001/3513634716.py:1: PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex.
df['year'] = df['year'] + pd.DateOffset(month=6, day=5)
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-06-05 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971-06-05 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972-06-05 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973-06-05 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974-06-05 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999-06-05 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000-06-05 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001-06-05 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002-06-05 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003-06-05 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Note the difference between the day and the days parameters: days adds to the current day, while day replaces it. Similar differences exist for month vs months, year vs years, etc.
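To make the distinction concrete, here is each offset applied to a single Timestamp (the results are noted in comments):
ts = pd.Timestamp('2000-01-10')
ts + pd.DateOffset(days=4) # adds 4 days: Timestamp('2000-01-14 00:00:00')
ts + pd.DateOffset(day=4) # replaces the day: Timestamp('2000-01-04 00:00:00')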
Now, let’s look at how a datapoint is stored.
df.loc[0,'year']
Timestamp('1970-06-05 00:00:00')
At index 20, the year is 1990. Let’s subtract the dates.
df.loc[20,'year'] - df.loc[0,'year']
Timedelta('7305 days 00:00:00')
Interestingly, we can also subtract the years directly.
df.loc[20,'year'].year - df.loc[0,'year'].year
20
Let’s also try subtracting the months.
df.loc[20,'year'].month - df.loc[0,'year'].month
0
The months are subtracted directly, and since both dates were set to June, the difference is zero. To find the number of months between two dates, we instead subtract the years, multiply by 12, and add the difference in months. The calculation is done below.
(df.loc[20,'year'].year - df.loc[0,'year'].year) * 12 + (df.loc[20,'year'].month - df.loc[0,'year'].month)
240
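The same arithmetic works element-wise on the whole column via the .dt accessor, which exposes the year, month, and other components of a datetime Series. A sketch:
(df['year'].dt.year - df.loc[0, 'year'].year) * 12 + (df['year'].dt.month - df.loc[0, 'year'].month) # months since the first row, computed for every row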
Finally, using DateTime objects also helps us sort dates easily!
df.sort_values('year')
 | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax |
---|---|---|---|---|---|---|---|
0 | AL | 1970-06-05 | 33098 | 72.341130 | 2.3700 | 3450 | 4.00 |
170 | CO | 1970-06-05 | 42145 | 4.120698 | 0.1350 | 2224 | 3.00 |
592 | LA | 1970-06-05 | 68578 | 25.456755 | 0.8340 | 3650 | 2.00 |
1302 | RI | 1970-06-05 | 20360 | 5.686564 | 0.1863 | 951 | 5.00 |
860 | MT | 1970-06-05 | 17179 | 6.648059 | 0.2178 | 697 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... |
135 | AR | 2003-06-05 | 52045 | 3.507258 | 0.5449 | 2726 | 5.13 |
893 | MT | 2003-06-05 | 26100 | 2.008837 | 0.3121 | 918 | 0.00 |
1471 | TX | 2003-06-05 | 558837 | 2.867468 | 0.4455 | 22119 | 6.25 |
1675 | WI | 2003-06-05 | 151000 | 0.934582 | 0.1452 | 5472 | 5.00 |
1707 | WY | 2003-06-05 | 15535 | 0.289643 | 0.0450 | 501 | 4.00 |
1708 rows × 7 columns