import pandas as pd
import numpy as np
Selecting and Modifying Rows/Columns#
This subchapter goes into greater detail on how to select and modify rows and columns in pandas.
First, let us load the same dataset as in the first subchapter of the Pandas chapter. As a reminder, this dataset contains beer sales across US states. It is sourced from "Salience and Taxation: Theory and Evidence" by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:
- `st_name`: the state abbreviation
- `year`: the year the data was recorded
- `c_beer`: the quantity of beer consumed, in thousands of gallons
- `beer_tax`: the ad valorem tax, as a percentage
- `btax_dollars`: the excise tax, in dollars per case (24 cans) of beer
- `population`: the population of the state, in thousands
- `salestax`: the sales tax percentage
df = pd.read_csv('data/beer_tax.csv')
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Extracting Rows/Columns that Satisfy a Given Condition#
An important task when working with data is filtering: extracting all rows that satisfy a certain condition. Conditional selection achieves exactly this.
In the previous subchapter, we briefly gave an example of how we can use `.loc` to extract all rows that match certain values. To be more precise, we pass `.loc` or `.iloc` an array of True/False values, which determines which rows to include. This is very similar to our prior discussion of NumPy boolean indexing.
For example, let's say we want all the data for the state of Alabama. Then, we would want all rows where `st_name` is `'AL'`. We can generate a Series of True/False values corresponding to whether each row has a `st_name` of `'AL'` using the code below.
df['st_name'] == 'AL'
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
Now, to extract all rows where the state is Alabama, we can simply pass this boolean Series into the `.loc` operator.
df.loc[df['st_name'] == 'AL',:].head(5) # only showing the first 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
Similarly, we can also use the context-dependent `[]` operator.
df[df['st_name'] == 'AL'].tail(5) # only showing the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
29 | AL | 1999 | 93828 | 16.847755 | 2.37 | 4430 | 4.0 |
30 | AL | 2000 | 94530 | 16.299862 | 2.37 | 4452 | 4.0 |
31 | AL | 2001 | 92893 | 15.848877 | 2.37 | 4466 | 4.0 |
32 | AL | 2002 | 94732 | 15.602203 | 2.37 | 4479 | 4.0 |
33 | AL | 2003 | 96188 | 15.254544 | 2.37 | 4501 | 4.0 |
We can use the same technique on columns, for example to extract all columns with an integer dtype (as opposed to floats, which may contain decimals).
df.loc[:,df.dtypes == int]
year | c_beer | population | |
---|---|---|---|
0 | 1970 | 33098 | 3450 |
1 | 1971 | 37598 | 3497 |
2 | 1972 | 42719 | 3539 |
3 | 1973 | 46203 | 3580 |
4 | 1974 | 49769 | 3627 |
... | ... | ... | ... |
1703 | 1999 | 12423 | 492 |
1704 | 2000 | 12595 | 494 |
1705 | 2001 | 12808 | 494 |
1706 | 2002 | 13191 | 499 |
1707 | 2003 | 15535 | 501 |
1708 rows × 3 columns
Alternatively, if we wanted to extract all the rows where the index is less than 30, we could similarly use the `.iloc` operator.
df.index < 30
array([ True, True, True, ..., False, False, False])
df.iloc[df.index < 30,:].tail(5) # only showing the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
25 | AL | 1995 | 86991 | 18.417560 | 2.37 | 4297 | 4.0 |
26 | AL | 1996 | 89534 | 17.889332 | 2.37 | 4331 | 4.0 |
27 | AL | 1997 | 88807 | 17.488075 | 2.37 | 4368 | 4.0 |
28 | AL | 1998 | 92082 | 17.219852 | 2.37 | 4405 | 4.0 |
29 | AL | 1999 | 93828 | 16.847755 | 2.37 | 4430 | 4.0 |
As you can imagine, quickly and easily generating these boolean arrays/Series is very important for accessing our elements!
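As a quick illustration of why these boolean Series are so useful, here is a minimal sketch on a tiny stand-in DataFrame (a made-up toy, not the beer dataset): because True counts as 1, summing a mask counts the matching rows, and the same mask can then be passed to `.loc`.

```python
import pandas as pd

# A tiny stand-in DataFrame (not the beer dataset) to illustrate boolean masks.
toy = pd.DataFrame({
    "st_name": ["AL", "AL", "WY", "WY"],
    "c_beer": [33098, 37598, 12423, 12595],
})

mask = toy["st_name"] == "AL"   # boolean Series, one True/False per row
print(mask.sum())               # True counts as 1, so this counts matching rows -> 2
print(toy.loc[mask, "c_beer"].tolist())  # [33098, 37598]
```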
Logical Operators#
What if we have multiple conditions we want to filter on? Let's say we want to look at all the rows where `beer_tax` is less than 3 and `btax_dollars` is less than 0.5. We can generate boolean filters for both of those conditions.
df['beer_tax'] < 3
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: beer_tax, Length: 1708, dtype: bool
df['btax_dollars'] < 0.5
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: btax_dollars, Length: 1708, dtype: bool
How can we combine these two boolean arrays? Since we're interested in cases where both conditions are true, we can use the `&` operator, which keeps only the positions where the elements of both arrays are True.
The code is shown below. Note the parentheses around both comparisons; they are an absolutely crucial piece of syntax, because `&` binds more tightly than comparison operators like `<`.
df[(df['beer_tax'] < 3) & (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
77 | AZ | 1979 | 77684 | 2.936328 | 0.180 | 2636 | 4.0 |
78 | AZ | 1980 | 79660 | 2.587105 | 0.180 | 2736 | 4.0 |
79 | AZ | 1981 | 86283 | 2.345187 | 0.180 | 2810 | 4.0 |
80 | AZ | 1982 | 86249 | 2.209093 | 0.180 | 2890 | 4.0 |
81 | AZ | 1983 | 83067 | 2.140336 | 0.180 | 2969 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
628 rows × 7 columns
Alternatively, if we were looking for the rows where either `beer_tax` is less than 3 or `btax_dollars` is less than 0.5 (or both), we could use the logical `|` operator, which keeps the positions where an element from either array is True.
df[(df['beer_tax'] < 3) | (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
68 | AZ | 1970 | 38604 | 5.494264 | 0.180 | 1795 | 3.0 |
69 | AZ | 1971 | 41837 | 5.263641 | 0.180 | 1896 | 3.0 |
70 | AZ | 1972 | 47949 | 5.099939 | 0.180 | 2008 | 3.0 |
71 | AZ | 1973 | 53380 | 4.801294 | 0.180 | 2124 | 3.0 |
72 | AZ | 1974 | 58188 | 4.324086 | 0.180 | 2223 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1144 rows × 7 columns
What about cases where either `beer_tax` is less than 3 or `btax_dollars` is less than 0.5, but not both? Here we can use the `^` (exclusive or) operator, which keeps the positions where exactly one of the two elements is True.
df[(df['beer_tax'] < 3) ^ (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
68 | AZ | 1970 | 38604 | 5.494264 | 0.1800 | 1795 | 3.0 |
69 | AZ | 1971 | 41837 | 5.263641 | 0.1800 | 1896 | 3.0 |
70 | AZ | 1972 | 47949 | 5.099939 | 0.1800 | 2008 | 3.0 |
71 | AZ | 1973 | 53380 | 4.801294 | 0.1800 | 2124 | 3.0 |
72 | AZ | 1974 | 58188 | 4.324086 | 0.1800 | 2223 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1644 | WI | 1972 | 128725 | 4.113951 | 0.1452 | 4502 | 4.0 |
1645 | WI | 1973 | 133792 | 3.873043 | 0.1452 | 4524 | 4.0 |
1646 | WI | 1974 | 136807 | 3.488096 | 0.1452 | 4546 | 4.0 |
1647 | WI | 1975 | 146350 | 3.196341 | 0.1452 | 4579 | 4.0 |
1648 | WI | 1976 | 142811 | 3.022199 | 0.1452 | 4596 | 4.0 |
516 rows × 7 columns
Notice how the number of rows returned with the `&` operator plus the number of rows returned with the `^` operator sums to the number of rows returned with the `|` operator (628 + 516 = 1144).
Finally, what if you're interested in the cases where something isn't true? For example, what if you want all the rows where `st_name` isn't `'AL'`? Then you can use the `~` operator, which represents the negation of a boolean array (it converts all True values to False and all False values to True).
df['st_name'] == 'AL'
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
~(df['st_name'] == 'AL')
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: st_name, Length: 1708, dtype: bool
df[~(df['st_name'] == 'AL')]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
34 | AK | 1970 | 5372 | 13.735660 | 0.5625 | 304 | 0.0 |
35 | AK | 1971 | 6336 | 13.159102 | 0.5625 | 316 | 0.0 |
36 | AK | 1972 | 6038 | 12.749847 | 0.5625 | 324 | 0.0 |
37 | AK | 1973 | 6453 | 12.003234 | 0.5625 | 331 | 0.0 |
38 | AK | 1974 | 7598 | 10.810215 | 0.5625 | 341 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.0450 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.0450 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.0450 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.0450 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.0450 | 501 | 4.0 |
1674 rows × 7 columns
Here’s a table summarizing all the different logical operators for combining boolean arrays.

Symbol | Usage | Meaning
---|---|---
`~` | `~p` | Returns the negation of `p`
`\|` | `p \| q` | `p` OR `q`
`&` | `p & q` | `p` AND `q`
`^` | `p ^ q` | `p` XOR `q` (exclusive or)
Multiple Labels#
If you want to keep all rows where a certain column's value is in a list of given values, you can use the `.isin()` method. This is logically equivalent to chaining the `|` operator multiple times, but it's easier. For example, below we get all rows corresponding to the states of California, Florida, Texas, and Nebraska.
df['st_name'].isin(['CA','FL','TX','NE'])
1707 False
1706 False
1705 False
1704 False
1703 False
...
4 False
3 False
2 False
1 False
0 False
Name: st_name, Length: 1708, dtype: bool
df[df['st_name'].isin(['CA','FL','TX','NE'])]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1571 | CA | 1970 | 363645 | 2.747132 | 0.0900 | 20023 | 5.000 |
1570 | CA | 1971 | 380397 | 2.631820 | 0.0900 | 20346 | 5.000 |
1569 | CA | 1972 | 401928 | 2.549970 | 0.0900 | 20585 | 5.000 |
1568 | CA | 1973 | 417463 | 2.400647 | 0.0900 | 20869 | 5.167 |
1567 | CA | 1974 | 464237 | 2.162043 | 0.0900 | 21174 | 5.250 |
... | ... | ... | ... | ... | ... | ... | ... |
240 | TX | 1999 | 546353 | 3.166952 | 0.4455 | 20558 | 6.250 |
239 | TX | 2000 | 556051 | 3.063962 | 0.4455 | 20949 | 6.250 |
238 | TX | 2001 | 557425 | 2.979188 | 0.4455 | 21341 | 6.250 |
237 | TX | 2002 | 563150 | 2.932819 | 0.4455 | 21737 | 6.250 |
236 | TX | 2003 | 558837 | 2.867468 | 0.4455 | 22119 | 6.250 |
136 rows × 7 columns
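The claimed equivalence between `.isin()` and chained `|` comparisons can be checked on a toy DataFrame (made-up values, not the beer dataset):

```python
import pandas as pd

toy = pd.DataFrame({"st_name": ["CA", "FL", "AL", "TX", "WY"]})

via_isin = toy[toy["st_name"].isin(["CA", "FL", "TX"])]
via_or = toy[(toy["st_name"] == "CA") | (toy["st_name"] == "FL") | (toy["st_name"] == "TX")]

# Both selections keep exactly the same rows.
print(via_isin.equals(via_or))  # True
```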
Editing Rows#
We can also use conditional selection to edit certain rows. For example, let's say we found out we made an error measuring `beer_tax` for the state of Alabama, and the true beer tax is actually 0.1 higher than what we measured. We can easily use boolean filtering to select the rows where we made this mistake and fix them!
df.loc[df['st_name']=='AL', 'beer_tax']
0 72.341130
1 69.304600
2 67.149190
3 63.217026
4 56.933796
5 52.171677
6 49.329280
7 46.317430
8 43.049637
9 38.661655
10 34.063545
11 30.878284
12 29.086386
13 28.181087
14 27.014784
15 26.085838
16 25.609820
17 24.708065
18 23.726425
19 22.635775
20 21.475412
21 20.608194
22 20.005960
23 19.424470
24 18.939516
25 18.417560
26 17.889332
27 17.488075
28 17.219852
29 16.847755
30 16.299862
31 15.848877
32 15.602203
33 15.254544
Name: beer_tax, dtype: float64
df.loc[df['st_name']=='AL', 'beer_tax'] + 0.1
0 72.441130
1 69.404600
2 67.249190
3 63.317026
4 57.033796
5 52.271677
6 49.429280
7 46.417430
8 43.149637
9 38.761655
10 34.163545
11 30.978284
12 29.186386
13 28.281087
14 27.114784
15 26.185838
16 25.709820
17 24.808065
18 23.826425
19 22.735775
20 21.575412
21 20.708194
22 20.105960
23 19.524470
24 19.039516
25 18.517560
26 17.989332
27 17.588075
28 17.319852
29 16.947755
30 16.399862
31 15.948877
32 15.702203
33 15.354544
Name: beer_tax, dtype: float64
df.loc[df['st_name']=='AL', 'beer_tax'] = df.loc[df['st_name']=='AL', 'beer_tax'] + 0.1
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.441130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.404600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.249190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.317026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 57.033796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
It’s that simple! However, since our data was (presumably) correctly measured, let us revert this change.
df.loc[df['st_name']=='AL', 'beer_tax'] = df.loc[df['st_name']=='AL', 'beer_tax'] - 0.1
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
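The same pattern works on any DataFrame; here is a minimal sketch on a toy DataFrame (made-up values, not the beer dataset) that bumps `beer_tax` by 0.1 only for the `'AL'` rows, leaving other rows untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "st_name": ["AL", "AL", "WY"],
    "beer_tax": [72.3, 69.3, 0.32],
})

# Add 0.1 only to the rows where st_name is 'AL'; the WY row is untouched.
toy.loc[toy["st_name"] == "AL", "beer_tax"] += 0.1
print(toy)
```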
Modifying Columns and Indices#
Now, we move on to modifying the columns in our DataFrames. The first task we will consider is how to add a new column.
Adding Columns#
Let’s say we want to add a new column containing the total percentage taxed, combining `beer_tax` and `salestax`. Both of these percentage taxes compound, so we will need to be careful with the math! Below, we calculate a Series with the combined tax, as a percent.
((df['beer_tax'] / 100 + 1) * (df['salestax'] / 100 + 1) - 1) * 100
0 79.234775
1 76.076784
2 73.835158
3 69.745707
4 63.211148
...
1703 4.332690
1704 4.321871
1705 4.312965
1706 4.308094
1707 4.301229
Length: 1708, dtype: float64
Now, we want to assign this Series to a new column named `total_percent_tax` in our DataFrame. The syntax follows `df[column_name] = series_or_array`, and is shown below.
df['total_percent_tax'] = ((df['beer_tax'] / 100 + 1) * (df['salestax'] / 100 + 1) - 1) * 100
df.head()
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 | 63.211148 |
Voila, our new column has been added!
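The compounding arithmetic above is easy to sanity-check on made-up numbers. In this minimal sketch (a toy DataFrame, not the beer dataset), a 10% tax compounded with a 4% tax gives roughly 14.4%, not 14%.

```python
import pandas as pd

toy = pd.DataFrame({"beer_tax": [10.0, 20.0], "salestax": [4.0, 5.0]})

# Compounding two percentage taxes: convert each to a multiplier,
# multiply the multipliers, then convert back to a percentage.
toy["total_percent_tax"] = ((toy["beer_tax"] / 100 + 1) * (toy["salestax"] / 100 + 1) - 1) * 100
print(toy["total_percent_tax"].tolist())  # approximately [14.4, 26.0]
```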
Renaming Columns#
Looking at the table above, `c_beer` feels like a confusing name. Instead, let us change it to `beer_consumption`, a more informative name! To do so, we can use the DataFrame `rename` method, as shown below.
df.rename(columns={"c_beer":"beer_consumption"})
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
As we can see, the `columns` parameter of the `rename` method takes a dictionary whose keys are the old column names and whose values are the new column names.
Note, however, that the original DataFrame `df` has not changed yet. Most `pandas` methods return a modified copy of the DataFrame rather than altering the original. To change the original DataFrame, you can either reassign the new DataFrame to `df` or set the `inplace` parameter of the `rename` method to `True`.
df # original dataframe has not changed
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
df = df.rename(columns={"c_beer":"beer_consumption"}) # you can manually reassign the `df` variable
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
df.rename(columns={"c_beer":"beer_consumption"}, inplace = True) # or you can use the inplace parameter
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
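The copy-versus-in-place behavior of `rename` can be seen directly on a toy DataFrame (a minimal sketch, not the beer dataset):

```python
import pandas as pd

toy = pd.DataFrame({"c_beer": [1, 2]})

# Without inplace=True, rename returns a modified copy.
renamed = toy.rename(columns={"c_beer": "beer_consumption"})
print(list(toy.columns))      # ['c_beer']  (original unchanged)
print(list(renamed.columns))  # ['beer_consumption']

# With inplace=True, the original DataFrame itself is modified.
toy.rename(columns={"c_beer": "beer_consumption"}, inplace=True)
print(list(toy.columns))      # ['beer_consumption']
```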
Dropping Columns#
Finally, let's say we want to drop a column. For this, we can use the DataFrame `drop` method. For example, we drop the `salestax` column below.
df.drop(columns=["salestax"])
st_name | year | beer_consumption | beer_tax | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.301229 |
1708 rows × 7 columns
Once again, the `drop` method has a `columns` parameter that takes a list of the columns to drop (no need for a dictionary here, as there are no key-value pairs).
Alternatively, you can pass the list of columns as the first argument and use the `axis` parameter to specify that they represent column names, as done below. We encourage you to read the documentation if you're interested in further detail!
df.drop(["salestax","beer_tax"], axis = 'columns')
st_name | year | beer_consumption | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.045 | 501 | 4.301229 |
1708 rows × 6 columns
Finally, once again the `drop` method does not modify the original DataFrame. To modify it, you can reassign the result to the original variable or use the `inplace` parameter.
df = df.drop(columns=["salestax"]) # reassigning to df
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.301229 |
1708 rows × 7 columns
df.drop(["salestax","beer_tax"], axis = 'columns', inplace = True) # using inplace parameter
df
st_name | year | beer_consumption | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.045 | 501 | 4.301229 |
1708 rows × 6 columns
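The two equivalent calling styles of `drop`, and the fact that neither modifies the original without `inplace=True` or reassignment, can be sketched on a toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# These two calls are equivalent ways of dropping columns.
print(list(toy.drop(columns=["b"]).columns))               # ['a', 'c']
print(list(toy.drop(["b", "c"], axis="columns").columns))  # ['a']

# The original DataFrame is untouched by both calls.
print(list(toy.columns))  # ['a', 'b', 'c']
```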
Modifying the Index#
There are three main ways to modify the index of a DataFrame, as discussed below.
`.set_index()` sets the index of the DataFrame to a particular column.
df.set_index('st_name')
year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|
st_name | ||||||
AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... |
WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 6 columns
`.reset_index()` moves the current index into a column and replaces it with a default integer numbering from 0 up to the last row of the DataFrame.
df.reset_index()
index | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|---|
0 | 0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | 1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | 2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | 3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | 4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | 1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | 1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | 1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | 1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | 1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 8 columns
Finally, you can manually set the index to something else by simply reassigning the `index` attribute of the DataFrame or Series.
df.index = np.arange(len(df)-1, -1, -1)
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1707 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1706 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
1705 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
1704 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
1703 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
4 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
3 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
2 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
0 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
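All three index-modification techniques can be sketched together on a toy DataFrame (made-up values, not the beer dataset): promote a column with `set_index`, undo it with `reset_index`, and assign the `index` attribute directly.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"st_name": ["AL", "WY"], "year": [1970, 2003]})

# set_index promotes a column to the index (returning a copy by default).
indexed = toy.set_index("st_name")
print(list(indexed.index))      # ['AL', 'WY']

# reset_index moves the index back into a column.
restored = indexed.reset_index()
print(list(restored.columns))   # ['st_name', 'year']

# Or assign an index directly, e.g. a reversed numbering.
toy.index = np.arange(len(toy) - 1, -1, -1)
print(list(toy.index))          # [1, 0]
```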