import pandas as pd
import numpy as np
Selecting and Modifying Rows/Columns#
This subchapter goes into greater detail on how to select and modify rows and columns in pandas.
First, let us load the same dataset as in the first subchapter of the Pandas chapter. As a reminder, this dataset contains beer sales across US states. It is sourced from "Salience and Taxation: Theory and Evidence" by Chetty, Looney, and Kroft (AER 2009), and it includes 7 columns:
- `st_name`: the state abbreviation
- `year`: the year the data was recorded
- `c_beer`: the quantity of beer consumed, in thousands of gallons
- `beer_tax`: the ad valorem tax, as a percentage
- `btax_dollars`: the excise tax, in dollars per case (24 cans) of beer
- `population`: the population of the state, in thousands
- `salestax`: the sales tax percentage
df = pd.read_csv('data/beer_tax.csv')
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
Extracting Rows/Columns that Satisfy a Given Condition#
An important task when working with data is filtering: extracting all rows that satisfy a certain condition. Conditional selection achieves exactly this.
In the previous subchapter, we briefly gave an example of how we can use `.loc` to extract all rows that match certain values. To be more precise, we pass `.loc` or `.iloc` an array of True/False values, which determines which rows to include. This is very similar to our prior discussion of NumPy boolean indexing.
For example, let's say we want all the data for the state of Alabama. Then, we would want all rows where `st_name` is `'AL'`. We can generate a Series of True/False values corresponding to whether each row has a `st_name` of `'AL'` using the code below.
df['st_name'] == 'AL'
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
Now, to extract all rows where the state is Alabama, we can simply pass this boolean Series into the `.loc` operator.
df.loc[df['st_name'] == 'AL',:].head(5) # only showing the first 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 |
Similarly, we can also use the context-dependent `[]` operator.
df[df['st_name'] == 'AL'].tail(5) # only showing the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
29 | AL | 1999 | 93828 | 16.847755 | 2.37 | 4430 | 4.0 |
30 | AL | 2000 | 94530 | 16.299862 | 2.37 | 4452 | 4.0 |
31 | AL | 2001 | 92893 | 15.848877 | 2.37 | 4466 | 4.0 |
32 | AL | 2002 | 94732 | 15.602203 | 2.37 | 4479 | 4.0 |
33 | AL | 2003 | 96188 | 15.254544 | 2.37 | 4501 | 4.0 |
We can use the same technique on columns, for example to extract all columns with an integer dtype (as opposed to floats, which may contain decimals).
df.loc[:,df.dtypes == int]
year | c_beer | population | |
---|---|---|---|
0 | 1970 | 33098 | 3450 |
1 | 1971 | 37598 | 3497 |
2 | 1972 | 42719 | 3539 |
3 | 1973 | 46203 | 3580 |
4 | 1974 | 49769 | 3627 |
... | ... | ... | ... |
1703 | 1999 | 12423 | 492 |
1704 | 2000 | 12595 | 494 |
1705 | 2001 | 12808 | 494 |
1706 | 2002 | 13191 | 499 |
1707 | 2003 | 15535 | 501 |
1708 rows × 3 columns
Alternatively, if we wanted to extract all the rows where the index is less than 30, we could similarly use the `.iloc` operator.
df.index < 30
array([ True, True, True, ..., False, False, False])
df.iloc[df.index < 30,:].tail(5) # only showing the last 5 rows
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
25 | AL | 1995 | 86991 | 18.417560 | 2.37 | 4297 | 4.0 |
26 | AL | 1996 | 89534 | 17.889332 | 2.37 | 4331 | 4.0 |
27 | AL | 1997 | 88807 | 17.488075 | 2.37 | 4368 | 4.0 |
28 | AL | 1998 | 92082 | 17.219852 | 2.37 | 4405 | 4.0 |
29 | AL | 1999 | 93828 | 16.847755 | 2.37 | 4430 | 4.0 |
As you can imagine, quickly and easily generating these boolean arrays/Series is very important for accessing our elements!
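As a quick illustration of why these boolean Series are so useful, here is a minimal sketch on a tiny stand-in DataFrame (a made-up toy, not the beer dataset): because True counts as 1, summing a mask counts the matching rows, and the same mask can then be passed to `.loc`.

```python
import pandas as pd

# A tiny stand-in DataFrame (not the beer dataset) to illustrate boolean masks.
toy = pd.DataFrame({
    "st_name": ["AL", "AL", "WY", "WY"],
    "c_beer": [33098, 37598, 12423, 12595],
})

mask = toy["st_name"] == "AL"   # boolean Series, one True/False per row
print(mask.sum())               # True counts as 1, so this counts matching rows -> 2
print(toy.loc[mask, "c_beer"].tolist())  # [33098, 37598]
```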
Logical Operators#
What if we have multiple conditions we want to filter on? Let's say we want to look at all the rows where `beer_tax` is less than 3 and `btax_dollars` is less than 0.5. We can generate boolean filters for both of those conditions.
df['beer_tax'] < 3
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: beer_tax, Length: 1708, dtype: bool
df['btax_dollars'] < 0.5
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: btax_dollars, Length: 1708, dtype: bool
How can we combine these two boolean arrays? Since we're interested in cases where both conditions are true, we can use the `&` operator, which keeps only the positions where the elements of both arrays are True.
The code is shown below. Note the parentheses around both comparisons; they are an absolutely crucial piece of syntax, because `&` binds more tightly than comparison operators like `<`.
df[(df['beer_tax'] < 3) & (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
77 | AZ | 1979 | 77684 | 2.936328 | 0.180 | 2636 | 4.0 |
78 | AZ | 1980 | 79660 | 2.587105 | 0.180 | 2736 | 4.0 |
79 | AZ | 1981 | 86283 | 2.345187 | 0.180 | 2810 | 4.0 |
80 | AZ | 1982 | 86249 | 2.209093 | 0.180 | 2890 | 4.0 |
81 | AZ | 1983 | 83067 | 2.140336 | 0.180 | 2969 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
628 rows × 7 columns
Alternatively, if we were looking for the rows where either `beer_tax` is less than 3 or `btax_dollars` is less than 0.5 (or both), we could use the logical `|` operator, which keeps the positions where an element from either array is True.
df[(df['beer_tax'] < 3) | (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
68 | AZ | 1970 | 38604 | 5.494264 | 0.180 | 1795 | 3.0 |
69 | AZ | 1971 | 41837 | 5.263641 | 0.180 | 1896 | 3.0 |
70 | AZ | 1972 | 47949 | 5.099939 | 0.180 | 2008 | 3.0 |
71 | AZ | 1973 | 53380 | 4.801294 | 0.180 | 2124 | 3.0 |
72 | AZ | 1974 | 58188 | 4.324086 | 0.180 | 2223 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1144 rows × 7 columns
What about cases where either `beer_tax` is less than 3 or `btax_dollars` is less than 0.5, but not both? Here we can use the `^` (exclusive or) operator, which keeps the positions where exactly one of the two elements is True.
df[(df['beer_tax'] < 3) ^ (df['btax_dollars'] < 0.5)]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
68 | AZ | 1970 | 38604 | 5.494264 | 0.1800 | 1795 | 3.0 |
69 | AZ | 1971 | 41837 | 5.263641 | 0.1800 | 1896 | 3.0 |
70 | AZ | 1972 | 47949 | 5.099939 | 0.1800 | 2008 | 3.0 |
71 | AZ | 1973 | 53380 | 4.801294 | 0.1800 | 2124 | 3.0 |
72 | AZ | 1974 | 58188 | 4.324086 | 0.1800 | 2223 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1644 | WI | 1972 | 128725 | 4.113951 | 0.1452 | 4502 | 4.0 |
1645 | WI | 1973 | 133792 | 3.873043 | 0.1452 | 4524 | 4.0 |
1646 | WI | 1974 | 136807 | 3.488096 | 0.1452 | 4546 | 4.0 |
1647 | WI | 1975 | 146350 | 3.196341 | 0.1452 | 4579 | 4.0 |
1648 | WI | 1976 | 142811 | 3.022199 | 0.1452 | 4596 | 4.0 |
516 rows × 7 columns
Notice how the number of rows returned with the `&` operator plus the number of rows returned with the `^` operator sums to the number of rows returned with the `|` operator (628 + 516 = 1144).
Finally, what if you're interested in the cases where something isn't true? For example, what if you want all the rows where `st_name` isn't `'AL'`? Then you can use the `~` operator, which represents the negation of a boolean array (it converts all True values to False and all False values to True).
df['st_name'] == 'AL'
0 True
1 True
2 True
3 True
4 True
...
1703 False
1704 False
1705 False
1706 False
1707 False
Name: st_name, Length: 1708, dtype: bool
~(df['st_name'] == 'AL')
0 False
1 False
2 False
3 False
4 False
...
1703 True
1704 True
1705 True
1706 True
1707 True
Name: st_name, Length: 1708, dtype: bool
df[~(df['st_name'] == 'AL')]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
34 | AK | 1970 | 5372 | 13.735660 | 0.5625 | 304 | 0.0 |
35 | AK | 1971 | 6336 | 13.159102 | 0.5625 | 316 | 0.0 |
36 | AK | 1972 | 6038 | 12.749847 | 0.5625 | 324 | 0.0 |
37 | AK | 1973 | 6453 | 12.003234 | 0.5625 | 331 | 0.0 |
38 | AK | 1974 | 7598 | 10.810215 | 0.5625 | 341 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.0450 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.0450 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.0450 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.0450 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.0450 | 501 | 4.0 |
1674 rows × 7 columns
Here’s a table summarizing all the different logical operators for combining boolean arrays.

Symbol | Usage | Meaning
---|---|---
`~` | `~p` | Returns the negation of `p`
`\|` | `p \| q` | `p` OR `q`
`&` | `p & q` | `p` AND `q`
`^` | `p ^ q` | `p` XOR `q` (exclusive or)
Multiple Labels#
If you want to keep all rows where a certain column's value is in a list of given values, you can use the `.isin()` method. This is logically equivalent to chaining the `|` operator multiple times, but it's easier. For example, below we get all rows corresponding to the states of California, Florida, Texas, and Nebraska.
df['st_name'].isin(['CA','FL','TX','NE'])
1707 False
1706 False
1705 False
1704 False
1703 False
...
4 False
3 False
2 False
1 False
0 False
Name: st_name, Length: 1708, dtype: bool
df[df['st_name'].isin(['CA','FL','TX','NE'])]
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1571 | CA | 1970 | 363645 | 2.747132 | 0.0900 | 20023 | 5.000 |
1570 | CA | 1971 | 380397 | 2.631820 | 0.0900 | 20346 | 5.000 |
1569 | CA | 1972 | 401928 | 2.549970 | 0.0900 | 20585 | 5.000 |
1568 | CA | 1973 | 417463 | 2.400647 | 0.0900 | 20869 | 5.167 |
1567 | CA | 1974 | 464237 | 2.162043 | 0.0900 | 21174 | 5.250 |
... | ... | ... | ... | ... | ... | ... | ... |
240 | TX | 1999 | 546353 | 3.166952 | 0.4455 | 20558 | 6.250 |
239 | TX | 2000 | 556051 | 3.063962 | 0.4455 | 20949 | 6.250 |
238 | TX | 2001 | 557425 | 2.979188 | 0.4455 | 21341 | 6.250 |
237 | TX | 2002 | 563150 | 2.932819 | 0.4455 | 21737 | 6.250 |
236 | TX | 2003 | 558837 | 2.867468 | 0.4455 | 22119 | 6.250 |
136 rows × 7 columns
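The claimed equivalence between `.isin()` and chained `|` comparisons can be checked on a toy DataFrame (made-up values, not the beer dataset):

```python
import pandas as pd

toy = pd.DataFrame({"st_name": ["CA", "FL", "AL", "TX", "WY"]})

via_isin = toy[toy["st_name"].isin(["CA", "FL", "TX"])]
via_or = toy[(toy["st_name"] == "CA") | (toy["st_name"] == "FL") | (toy["st_name"] == "TX")]

# Both selections keep exactly the same rows.
print(via_isin.equals(via_or))  # True
```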
Editing Rows#
We can also use conditional selection to edit certain rows. For example, let's say we found out we made an error measuring `beer_tax` for the state of Alabama, and the true beer tax is actually 0.1 higher than what we measured. We can easily use boolean filtering to select the rows where we made this mistake and fix them!
df.loc[df['st_name']=='AL', 'beer_tax']
0 72.341130
1 69.304600
2 67.149190
3 63.217026
4 56.933796
5 52.171677
6 49.329280
7 46.317430
8 43.049637
9 38.661655
10 34.063545
11 30.878284
12 29.086386
13 28.181087
14 27.014784
15 26.085838
16 25.609820
17 24.708065
18 23.726425
19 22.635775
20 21.475412
21 20.608194
22 20.005960
23 19.424470
24 18.939516
25 18.417560
26 17.889332
27 17.488075
28 17.219852
29 16.847755
30 16.299862
31 15.848877
32 15.602203
33 15.254544
Name: beer_tax, dtype: float64
df.loc[df['st_name']=='AL', 'beer_tax'] + 0.1
0 72.441130
1 69.404600
2 67.249190
3 63.317026
4 57.033796
5 52.271677
6 49.429280
7 46.417430
8 43.149637
9 38.761655
10 34.163545
11 30.978284
12 29.186386
13 28.281087
14 27.114784
15 26.185838
16 25.709820
17 24.808065
18 23.826425
19 22.735775
20 21.575412
21 20.708194
22 20.105960
23 19.524470
24 19.039516
25 18.517560
26 17.989332
27 17.588075
28 17.319852
29 16.947755
30 16.399862
31 15.948877
32 15.702203
33 15.354544
Name: beer_tax, dtype: float64
df.loc[df['st_name']=='AL', 'beer_tax'] = df.loc[df['st_name']=='AL', 'beer_tax'] + 0.1
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.441130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.404600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.249190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.317026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 57.033796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
It’s that simple! However, since our data was (presumably) correctly measured, let us revert this change.
df.loc[df['st_name']=='AL', 'beer_tax'] = df.loc[df['st_name']=='AL', 'beer_tax'] - 0.1
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
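The same pattern works on any DataFrame; here is a minimal sketch on a toy DataFrame (made-up values, not the beer dataset) that bumps `beer_tax` by 0.1 only for the `'AL'` rows, leaving other rows untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "st_name": ["AL", "AL", "WY"],
    "beer_tax": [72.3, 69.3, 0.32],
})

# Add 0.1 only to the rows where st_name is 'AL'; the WY row is untouched.
toy.loc[toy["st_name"] == "AL", "beer_tax"] += 0.1
print(toy)
```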
Modifying Columns and Indices#
Now, we move on to modifying the columns in our DataFrames. The first task we will consider is how to add a new column.
Adding Columns#
Let’s say we want to add a new column containing the total percentage taxed, combining `beer_tax` and `salestax`. Both of these percentage taxes compound, so we will need to be careful with the math! Below, we calculate a Series with the combined tax, as a percent.
((df['beer_tax'] / 100 + 1) * (df['salestax'] / 100 + 1) - 1) * 100
0 79.234775
1 76.076784
2 73.835158
3 69.745707
4 63.211148
...
1703 4.332690
1704 4.321871
1705 4.312965
1706 4.308094
1707 4.301229
Length: 1708, dtype: float64
Now, we want to assign this Series to a new column named `total_percent_tax` in our DataFrame. The syntax follows `df[column_name] = series_or_array`, and is shown below.
df['total_percent_tax'] = ((df['beer_tax'] / 100 + 1) * (df['salestax'] / 100 + 1) - 1) * 100
df.head()
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.37 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.37 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.37 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.37 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.37 | 3627 | 4.0 | 63.211148 |
Voila, our new column has been added!
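The compounding arithmetic above is easy to sanity-check on made-up numbers. In this minimal sketch (a toy DataFrame, not the beer dataset), a 10% tax compounded with a 4% tax gives roughly 14.4%, not 14%.

```python
import pandas as pd

toy = pd.DataFrame({"beer_tax": [10.0, 20.0], "salestax": [4.0, 5.0]})

# Compounding two percentage taxes: convert each to a multiplier,
# multiply the multipliers, then convert back to a percentage.
toy["total_percent_tax"] = ((toy["beer_tax"] / 100 + 1) * (toy["salestax"] / 100 + 1) - 1) * 100
print(toy["total_percent_tax"].tolist())  # approximately [14.4, 26.0]
```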
Renaming Columns#
Looking at the table above, `c_beer` feels like a confusing name. Instead, let us change it to `beer_consumption`, a more informative name! To do so, we can use the DataFrame `rename` method, as shown below.
df.rename(columns={"c_beer":"beer_consumption"})
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
As we can see, the `columns` parameter of the `rename` method takes a dictionary whose keys are the old column names and whose values are the new column names.
Note, however, that the original DataFrame `df` has not changed yet. Most `pandas` methods return a modified copy of the DataFrame rather than altering the original. To change the original DataFrame, you can either reassign the new DataFrame to `df` or set the `inplace` parameter of the `rename` method to `True`.
df # original dataframe has not changed
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
df = df.rename(columns={"c_beer":"beer_consumption"}) # you can manually reassign the `df` variable
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
df.rename(columns={"c_beer":"beer_consumption"}, inplace = True) # or you can use the inplace parameter
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | salestax | total_percent_tax | |
---|---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 | 4.301229 |
1708 rows × 8 columns
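The copy-versus-in-place behavior of `rename` can be seen directly on a toy DataFrame (a minimal sketch, not the beer dataset):

```python
import pandas as pd

toy = pd.DataFrame({"c_beer": [1, 2]})

# Without inplace=True, rename returns a modified copy.
renamed = toy.rename(columns={"c_beer": "beer_consumption"})
print(list(toy.columns))      # ['c_beer']  (original unchanged)
print(list(renamed.columns))  # ['beer_consumption']

# With inplace=True, the original DataFrame itself is modified.
toy.rename(columns={"c_beer": "beer_consumption"}, inplace=True)
print(list(toy.columns))      # ['beer_consumption']
```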
Dropping Columns#
Finally, let's say we want to drop a column. For this, we can use the DataFrame `drop` method. For example, we drop the `salestax` column below.
df.drop(columns=["salestax"])
st_name | year | beer_consumption | beer_tax | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.301229 |
1708 rows × 7 columns
Once again, the `drop` method has a `columns` parameter that takes a list of the columns to drop (no need for a dictionary here, as there are no key-value pairs).
Alternatively, you can pass the list of columns as the first argument and use the `axis` parameter to specify that they represent column names, as done below. We encourage you to read the documentation if you're interested in further detail!
df.drop(["salestax","beer_tax"], axis = 'columns')
st_name | year | beer_consumption | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.045 | 501 | 4.301229 |
1708 rows × 6 columns
Finally, once again the `drop` method does not modify the original DataFrame. To modify it, you can reassign the result to the original variable or use the `inplace` parameter.
df = df.drop(columns=["salestax"]) # reassigning to df
df
st_name | year | beer_consumption | beer_tax | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.301229 |
1708 rows × 7 columns
df.drop(["salestax","beer_tax"], axis = 'columns', inplace = True) # using inplace parameter
df
st_name | year | beer_consumption | btax_dollars | population | total_percent_tax | |
---|---|---|---|---|---|---|
0 | AL | 1970 | 33098 | 2.370 | 3450 | 79.234775 |
1 | AL | 1971 | 37598 | 2.370 | 3497 | 76.076784 |
2 | AL | 1972 | 42719 | 2.370 | 3539 | 73.835158 |
3 | AL | 1973 | 46203 | 2.370 | 3580 | 69.745707 |
4 | AL | 1974 | 49769 | 2.370 | 3627 | 63.211148 |
... | ... | ... | ... | ... | ... | ... |
1703 | WY | 1999 | 12423 | 0.045 | 492 | 4.332690 |
1704 | WY | 2000 | 12595 | 0.045 | 494 | 4.321871 |
1705 | WY | 2001 | 12808 | 0.045 | 494 | 4.312965 |
1706 | WY | 2002 | 13191 | 0.045 | 499 | 4.308094 |
1707 | WY | 2003 | 15535 | 0.045 | 501 | 4.301229 |
1708 rows × 6 columns
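The two equivalent calling styles of `drop`, and the fact that neither modifies the original without `inplace=True` or reassignment, can be sketched on a toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# These two calls are equivalent ways of dropping columns.
print(list(toy.drop(columns=["b"]).columns))               # ['a', 'c']
print(list(toy.drop(["b", "c"], axis="columns").columns))  # ['a']

# The original DataFrame is untouched by both calls.
print(list(toy.columns))  # ['a', 'b', 'c']
```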
Modifying the Index#
There are three main ways to modify the index of a DataFrame, as discussed below.
`.set_index()` sets the index of the DataFrame to a particular column.
df.set_index('st_name')
year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|
st_name | ||||||
AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... |
WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 6 columns
`.reset_index()` moves the current index into a column and replaces it with a default integer numbering from 0 up to the last row of the DataFrame.
df.reset_index()
index | st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|---|
0 | 0 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1 | 1 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
2 | 2 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
3 | 3 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
4 | 4 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1703 | 1703 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
1704 | 1704 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
1705 | 1705 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1706 | 1706 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
1707 | 1707 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 8 columns
Finally, you can manually set the index to something else by simply reassigning the `index` attribute of the DataFrame or Series.
df.index = np.arange(len(df)-1, -1, -1)
df
st_name | year | c_beer | beer_tax | btax_dollars | population | salestax | |
---|---|---|---|---|---|---|---|
1707 | AL | 1970 | 33098 | 72.341130 | 2.370 | 3450 | 4.0 |
1706 | AL | 1971 | 37598 | 69.304600 | 2.370 | 3497 | 4.0 |
1705 | AL | 1972 | 42719 | 67.149190 | 2.370 | 3539 | 4.0 |
1704 | AL | 1973 | 46203 | 63.217026 | 2.370 | 3580 | 4.0 |
1703 | AL | 1974 | 49769 | 56.933796 | 2.370 | 3627 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
4 | WY | 1999 | 12423 | 0.319894 | 0.045 | 492 | 4.0 |
3 | WY | 2000 | 12595 | 0.309491 | 0.045 | 494 | 4.0 |
2 | WY | 2001 | 12808 | 0.300928 | 0.045 | 494 | 4.0 |
1 | WY | 2002 | 13191 | 0.296244 | 0.045 | 499 | 4.0 |
0 | WY | 2003 | 15535 | 0.289643 | 0.045 | 501 | 4.0 |
1708 rows × 7 columns
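All three index-modification techniques can be sketched together on a toy DataFrame (made-up values, not the beer dataset): promote a column with `set_index`, undo it with `reset_index`, and assign the `index` attribute directly.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"st_name": ["AL", "WY"], "year": [1970, 2003]})

# set_index promotes a column to the index (returning a copy by default).
indexed = toy.set_index("st_name")
print(list(indexed.index))      # ['AL', 'WY']

# reset_index moves the index back into a column.
restored = indexed.reset_index()
print(list(restored.columns))   # ['st_name', 'year']

# Or assign an index directly, e.g. a reversed numbering.
toy.index = np.arange(len(toy) - 1, -1, -1)
print(list(toy.index))          # [1, 0]
```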