Linear Regression

import pandas as pd
import statsmodels.api as sm


Running linear regressions with pandas DataFrames is easy! Let us begin by loading in a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.

cps_df = pd.read_csv('data/cps.csv')
cps_df
       state  age  wagesal  imm  hispanic  black  asian  educ       wage   logwage  female  fedwkr  statewkr  localwkr
0         11   44    18000    0         0      0      0    14   9.109312  2.209297       1       1         0         0
1         11   39    18000    0         0      0      0    14  18.000000  2.890372       0       0         0         0
2         11   39    35600    0         0      0      0    12  17.115385  2.839978       0       0         0         1
3         11   39     8000    0         0      0      0    14   5.128205  1.634756       1       0         0         0
4         11   39   100000    0         0      0      0    16  38.461538  3.649659       0       1         0         0
...      ...  ...      ...  ...       ...    ...    ...   ...        ...       ...     ...     ...       ...       ...
21902     95   36   125000    0         0      0      0    18  60.096154  4.095946       0       0         1         0
21903     95   38    70000    0         0      0      1    18  26.923077  3.292984       1       0         0         0
21904     95   43    48208    0         0      0      0    14  20.601709  3.025374       1       0         0         0
21905     95   43    75000    0         0      0      0    18  36.057692  3.585120       0       0         0         0
21906     95   44    50000    1         0      0      1    20  24.038462  3.179655       1       0         1         0

21907 rows × 14 columns

statsmodels is a popular Python package used to create and analyze various statistical models. To create a linear regression model in statsmodels, which is generally imported as sm, we can use the following skeleton code:

x = data[[]]                                # select the x-variables as a DataFrame
y = data[]                                  # select the y-variable as a Series
model = sm.OLS(y, sm.add_constant(x))       # initialize an OLS model with an intercept term
result = model.fit()                        # fit the model
result.summary()                            # display the results

In the above code, you begin by selecting your x-variables as a DataFrame and your y-variable as a Series. You then initialize an OLS model, adding an intercept term (with sm.add_constant()) if necessary. Finally, you fit the OLS model and display the results. For example, below we run a regression where we estimate people's log wage (logwage) based on their years of education (educ), race (hispanic, black, asian), and sex (female). Note how we deliberately exclude indicators for male and white from the regression to avoid linear dependence with the intercept.

x = cps_df[['educ','hispanic','black','asian','female']]                                
y = cps_df['logwage']                                  
model = sm.OLS(y, sm.add_constant(x))      
result = model.fit()                        
result.summary() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                logwage   R-squared:                       0.250
Model:                            OLS   Adj. R-squared:                  0.250
Method:                 Least Squares   F-statistic:                     1462.
Date:                Wed, 10 Jan 2024   Prob (F-statistic):               0.00
Time:                        15:13:30   Log-Likelihood:                -19851.
No. Observations:               21907   AIC:                         3.971e+04
Df Residuals:                   21901   BIC:                         3.976e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6476      0.022     73.311      0.000       1.604       1.692
educ           0.1070      0.002     71.139      0.000       0.104       0.110
hispanic      -0.0717      0.011     -6.333      0.000      -0.094      -0.050
black         -0.1250      0.014     -9.249      0.000      -0.152      -0.099
asian         -0.0041      0.017     -0.244      0.807      -0.037       0.029
female        -0.2833      0.008    -34.885      0.000      -0.299      -0.267
==============================================================================
Omnibus:                     1131.830   Durbin-Watson:                   1.852
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3713.696
Skew:                           0.188   Prob(JB):                         0.00
Kurtosis:                       4.982   Cond. No.                         82.6
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression looks reasonable: every coefficient except asian is statistically significant at conventional levels, and the model explains about 25% of the variation in log wages.
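Because the outcome is a log wage, each coefficient is approximately the proportional change in the wage associated with a one-unit change in the regressor; for a dummy variable, the exact implied effect is exp(b) - 1. A quick check on two of the estimates from the table above:

```python
import numpy as np

# Coefficients copied from the summary table above
coefs = {'educ': 0.1070, 'female': -0.2833}

for name, b in coefs.items():
    exact = np.exp(b) - 1  # exact proportional effect implied by a log outcome
    print(f"{name}: approx {b:+.1%}, exact {exact:+.1%}")
```

So an extra year of schooling is associated with roughly an 11% higher wage, and women earn roughly 25% less than otherwise comparable men, holding the other regressors fixed.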

For more detailed information on running various types of regressions, feel free to look at the Econometrics chapter from the online textbook Coding for Economists, or various chapters from the online textbook Causal Inference for The Brave and True.