Linear Regression

import pandas as pd
import statsmodels.api as sm


Running linear regressions with pandas DataFrames is easy! Let us begin by loading in a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.

cps_df = pd.read_csv('data/cps.csv')
cps_df
       state  age  wagesal  imm  hispanic  black  asian  educ       wage   logwage  female  fedwkr  statewkr  localwkr
0         11   44    18000    0         0      0      0    14   9.109312  2.209297       1       1         0         0
1         11   39    18000    0         0      0      0    14  18.000000  2.890372       0       0         0         0
2         11   39    35600    0         0      0      0    12  17.115385  2.839978       0       0         0         1
3         11   39     8000    0         0      0      0    14   5.128205  1.634756       1       0         0         0
4         11   39   100000    0         0      0      0    16  38.461538  3.649659       0       1         0         0
...      ...  ...      ...  ...       ...    ...    ...   ...        ...       ...     ...     ...       ...       ...
21902     95   36   125000    0         0      0      0    18  60.096154  4.095946       0       0         1         0
21903     95   38    70000    0         0      0      1    18  26.923077  3.292984       1       0         0         0
21904     95   43    48208    0         0      0      0    14  20.601709  3.025374       1       0         0         0
21905     95   43    75000    0         0      0      0    18  36.057692  3.585120       0       0         0         0
21906     95   44    50000    1         0      0      1    20  24.038462  3.179655       1       0         1         0

21907 rows × 14 columns

statsmodels is a popular Python package used to create and analyze various statistical models. To create a linear regression model in statsmodels, which is generally imported as sm, we can use the following skeleton code:

x = data[[]]                                # select the x-variables as a DataFrame
y = data[]                                  # select the y-variable as a Series
model = sm.OLS(y, sm.add_constant(x))       # initialize an OLS model with an intercept term
result = model.fit()                        # fit the model
result.summary()                            # display the results

In the above code, you begin by selecting your x-variables as a DataFrame and your y-variable as a Series. You then initialize an OLS model, adding an intercept term (with sm.add_constant()) if necessary. Finally, you fit the OLS model and display the results. For example, below we run a regression where we estimate people's log wage (logwage) based on their years of education (educ), race (hispanic, black, asian), and sex (female). Note how we deliberately exclude indicators for male and white from the regression to avoid linear dependence with the intercept.

x = cps_df[['educ','hispanic','black','asian','female']]                                
y = cps_df['logwage']                                  
model = sm.OLS(y, sm.add_constant(x))      
result = model.fit()                        
result.summary() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                logwage   R-squared:                       0.250
Model:                            OLS   Adj. R-squared:                  0.250
Method:                 Least Squares   F-statistic:                     1462.
Date:                Wed, 10 Jan 2024   Prob (F-statistic):               0.00
Time:                        15:13:30   Log-Likelihood:                -19851.
No. Observations:               21907   AIC:                         3.971e+04
Df Residuals:                   21901   BIC:                         3.976e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6476      0.022     73.311      0.000       1.604       1.692
educ           0.1070      0.002     71.139      0.000       0.104       0.110
hispanic      -0.0717      0.011     -6.333      0.000      -0.094      -0.050
black         -0.1250      0.014     -9.249      0.000      -0.152      -0.099
asian         -0.0041      0.017     -0.244      0.807      -0.037       0.029
female        -0.2833      0.008    -34.885      0.000      -0.299      -0.267
==============================================================================
Omnibus:                     1131.830   Durbin-Watson:                   1.852
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3713.696
Skew:                           0.188   Prob(JB):                         0.00
Kurtosis:                       4.982   Cond. No.                         82.6
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression looks reasonable: every coefficient except asian is statistically significant at conventional levels, and the model explains about 25% of the variation in log wages.
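Because the outcome is a log wage, each coefficient is approximately the proportional change in the wage associated with a one-unit change in the regressor; for a dummy variable, the exact implied effect is exp(b) - 1. A quick check on two of the estimates from the table above:

```python
import numpy as np

# Coefficients copied from the summary table above
coefs = {'educ': 0.1070, 'female': -0.2833}

for name, b in coefs.items():
    exact = np.exp(b) - 1  # exact proportional effect implied by a log outcome
    print(f"{name}: approx {b:+.1%}, exact {exact:+.1%}")
```

So an extra year of schooling is associated with roughly an 11% higher wage, and women earn roughly 25% less than otherwise comparable men, holding the other regressors fixed.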

For more detailed information on running various types of regressions, feel free to look at the Econometrics chapter from the online textbook Coding for Economists, or various chapters from the online textbook Causal Inference for The Brave and True.