import pandas as pd
import statsmodels.api as sm
# Linear Regression
Running linear regressions with pandas DataFrames is easy! Let us begin by loading in a dataset that contains the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.
cps_df = pd.read_csv('data/cps.csv')
cps_df
|  | state | age | wagesal | imm | hispanic | black | asian | educ | wage | logwage | female | fedwkr | statewkr | localwkr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 44 | 18000 | 0 | 0 | 0 | 0 | 14 | 9.109312 | 2.209297 | 1 | 1 | 0 | 0 |
| 1 | 11 | 39 | 18000 | 0 | 0 | 0 | 0 | 14 | 18.000000 | 2.890372 | 0 | 0 | 0 | 0 |
| 2 | 11 | 39 | 35600 | 0 | 0 | 0 | 0 | 12 | 17.115385 | 2.839978 | 0 | 0 | 0 | 1 |
| 3 | 11 | 39 | 8000 | 0 | 0 | 0 | 0 | 14 | 5.128205 | 1.634756 | 1 | 0 | 0 | 0 |
| 4 | 11 | 39 | 100000 | 0 | 0 | 0 | 0 | 16 | 38.461538 | 3.649659 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21902 | 95 | 36 | 125000 | 0 | 0 | 0 | 0 | 18 | 60.096154 | 4.095946 | 0 | 0 | 1 | 0 |
| 21903 | 95 | 38 | 70000 | 0 | 0 | 0 | 1 | 18 | 26.923077 | 3.292984 | 1 | 0 | 0 | 0 |
| 21904 | 95 | 43 | 48208 | 0 | 0 | 0 | 0 | 14 | 20.601709 | 3.025374 | 1 | 0 | 0 | 0 |
| 21905 | 95 | 43 | 75000 | 0 | 0 | 0 | 0 | 18 | 36.057692 | 3.585120 | 0 | 0 | 0 | 0 |
| 21906 | 95 | 44 | 50000 | 1 | 0 | 0 | 1 | 20 | 24.038462 | 3.179655 | 1 | 0 | 1 | 0 |

21907 rows × 14 columns
statsmodels is a popular Python package used to create and analyze various statistical models. To create a linear regression model in statsmodels, which is conventionally imported as `sm`, we can use the following skeleton code:
x = data[[]]    # x-variables, selected as a DataFrame
y = data[]      # y-variable, selected as a Series
model = sm.OLS(y, sm.add_constant(x))   # sm.add_constant() adds the intercept term
result = model.fit()
result.summary()
In the above code, you begin by selecting your x-variables as a DataFrame and your y-variable as a Series. You then initialize an OLS model, adding an intercept term (with `sm.add_constant()`) if necessary. Finally, you fit the OLS model and display the results. For example, below we run a regression where we estimate people's log wage (`logwage`) based on their years of schooling (`educ`), race (`hispanic`, `black`, `asian`), and sex (`female`). Note how we deliberately do not include the sex `male` or the race `white` in our regression, to avoid linear dependence among the regressors.
x = cps_df[['educ','hispanic','black','asian','female']]
y = cps_df['logwage']
model = sm.OLS(y, sm.add_constant(x))
result = model.fit()
result.summary()
| Dep. Variable: | logwage | R-squared: | 0.250 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.250 |
| Method: | Least Squares | F-statistic: | 1462. |
| Date: | Wed, 10 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 15:13:30 | Log-Likelihood: | -19851. |
| No. Observations: | 21907 | AIC: | 3.971e+04 |
| Df Residuals: | 21901 | BIC: | 3.976e+04 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.6476 | 0.022 | 73.311 | 0.000 | 1.604 | 1.692 |
| educ | 0.1070 | 0.002 | 71.139 | 0.000 | 0.104 | 0.110 |
| hispanic | -0.0717 | 0.011 | -6.333 | 0.000 | -0.094 | -0.050 |
| black | -0.1250 | 0.014 | -9.249 | 0.000 | -0.152 | -0.099 |
| asian | -0.0041 | 0.017 | -0.244 | 0.807 | -0.037 | 0.029 |
| female | -0.2833 | 0.008 | -34.885 | 0.000 | -0.299 | -0.267 |
| Omnibus: | 1131.830 | Durbin-Watson: | 1.852 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3713.696 |
| Skew: | 0.188 | Prob(JB): | 0.00 |
| Kurtosis: | 4.982 | Cond. No. | 82.6 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression looks good! Reading the coefficient table: holding the other variables fixed, each additional year of schooling is associated with roughly a 10.7% higher wage, and every coefficient except `asian` is statistically significant at conventional levels.
For more detailed information on running various types of regressions, see the Econometrics chapter of the online textbook Coding for Economists, or various chapters from the online textbook Causal Inference for The Brave and True.