{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "9aad9273",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"from statsmodels.api import OLS"
]
},
{
"cell_type": "markdown",
"id": "334763be",
"metadata": {},
"source": [
"# Linear Regression"
]
},
{
"cell_type": "markdown",
"id": "038df8c3",
"metadata": {},
"source": [
"Running linear regressions with `pandas` DataFrames is easy! Let us begin by loading a dataset that contains the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d4f26af6",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" state age wagesal imm hispanic black asian educ wage \\\n",
"0 11 44 18000 0 0 0 0 14 9.109312 \n",
"1 11 39 18000 0 0 0 0 14 18.000000 \n",
"2 11 39 35600 0 0 0 0 12 17.115385 \n",
"3 11 39 8000 0 0 0 0 14 5.128205 \n",
"4 11 39 100000 0 0 0 0 16 38.461538 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"21902 95 36 125000 0 0 0 0 18 60.096154 \n",
"21903 95 38 70000 0 0 0 1 18 26.923077 \n",
"21904 95 43 48208 0 0 0 0 14 20.601709 \n",
"21905 95 43 75000 0 0 0 0 18 36.057692 \n",
"21906 95 44 50000 1 0 0 1 20 24.038462 \n",
"\n",
" logwage female fedwkr statewkr localwkr \n",
"0 2.209297 1 1 0 0 \n",
"1 2.890372 0 0 0 0 \n",
"2 2.839978 0 0 0 1 \n",
"3 1.634756 1 0 0 0 \n",
"4 3.649659 0 1 0 0 \n",
"... ... ... ... ... ... \n",
"21902 4.095946 0 0 1 0 \n",
"21903 3.292984 1 0 0 0 \n",
"21904 3.025374 1 0 0 0 \n",
"21905 3.585120 0 0 0 0 \n",
"21906 3.179655 1 0 1 0 \n",
"\n",
"[21907 rows x 14 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cps_df = pd.read_csv('data/cps.csv')\n",
"cps_df"
]
},
{
"cell_type": "markdown",
"id": "cfb44bd0",
"metadata": {},
"source": [
"`statsmodels` is a popular Python package used to create and analyze various statistical models. To create a linear regression model in `statsmodels`, whose `statsmodels.api` module is generally imported as `sm`, we can use the following skeleton code:\n",
"\n",
"    x = data[['x1', 'x2']]   # placeholder names: your x-variable columns\n",
"    y = data['y']            # placeholder name: your y-variable column\n",
"    model = sm.OLS(y, sm.add_constant(x))\n",
"    result = model.fit()\n",
"    result.summary()\n",
"\n",
"In the above code, you begin by selecting your x-variables as a DataFrame and your y-variable as a Series. You then initialize an OLS model, adding an intercept term with `sm.add_constant()` where one is needed, since `sm.OLS` does not include one by default. Finally, you fit the model and display the results. For example, below we regress people's log wage (`logwage`) on their years of education (`educ`), race and ethnicity (`hispanic`, `black`, `asian`), and sex (`female`). Note that we deliberately leave out indicators for `male` and `white`: together with the intercept, an indicator for every category would make the regressors [linearly dependent](https://stats.stackexchange.com/questions/143324/what-is-the-significance-of-a-linear-dependency-in-a-polynomial-regression)."
]
},
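{
"cell_type": "markdown",
"id": "b7e1d2a0",
"metadata": {},
"source": [
"To see why one category is left out, here is a minimal sketch on a tiny made-up sample (the values and the `male` column are illustrative assumptions, not part of the CPS data). Because `male = 1 - female`, placing both indicators next to the constant makes the columns of the design matrix linearly dependent, so the matrix no longer has full column rank."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3f49e15",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy sample: 1 = female, 0 = male (not taken from cps.csv).\n",
"female = np.array([1, 0, 0, 1, 0, 1])\n",
"male = 1 - female\n",
"const = np.ones_like(female)\n",
"\n",
"# Constant plus both indicators: const == female + male, so the columns are dependent.\n",
"X_both = np.column_stack([const, female, male])\n",
"# Constant plus one indicator: the columns are independent.\n",
"X_one = np.column_stack([const, female])\n",
"\n",
"rank_both = np.linalg.matrix_rank(X_both)\n",
"rank_one = np.linalg.matrix_rank(X_one)\n",
"print(f'{X_both.shape[1]} columns with both indicators: rank {rank_both}')\n",
"print(f'{X_one.shape[1]} columns with one indicator: rank {rank_one}')"
]
},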
{
"cell_type": "code",
"execution_count": 3,
"id": "d43d63f9",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & logwage & \\textbf{ R-squared: } & 0.250 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.250 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 1462. \\\\\n",
"\\textbf{Date:} & Wed, 10 Jan 2024 & \\textbf{ Prob (F-statistic):} & 0.00 \\\\\n",
"\\textbf{Time:} & 15:13:30 & \\textbf{ Log-Likelihood: } & -19851. \\\\\n",
"\\textbf{No. Observations:} & 21907 & \\textbf{ AIC: } & 3.971e+04 \\\\\n",
"\\textbf{Df Residuals:} & 21901 & \\textbf{ BIC: } & 3.976e+04 \\\\\n",
"\\textbf{Df Model:} & 5 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{const} & 1.6476 & 0.022 & 73.311 & 0.000 & 1.604 & 1.692 \\\\\n",
"\\textbf{educ} & 0.1070 & 0.002 & 71.139 & 0.000 & 0.104 & 0.110 \\\\\n",
"\\textbf{hispanic} & -0.0717 & 0.011 & -6.333 & 0.000 & -0.094 & -0.050 \\\\\n",
"\\textbf{black} & -0.1250 & 0.014 & -9.249 & 0.000 & -0.152 & -0.099 \\\\\n",
"\\textbf{asian} & -0.0041 & 0.017 & -0.244 & 0.807 & -0.037 & 0.029 \\\\\n",
"\\textbf{female} & -0.2833 & 0.008 & -34.885 & 0.000 & -0.299 & -0.267 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & 1131.830 & \\textbf{ Durbin-Watson: } & 1.852 \\\\\n",
"\\textbf{Prob(Omnibus):} & 0.000 & \\textbf{ Jarque-Bera (JB): } & 3713.696 \\\\\n",
"\\textbf{Skew:} & 0.188 & \\textbf{ Prob(JB): } & 0.00 \\\\\n",
"\\textbf{Kurtosis:} & 4.982 & \\textbf{ Cond. No. } & 82.6 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: logwage R-squared: 0.250\n",
"Model: OLS Adj. R-squared: 0.250\n",
"Method: Least Squares F-statistic: 1462.\n",
"Date: Wed, 10 Jan 2024 Prob (F-statistic): 0.00\n",
"Time: 15:13:30 Log-Likelihood: -19851.\n",
"No. Observations: 21907 AIC: 3.971e+04\n",
"Df Residuals: 21901 BIC: 3.976e+04\n",
"Df Model: 5 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const 1.6476 0.022 73.311 0.000 1.604 1.692\n",
"educ 0.1070 0.002 71.139 0.000 0.104 0.110\n",
"hispanic -0.0717 0.011 -6.333 0.000 -0.094 -0.050\n",
"black -0.1250 0.014 -9.249 0.000 -0.152 -0.099\n",
"asian -0.0041 0.017 -0.244 0.807 -0.037 0.029\n",
"female -0.2833 0.008 -34.885 0.000 -0.299 -0.267\n",
"==============================================================================\n",
"Omnibus: 1131.830 Durbin-Watson: 1.852\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 3713.696\n",
"Skew: 0.188 Prob(JB): 0.00\n",
"Kurtosis: 4.982 Cond. No. 82.6\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = cps_df[['educ','hispanic','black','asian','female']] \n",
"y = cps_df['logwage'] \n",
"model = sm.OLS(y, sm.add_constant(x)) \n",
"result = model.fit() \n",
"result.summary() "
]
},
{
"cell_type": "markdown",
"id": "0c1467e0",
"metadata": {},
"source": [
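"The regression ran without problems. The coefficient on `educ` is 0.1070, so each additional year of education is associated with roughly 11% higher hourly wages, holding the other variables fixed. The R-squared of 0.250 means the model explains about a quarter of the variation in log wages.\n",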
"\n",
"For more detailed information on running various types of regressions, feel free to look at the [`Econometrics` chapter](https://aeturrell.github.io/coding-for-economists/econmt-regression.html) from the online textbook [Coding for Economists](https://aeturrell.github.io/coding-for-economists/intro.html), or various chapters from the online textbook [Causal Inference for The Brave and True](https://matheusfacure.github.io/python-causality-handbook/landing-page.html)."
]
},
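{
"cell_type": "markdown",
"id": "e58a07b2",
"metadata": {},
"source": [
"Because the dependent variable is a log wage, a coefficient `b` corresponds to a proportional change of `exp(b) - 1` in the wage itself. The sketch below illustrates this on simulated data (the sample size, seed, and true coefficients are made-up assumptions, and the `sim_` names are hypothetical so they do not overwrite anything fitted earlier). The same `params` and `conf_int()` calls should work on the `result` object from the CPS regression."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2069cd4",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"\n",
"# Simulated stand-in for the CPS sample; all numbers here are assumptions.\n",
"rng = np.random.default_rng(0)\n",
"n = 5000\n",
"sim_df = pd.DataFrame({\n",
"    'educ': rng.integers(8, 21, size=n),\n",
"    'female': rng.integers(0, 2, size=n),\n",
"})\n",
"sim_df['logwage'] = (1.6 + 0.10 * sim_df['educ'] - 0.28 * sim_df['female']\n",
"                     + rng.normal(0, 0.5, size=n))\n",
"\n",
"sim_x = sm.add_constant(sim_df[['educ', 'female']])\n",
"sim_result = sm.OLS(sim_df['logwage'], sim_x).fit()\n",
"\n",
"# Point estimates (a Series indexed by variable name) and 95% confidence intervals.\n",
"coefs = sim_result.params\n",
"ci = sim_result.conf_int()\n",
"\n",
"# Convert log-point coefficients into proportional wage changes.\n",
"pct_change = np.exp(coefs[['educ', 'female']]) - 1\n",
"print(pct_change)"
]
},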
{
"cell_type": "code",
"execution_count": null,
"id": "94d355c7-ffcf-4865-ac0f-056bf4ade721",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:sklearn-env]",
"language": "python",
"name": "conda-env-sklearn-env-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}