{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "9aad9273", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import statsmodels.api as sm\n", "from statsmodels.api import OLS" ] }, { "cell_type": "markdown", "id": "334763be", "metadata": {}, "source": [ "# Linear Regression" ] }, { "cell_type": "markdown", "id": "038df8c3", "metadata": {}, "source": [ "Running linear regressions with `pandas` DataFrames is easy! Let us begin by loading a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey." ] }, { "cell_type": "code", "execution_count": 2, "id": "d4f26af6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>state</th><th>age</th><th>wagesal</th><th>imm</th><th>hispanic</th><th>black</th><th>asian</th><th>educ</th><th>wage</th><th>logwage</th><th>female</th><th>fedwkr</th><th>statewkr</th><th>localwkr</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>11</td><td>44</td><td>18000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>14</td><td>9.109312</td><td>2.209297</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>11</td><td>39</td><td>18000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>14</td><td>18.000000</td><td>2.890372</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>11</td><td>39</td><td>35600</td><td>0</td><td>0</td><td>0</td><td>0</td><td>12</td><td>17.115385</td><td>2.839978</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>11</td><td>39</td><td>8000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>14</td><td>5.128205</td><td>1.634756</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>11</td><td>39</td><td>100000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>16</td><td>38.461538</td><td>3.649659</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>21902</th><td>95</td><td>36</td><td>125000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>18</td><td>60.096154</td><td>4.095946</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>21903</th><td>95</td><td>38</td><td>70000</td><td>0</td><td>0</td><td>0</td><td>1</td><td>18</td><td>26.923077</td><td>3.292984</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>21904</th><td>95</td><td>43</td><td>48208</td><td>0</td><td>0</td><td>0</td><td>0</td><td>14</td><td>20.601709</td><td>3.025374</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>21905</th><td>95</td><td>43</td><td>75000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>18</td><td>36.057692</td><td>3.585120</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>21906</th><td>95</td><td>44</td><td>50000</td><td>1</td><td>0</td><td>0</td><td>1</td><td>20</td><td>24.038462</td><td>3.179655</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>21907 rows × 14 columns</p>
\n" ], "text/plain": [ " state age wagesal imm hispanic black asian educ wage \\\n", "0 11 44 18000 0 0 0 0 14 9.109312 \n", "1 11 39 18000 0 0 0 0 14 18.000000 \n", "2 11 39 35600 0 0 0 0 12 17.115385 \n", "3 11 39 8000 0 0 0 0 14 5.128205 \n", "4 11 39 100000 0 0 0 0 16 38.461538 \n", "... ... ... ... ... ... ... ... ... ... \n", "21902 95 36 125000 0 0 0 0 18 60.096154 \n", "21903 95 38 70000 0 0 0 1 18 26.923077 \n", "21904 95 43 48208 0 0 0 0 14 20.601709 \n", "21905 95 43 75000 0 0 0 0 18 36.057692 \n", "21906 95 44 50000 1 0 0 1 20 24.038462 \n", "\n", " logwage female fedwkr statewkr localwkr \n", "0 2.209297 1 1 0 0 \n", "1 2.890372 0 0 0 0 \n", "2 2.839978 0 0 0 1 \n", "3 1.634756 1 0 0 0 \n", "4 3.649659 0 1 0 0 \n", "... ... ... ... ... ... \n", "21902 4.095946 0 0 1 0 \n", "21903 3.292984 1 0 0 0 \n", "21904 3.025374 1 0 0 0 \n", "21905 3.585120 0 0 0 0 \n", "21906 3.179655 1 0 1 0 \n", "\n", "[21907 rows x 14 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cps_df = pd.read_csv('data/cps.csv')\n", "cps_df" ] }, { "cell_type": "markdown", "id": "cfb44bd0", "metadata": {}, "source": [ "`statsmodels` is a popular Python package for creating and analyzing statistical models. To create a linear regression model with `statsmodels.api`, which is generally imported as `sm`, we can use the following skeleton code:\n", "\n", "    x = data[['x1', 'x2']]  # placeholder column names for your x-variables\n", "    y = data['y']           # placeholder column name for your y-variable\n", "    model = sm.OLS(y, sm.add_constant(x))\n", "    result = model.fit()\n", "    result.summary()\n", "\n", "In the above code, you begin by selecting your x-variables as a DataFrame and your y-variable as a Series. You then initialize an OLS model, adding an intercept term with `sm.add_constant()` if you want one (`sm.OLS` does not add it by default). Finally, you fit the OLS model and display the results. For example, below we run a regression where we estimate people's log wage (`logwage`) based on their years of education (`educ`), race (`hispanic`, `black`, `asian`) and sex (`female`).\n", "\n", "Note how we deliberately do not include indicators for the sex `male` and the race `white` in our regression. Including every category alongside the intercept would create a [linear dependency](https://stats.stackexchange.com/questions/143324/what-is-the-significance-of-a-linear-dependency-in-a-polynomial-regression) among the regressors (for example, `male` is exactly `1 - female`), so the omitted categories serve as the reference group against which the other coefficients are measured." ] }, { "cell_type": "code", "execution_count": 3, "id": "d43d63f9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
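<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr><th>Dep. Variable:</th><td>logwage</td><th>R-squared:</th><td>0.250</td></tr>\n",
"<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.250</td></tr>\n",
"<tr><th>Method:</th><td>Least Squares</td><th>F-statistic:</th><td>1462.</td></tr>\n",
"<tr><th>Date:</th><td>Wed, 10 Jan 2024</td><th>Prob (F-statistic):</th><td>0.00</td></tr>\n",
"<tr><th>Time:</th><td>15:13:30</td><th>Log-Likelihood:</th><td>-19851.</td></tr>\n",
"<tr><th>No. Observations:</th><td>21907</td><th>AIC:</th><td>3.971e+04</td></tr>\n",
"<tr><th>Df Residuals:</th><td>21901</td><th>BIC:</th><td>3.976e+04</td></tr>\n",
"<tr><th>Df Model:</th><td>5</td><th></th><td></td></tr>\n",
"<tr><th>Covariance Type:</th><td>nonrobust</td><th></th><td></td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n",
"<tr><th>const</th><td>1.6476</td><td>0.022</td><td>73.311</td><td>0.000</td><td>1.604</td><td>1.692</td></tr>\n",
"<tr><th>educ</th><td>0.1070</td><td>0.002</td><td>71.139</td><td>0.000</td><td>0.104</td><td>0.110</td></tr>\n",
"<tr><th>hispanic</th><td>-0.0717</td><td>0.011</td><td>-6.333</td><td>0.000</td><td>-0.094</td><td>-0.050</td></tr>\n",
"<tr><th>black</th><td>-0.1250</td><td>0.014</td><td>-9.249</td><td>0.000</td><td>-0.152</td><td>-0.099</td></tr>\n",
"<tr><th>asian</th><td>-0.0041</td><td>0.017</td><td>-0.244</td><td>0.807</td><td>-0.037</td><td>0.029</td></tr>\n",
"<tr><th>female</th><td>-0.2833</td><td>0.008</td><td>-34.885</td><td>0.000</td><td>-0.299</td><td>-0.267</td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><th>Omnibus:</th><td>1131.830</td><th>Durbin-Watson:</th><td>1.852</td></tr>\n",
"<tr><th>Prob(Omnibus):</th><td>0.000</td><th>Jarque-Bera (JB):</th><td>3713.696</td></tr>\n",
"<tr><th>Skew:</th><td>0.188</td><th>Prob(JB):</th><td>0.00</td></tr>\n",
"<tr><th>Kurtosis:</th><td>4.982</td><th>Cond. No.</th><td>82.6</td></tr>\n",
"</table><br/><br/>Notes:<br/>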
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & logwage & \\textbf{ R-squared: } & 0.250 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.250 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 1462. \\\\\n", "\\textbf{Date:} & Wed, 10 Jan 2024 & \\textbf{ Prob (F-statistic):} & 0.00 \\\\\n", "\\textbf{Time:} & 15:13:30 & \\textbf{ Log-Likelihood: } & -19851. \\\\\n", "\\textbf{No. Observations:} & 21907 & \\textbf{ AIC: } & 3.971e+04 \\\\\n", "\\textbf{Df Residuals:} & 21901 & \\textbf{ BIC: } & 3.976e+04 \\\\\n", "\\textbf{Df Model:} & 5 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{const} & 1.6476 & 0.022 & 73.311 & 0.000 & 1.604 & 1.692 \\\\\n", "\\textbf{educ} & 0.1070 & 0.002 & 71.139 & 0.000 & 0.104 & 0.110 \\\\\n", "\\textbf{hispanic} & -0.0717 & 0.011 & -6.333 & 0.000 & -0.094 & -0.050 \\\\\n", "\\textbf{black} & -0.1250 & 0.014 & -9.249 & 0.000 & -0.152 & -0.099 \\\\\n", "\\textbf{asian} & -0.0041 & 0.017 & -0.244 & 0.807 & -0.037 & 0.029 \\\\\n", "\\textbf{female} & -0.2833 & 0.008 & -34.885 & 0.000 & -0.299 & -0.267 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 1131.830 & \\textbf{ Durbin-Watson: } & 1.852 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.000 & \\textbf{ Jarque-Bera (JB): } & 3713.696 \\\\\n", "\\textbf{Skew:} & 0.188 & \\textbf{ Prob(JB): } & 0.00 \\\\\n", "\\textbf{Kurtosis:} & 4.982 & \\textbf{ Cond. No. } & 82.6 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: logwage R-squared: 0.250\n", "Model: OLS Adj. R-squared: 0.250\n", "Method: Least Squares F-statistic: 1462.\n", "Date: Wed, 10 Jan 2024 Prob (F-statistic): 0.00\n", "Time: 15:13:30 Log-Likelihood: -19851.\n", "No. Observations: 21907 AIC: 3.971e+04\n", "Df Residuals: 21901 BIC: 3.976e+04\n", "Df Model: 5 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 1.6476 0.022 73.311 0.000 1.604 1.692\n", "educ 0.1070 0.002 71.139 0.000 0.104 0.110\n", "hispanic -0.0717 0.011 -6.333 0.000 -0.094 -0.050\n", "black -0.1250 0.014 -9.249 0.000 -0.152 -0.099\n", "asian -0.0041 0.017 -0.244 0.807 -0.037 0.029\n", "female -0.2833 0.008 -34.885 0.000 -0.299 -0.267\n", "==============================================================================\n", "Omnibus: 1131.830 Durbin-Watson: 1.852\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3713.696\n", "Skew: 0.188 Prob(JB): 0.00\n", "Kurtosis: 4.982 Cond. No. 82.6\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = cps_df[['educ','hispanic','black','asian','female']] \n", "y = cps_df['logwage'] \n", "model = sm.OLS(y, sm.add_constant(x)) \n", "result = model.fit() \n", "result.summary() " ] }, { "cell_type": "markdown", "id": "0c1467e0", "metadata": {}, "source": [ "The regression ran without problems. Holding race and sex fixed, each additional year of education is associated with a log wage that is about 0.107 higher, or roughly 11% higher hourly wages. Every coefficient except `asian` (p = 0.807) is statistically significant at the 5% level, and the model explains about 25% of the variation in log wages (R-squared = 0.250).\n", "\n", "For more detailed information on running various types of regressions, feel free to look at the [`Econometrics` chapter](https://aeturrell.github.io/coding-for-economists/econmt-regression.html) from the online textbook [Coding for Economists](https://aeturrell.github.io/coding-for-economists/intro.html), or various chapters from the online textbook [Causal Inference for The Brave and True](https://matheusfacure.github.io/python-causality-handbook/landing-page.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "94d355c7-ffcf-4865-ac0f-056bf4ade721", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sklearn-env]", "language": "python", "name": "conda-env-sklearn-env-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }