import pandas as pd
Syntax#
The online textbook Coding for Economists has a great section summarizing the differences in syntax between Stata and Python. This chapter does not aim to replace the key material included there. Instead, I hope it can serve as a supplement for those transitioning from Stata to Python.
In particular, this chapter describes stata2python, a Python package that takes Stata commands and returns the equivalent Python code.
stata2python#
Installation and Import#
To install stata2python, use the package installer pip by running the following command in your terminal:
pip install --upgrade stata2python
After you have installed the package, you can import it into your Jupyter notebook. Alternatively, you can open a Python shell (by typing python3 in your terminal) and use the package there. To import it in your notebook or shell, the following syntax is encouraged:
from stata2python import stata2python
We import the package below.
from stata2python import stata2python
We also load sample NBA, (log) wage, and pollution datasets that will be used to demonstrate the features of the package.
nba = pd.read_csv("data/nba.csv") # NBA dataset
nba.head()
 | married | wage | exper | age | coll | games | minutes | guard | forward | center | points | rebounds | assists | draft | allstar | avgmin | black | children
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1002.5 | 4 | 27 | 4 | 77 | 2867 | 1 | 0 | 0 | 16 | 4 | 5 | 19.0 | 0 | 37.23 | 1 | 0 |
1 | 1 | 2030.0 | 5 | 28 | 4 | 78 | 2789 | 1 | 0 | 0 | 13 | 3 | 9 | 28.0 | 0 | 35.76 | 1 | 1 |
2 | 0 | 650.0 | 1 | 25 | 4 | 74 | 1149 | 0 | 0 | 1 | 6 | 3 | 0 | 19.0 | 0 | 15.53 | 1 | 0 |
3 | 0 | 2030.0 | 5 | 28 | 4 | 47 | 1178 | 0 | 1 | 0 | 7 | 5 | 2 | 1.0 | 0 | 25.06 | 1 | 0 |
4 | 0 | 755.0 | 3 | 24 | 4 | 82 | 2096 | 1 | 0 | 0 | 11 | 4 | 3 | 24.0 | 0 | 25.56 | 1 | 0 |
wages = pd.read_csv("data/la.csv") # (Log) Wages dataset
wages.head()
 | hispanic | citizen | black | exp | wage | female | education
---|---|---|---|---|---|---|---
0 | 1 | 1 | 0 | 14.0 | 5.288462 | 1 | 9 |
1 | 0 | 1 | 0 | 14.7 | 8.461538 | 1 | 13 |
2 | 0 | 1 | 0 | 14.7 | 10.416667 | 1 | 13 |
3 | 0 | 1 | 0 | 14.0 | 21.634615 | 1 | 14 |
4 | 1 | 0 | 0 | 12.0 | 3.365385 | 1 | 12 |
pollution = pd.read_csv("data/pollution.csv") # Pollution dataset
pollution.head()
 | year | countryname | countrycode | gdp | gdppc | co2 | co2pc | population | oecd
---|---|---|---|---|---|---|---|---|---
0 | 2010 | Zambia | ZMB | 9.799629e+09 | 741.4421 | 2427.554 | 0.183669 | 13216985 | 0.0 |
1 | 2010 | French Polynesia | PYF | NaN | NaN | 883.747 | 3.296764 | 268065 | 0.0 |
2 | 2010 | Monaco | MCO | NaN | NaN | NaN | NaN | 36845 | 0.0 |
3 | 2010 | Ukraine | UKR | 9.057726e+10 | 1974.6212 | 304804.720 | 6.644867 | 45870700 | 0.0 |
4 | 2010 | Venezuela, RB | VEN | 1.750000e+11 | 6010.0270 | 201747.340 | 6.946437 | 29043283 | 0.0 |
Usage#
Currently, stata2python only supports the commands needed to teach an introductory course in econometrics. If you would like to contribute, feel free to create a pull request here. Below, we explain how to use the package and walk through the commands it currently supports.
Importing the package via from stata2python import stata2python gives you access to a function named stata2python. You can pass any Stata command to this function as a string; if the command is supported, the function outputs Python code equivalent to the Stata command.
Optionally, you may also specify the name of the DataFrame you are working with via the df_name parameter. The default value of df_name is simply df.
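For example, the two calls below ask for the translation of describe against a DataFrame named nba. The keyword form is an assumption based on the documented parameter name; the rest of this chapter passes the name positionally.

from stata2python import stata2python
# Positional form, used throughout this chapter
stata2python("describe", "nba")
# Keyword form, assuming df_name is exposed as a keyword argument
stata2python("describe", df_name="nba")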
Note that the generated Python code will (naturally) require you to have pandas installed, and may also require the Python packages numpy, scipy, matplotlib, and statsmodels. The generated code includes all necessary imports (including import pandas as pd).
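If any of these are missing from your environment, they can be installed with pip as well; the command below is one way to do so (package versions are up to you).

pip install pandas numpy scipy matplotlib statsmodels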
Below, we discuss each feature currently supported by stata2python, along with example usage.
Generating new columns#
Used for generating new columns. For example,
stata2python("gen degree = (coll >= 4)")
import pandas as pd
df['degree'] = df['coll'] >= 4
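One small caveat: the comparison above yields a boolean (True/False) column, whereas Stata's gen with a logical expression yields a 0/1 indicator. If you want the 0/1 behaviour, you can cast the result yourself; the line below is our suggestion and is not part of the package's output.

df['degree'] = (df['coll'] >= 4).astype(int)  # 0/1 indicator, matching Stata's gen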
stata2python("gen productivity = points/(minutes/games)","nba")
import pandas as pd
nba['productivity'] = nba['points']/(nba['minutes']/nba['games'])
Assuming you have the required packages installed, you can copy and paste this code directly to see the output. For example,
import pandas as pd
nba['productivity'] = nba['points']/(nba['minutes']/nba['games'])
nba # You can see that the productivity column was successfully generated
 | married | wage | exper | age | coll | games | minutes | guard | forward | center | points | rebounds | assists | draft | allstar | avgmin | black | children | productivity
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1002.5 | 4 | 27 | 4 | 77 | 2867 | 1 | 0 | 0 | 16 | 4 | 5 | 19.0 | 0 | 37.23 | 1 | 0 | 0.429717 |
1 | 1 | 2030.0 | 5 | 28 | 4 | 78 | 2789 | 1 | 0 | 0 | 13 | 3 | 9 | 28.0 | 0 | 35.76 | 1 | 1 | 0.363571 |
2 | 0 | 650.0 | 1 | 25 | 4 | 74 | 1149 | 0 | 0 | 1 | 6 | 3 | 0 | 19.0 | 0 | 15.53 | 1 | 0 | 0.386423 |
3 | 0 | 2030.0 | 5 | 28 | 4 | 47 | 1178 | 0 | 1 | 0 | 7 | 5 | 2 | 1.0 | 0 | 25.06 | 1 | 0 | 0.279287 |
4 | 0 | 755.0 | 3 | 24 | 4 | 82 | 2096 | 1 | 0 | 0 | 11 | 4 | 3 | 24.0 | 0 | 25.56 | 1 | 0 | 0.430344 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
264 | 1 | 3210.0 | 7 | 29 | 4 | 79 | 2638 | 1 | 0 | 0 | 20 | 3 | 3 | 11.0 | 1 | 33.39 | 1 | 0 | 0.598939 |
265 | 1 | 715.0 | 5 | 31 | 4 | 75 | 1084 | 0 | 1 | 0 | 5 | 3 | 1 | 54.0 | 0 | 14.45 | 1 | 1 | 0.345941 |
266 | 1 | 600.0 | 11 | 33 | 3 | 67 | 1197 | 1 | 0 | 0 | 10 | 2 | 2 | 4.0 | 0 | 17.87 | 1 | 1 | 0.559733 |
267 | 0 | 2500.0 | 6 | 28 | 4 | 78 | 2113 | 0 | 0 | 1 | 16 | 6 | 2 | 2.0 | 0 | 27.09 | 0 | 0 | 0.590629 |
268 | 0 | 2000.0 | 12 | 33 | 3 | 30 | 282 | 0 | 1 | 0 | 2 | 3 | 1 | 5.0 | 0 | 9.40 | 1 | 0 | 0.212766 |
269 rows × 19 columns
Describing the Data#
Used for summarizing the DataFrame. For example,
stata2python("describe","nba")
import pandas as pd
nba.describe()
Verifying that the output works:
import pandas as pd
nba.describe()
 | married | wage | exper | age | coll | games | minutes | guard | forward | center | points | rebounds | assists | draft | allstar | avgmin | black | children | productivity
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 240.00000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 | 269.000000 |
mean | 0.442379 | 1423.827509 | 5.118959 | 27.394052 | 3.717472 | 65.724907 | 1682.193309 | 0.420074 | 0.408922 | 0.171004 | 10.260223 | 4.468401 | 2.453532 | 20.20000 | 0.115242 | 23.979257 | 0.806691 | 0.345725 | 0.409677 |
std | 0.497595 | 999.774074 | 3.400062 | 3.391292 | 0.754410 | 18.851110 | 893.327771 | 0.494491 | 0.492551 | 0.377214 | 5.882489 | 2.892980 | 2.148124 | 18.73582 | 0.319909 | 9.731086 | 0.395629 | 0.476491 | 0.114920 |
min | 0.000000 | 150.000000 | 1.000000 | 21.000000 | 0.000000 | 3.000000 | 33.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.00000 | 0.000000 | 2.890000 | 0.000000 | 0.000000 | 0.093851 |
25% | 0.000000 | 650.000000 | 2.000000 | 25.000000 | 4.000000 | 57.000000 | 983.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2.000000 | 1.000000 | 7.00000 | 0.000000 | 16.730000 | 1.000000 | 0.000000 | 0.338710 |
50% | 0.000000 | 1186.000000 | 4.000000 | 27.000000 | 4.000000 | 74.000000 | 1690.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 4.000000 | 2.000000 | 14.50000 | 0.000000 | 24.820000 | 1.000000 | 0.000000 | 0.408560 |
75% | 1.000000 | 2014.500000 | 7.000000 | 30.000000 | 4.000000 | 79.000000 | 2438.000000 | 1.000000 | 1.000000 | 0.000000 | 14.000000 | 6.000000 | 3.000000 | 28.25000 | 0.000000 | 33.260000 | 1.000000 | 1.000000 | 0.471168 |
max | 1.000000 | 5740.000000 | 18.000000 | 41.000000 | 4.000000 | 82.000000 | 3533.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | 17.000000 | 13.000000 | 139.00000 | 1.000000 | 43.090000 | 1.000000 | 1.000000 | 0.748532 |
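Note that pandas' describe() summarizes only numeric columns by default. If your DataFrame also contains string or categorical columns, you can ask pandas to include them; this is standard pandas behaviour rather than something stata2python generates.

nba.describe(include='all')  # also reports counts and unique values for non-numeric columns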
Correlation Matrix#
Used for generating a correlation matrix between relevant variables. For example,
stata2python("corr points assists rebounds", "nba")
import pandas as pd
nba[['points', 'assists', 'rebounds']].corr()
import pandas as pd # Verifying that the output works
nba[['points', 'assists', 'rebounds']].corr()
 | points | assists | rebounds
---|---|---|---
points | 1.000000 | 0.539269 | 0.563324 |
assists | 0.539269 | 1.000000 | 0.059956 |
rebounds | 0.563324 | 0.059956 | 1.000000 |
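If you instead want correlations across every numeric column (the analogue of running corr in Stata without a variable list), pandas can do this directly. We have not checked whether stata2python translates the bare corr command, so the line below is a manual pandas sketch.

nba.corr(numeric_only=True)  # pairwise correlations over all numeric columns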
Scatter Plots#
Used for making scatter (or twoway) plots. For example,
stata2python("twoway (scatter co2pc population, xtitle('co2pc') ytitle('population'))","pollution")
import pandas as pd
import matplotlib.pyplot as plt
plt.scatter(pollution['co2pc'], pollution['population']);
plt.xlabel('co2pc');
plt.ylabel('population');
import pandas as pd # Verifying that the output works
import matplotlib.pyplot as plt
plt.scatter(pollution['co2pc'], pollution['population']);
plt.xlabel('co2pc');
plt.ylabel('population');

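As an aside, pandas also exposes a plotting shortcut that wraps matplotlib; the line below is a manual alternative to the generated code above, not additional package output.

pollution.plot.scatter(x='co2pc', y='population');  # pandas wrapper around plt.scatter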
Histograms#
Used for generating histograms. For example,
stata2python("histogram co2pc, bin(80)", "pollution")
import pandas as pd
import matplotlib.pyplot as plt
pollution.hist(column='co2pc',bins=80);
import pandas as pd # Verifying that the output works
import matplotlib.pyplot as plt
pollution.hist(column='co2pc',bins=80);

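Equivalently, you can call matplotlib directly; the snippet below is a manual alternative (not package output) that drops missing values explicitly and labels the axes.

import matplotlib.pyplot as plt
plt.hist(pollution['co2pc'].dropna(), bins=80);
plt.xlabel('co2pc');
plt.ylabel('Frequency');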
T-tests#
Used for generating the code to run t-tests in Python. Examples include:
stata2python("ttest wage, by(guard)")
import pandas as pd
import numpy as np
from scipy import stats
### First, we must filter the DataFrame to obtain the right values
catvar_vals = np.unique(df['guard'])
df_1 = df[df['guard'] == catvar_vals[0]]
df_2 = df[df['guard'] == catvar_vals[1]]
### Then, we can run our t-test
stats.ttest_ind(df_1['wage'], df_2['wage'], equal_var=True, nan_policy='propagate')
stata2python("ttest wage, by(guard) unequal", "nba")
import pandas as pd
import numpy as np
from scipy import stats
### First, we must filter the DataFrame to obtain the right values
catvar_vals = np.unique(nba['guard'])
df_1 = nba[nba['guard'] == catvar_vals[0]]
df_2 = nba[nba['guard'] == catvar_vals[1]]
### Then, we can run our t-test
stats.ttest_ind(df_1['wage'], df_2['wage'], equal_var=False, nan_policy='propagate')
Verifying that the output works:
import pandas as pd
import numpy as np
from scipy import stats
### First, we must filter the DataFrame to obtain the right values
catvar_vals = np.unique(nba['guard'])
df_1 = nba[nba['guard'] == catvar_vals[0]]
df_2 = nba[nba['guard'] == catvar_vals[1]]
### Then, we can run our t-test
stats.ttest_ind(df_1['wage'], df_2['wage'], equal_var=False, nan_policy='propagate')
TtestResult(statistic=2.1432820571177977, pvalue=0.03299634994484977, df=266.3682612357414)
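The object returned by scipy's ttest_ind exposes the test statistic and p-value as attributes, which is handy if you want to use them programmatically rather than just display them.

result = stats.ttest_ind(df_1['wage'], df_2['wage'], equal_var=False, nan_policy='propagate')
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")  # t = 2.143, p = 0.0330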
Regressions#
Used for running regressions; the generated code includes Python comments explaining the relevant steps. Note that not all features of Stata's reg command are supported. Examples include:
stata2python("reg wage exp", "wages")
import pandas as pd
import statsmodels.api as sm
### Dropping NaN/missing values
wages_no_na = wages[['exp','wage']].dropna()
### Below, we extract the relevant variables from the DataFrame
x_df = wages_no_na['exp']
y_df = wages_no_na['wage']
### We now define the model, fit it to the data and then view a summary of the results
model = sm.OLS(y_df, sm.add_constant(x_df))
result = model.fit()
result.summary()
import pandas as pd # Verifying that the output works
import statsmodels.api as sm
### Dropping NaN/missing values
wages_no_na = wages[['exp','wage']].dropna()
### Below, we extract the relevant variables from the DataFrame
x_df = wages_no_na['exp']
y_df = wages_no_na['wage']
### We now define the model, fit it to the data and then view a summary of the results
model = sm.OLS(y_df, sm.add_constant(x_df))
result = model.fit()
result.summary()
Dep. Variable: | wage | R-squared: | 0.001 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.001 |
Method: | Least Squares | F-statistic: | 0.4714 |
Date: | Sun, 04 Feb 2024 | Prob (F-statistic): | 0.493 |
Time: | 21:26:39 | Log-Likelihood: | -3541.0 |
No. Observations: | 863 | AIC: | 7086. |
Df Residuals: | 861 | BIC: | 7096. |
Df Model: | 1 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 16.0816 | 3.917 | 4.105 | 0.000 | 8.393 | 23.771 |
exp | -0.2135 | 0.311 | -0.687 | 0.493 | -0.824 | 0.397 |
Omnibus: | 1263.062 | Durbin-Watson: | 1.830 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 413853.334 |
Skew: | 8.176 | Prob(JB): | 0.00 |
Kurtosis: | 109.028 | Cond. No. | 99.5 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
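If you want heteroskedasticity-robust standard errors (the analogue of Stata's reg wage exp, robust), statsmodels supports this directly through the cov_type argument of fit(). We have not checked whether stata2python translates the robust option, so the lines below are a manual sketch reusing the model object defined above.

result_robust = model.fit(cov_type='HC1')  # HC1 corresponds to Stata's default robust VCE
result_robust.summary()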
stata2python("reg wage exp, vce(cluster education)")
import pandas as pd
import statsmodels.api as sm
### We first perform some string manipulations to ensure we are appropriately accounting for strings.
### This section seems complicated as this code is meant to be a general-purpose code for converting the Stata commands to Python.
### You can greatly simplify this step if you know exactly which variables you are interested in; simply extract those variables from the DataFrame directly.
y_var = "wage"
clustering_vars = ['education']
### Dropping NaN/missing values
df_no_na = df[['exp', 'education','wage']].dropna()
### Below, we extract the relevant variables from the DataFrame
x_df = df_no_na['exp']
y_df = df_no_na['wage']
### We now define the model, fit it to the data and then view a summary of the results
model = sm.OLS(y_df, sm.add_constant(x_df))
result = model.fit(cov_type='cluster', cov_kwds={'groups': df_no_na['education']})
result.summary()
stata2python("reg wage exp i.female, vce(cluster education)", "wages")
import pandas as pd
import statsmodels.api as sm
### We first perform some string manipulations to ensure we are appropriately accounting for strings.
### This section seems complicated as this code is meant to be a general-purpose code for converting the Stata commands to Python.
### You can greatly simplify this step if you know exactly which variables you are interested in; simply extract those variables from the DataFrame directly.
y_var = "wage"
clustering_vars = ['education']
wages_with_dummies = wages.copy()
x_var = "'exp', 'i.female'"
x_var_new = "'" + "', '".join([i.strip("'") for i in x_var.split(", ") if not i.strip("'").startswith("i.")]) + "'"
### The below section adds in relevant indicator variables, ensuring they have interpretable names
indicator_list = [m.strip("'") for m in x_var.split('i.')[1:]]
for indicator in indicator_list:
    dummies = pd.get_dummies(wages_with_dummies[indicator])
    dummies = dummies.iloc[:,1:]
    dummies.columns = [str(x) + '_' + str(indicator) for x in dummies.columns]
    wages_with_dummies = pd.concat([wages_with_dummies,dummies],axis=1)
    x_var_new = x_var_new + ", '" + "', '".join(dummies.columns) + "'"
x_var = x_var_new
### This helps ensure the clustered variables are extracted from the DataFrame
if clustering_vars:
    x_var_temp = x_var + ", '" + "', '".join(clustering_vars) + "'"
var_list = x_var_temp +", '"+ str(y_var) + "'"
### Dropping NaN/missing values
wages_with_dummies_no_na = wages_with_dummies[[i.strip("'") for i in var_list.split(", ")]].dropna()
### Below, we extract the relevant variables from the DataFrame
x_df = wages_with_dummies_no_na[[i.strip("'") for i in x_var.split(", ")]]
y_df = wages_with_dummies_no_na['wage']
### We now define the model, fit it to the data and then view a summary of the results
model = sm.OLS(y_df, sm.add_constant(x_df))
result = model.fit(cov_type='cluster', cov_kwds={'groups': wages_with_dummies_no_na['education']})
result.summary()
import pandas as pd # Verifying that the output works
import statsmodels.api as sm
### We first perform some string manipulations to ensure we are appropriately accounting for strings.
### This section seems complicated as this code is meant to be a general-purpose code for converting the Stata commands to Python.
### You can greatly simplify this step if you know exactly which variables you are interested in; simply extract those variables from the DataFrame directly.
y_var = "wage"
clustering_vars = ['education']
wages_with_dummies = wages.copy()
x_var = "'exp', 'i.female'"
x_var_new = "'" + "', '".join([i.strip("'") for i in x_var.split(", ") if not i.strip("'").startswith("i.")]) + "'"
### The below section adds in relevant indicator variables, ensuring they have interpretable names
indicator_list = [m.strip("'") for m in x_var.split('i.')[1:]]
for indicator in indicator_list:
    dummies = pd.get_dummies(wages_with_dummies[indicator])
    dummies = dummies.iloc[:,1:]
    dummies.columns = [str(x) + '_' + str(indicator) for x in dummies.columns]
    wages_with_dummies = pd.concat([wages_with_dummies,dummies],axis=1)
    x_var_new = x_var_new + ", '" + "', '".join(dummies.columns) + "'"
x_var = x_var_new
### This helps ensure the clustered variables are extracted from the DataFrame
if clustering_vars:
    x_var_temp = x_var + ", '" + "', '".join(clustering_vars) + "'"
var_list = x_var_temp +", '"+ str(y_var) + "'"
### Dropping NaN/missing values
wages_with_dummies_no_na = wages_with_dummies[[i.strip("'") for i in var_list.split(", ")]].dropna()
### Below, we extract the relevant variables from the DataFrame
x_df = wages_with_dummies_no_na[[i.strip("'") for i in x_var.split(", ")]]
y_df = wages_with_dummies_no_na['wage']
### We now define the model, fit it to the data and then view a summary of the results
model = sm.OLS(y_df, sm.add_constant(x_df))
result = model.fit(cov_type='cluster', cov_kwds={'groups': wages_with_dummies_no_na['education']})
result.summary()
Dep. Variable: | wage | R-squared: | 0.007 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.005 |
Method: | Least Squares | F-statistic: | 1.993 |
Date: | Sun, 04 Feb 2024 | Prob (F-statistic): | 0.183 |
Time: | 21:26:54 | Log-Likelihood: | -3538.0 |
No. Observations: | 863 | AIC: | 7082. |
Df Residuals: | 860 | BIC: | 7096. |
Df Model: | 2 | |
Covariance Type: | cluster | |
 | coef | std err | z | P>\|z\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 17.4218 | 3.698 | 4.712 | 0.000 | 10.175 | 24.669 |
exp | -0.2373 | 0.206 | -1.153 | 0.249 | -0.640 | 0.166 |
1_female | -2.4604 | 1.261 | -1.951 | 0.051 | -4.932 | 0.011 |
Omnibus: | 1260.198 | Durbin-Watson: | 1.842 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 409372.994 |
Skew: | 8.142 | Prob(JB): | 0.00 |
Kurtosis: | 108.449 | Cond. No. | 101. |
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
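Finally, if the string manipulations in the generated code above feel heavy, note that the same clustered regression with an indicator for female can be written far more compactly with statsmodels' formula API. The snippet below is a manual alternative for reference, not output from stata2python.

import statsmodels.formula.api as smf
### Drop missing values in the variables we use, then fit OLS with clustered standard errors
wages_no_na = wages[['wage', 'exp', 'female', 'education']].dropna()
model = smf.ols('wage ~ exp + C(female)', data=wages_no_na)
result = model.fit(cov_type='cluster', cov_kwds={'groups': wages_no_na['education']})
result.summary()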