import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.graph_objects as go

Basic Visualizations#

In this subchapter, we will go over some key types of plots, as well as how to create them using two of the most common plotting libraries in Python: matplotlib and seaborn. First, however, we need to load the dataset we will be using for this analysis.

For this analysis, we will be loading in a subset of the World Bank’s Education Statistics (EdStats) Dataset. The subset we will be working with has data on many education metrics/indicators for 25 regions from 1970 to 2014. We have cleaned this dataset to ensure we don’t have any missing datapoints. There are too many indicators to describe them all here. We highly encourage you to explore the World Bank’s data directly if you’re interested!

wb_df = pd.read_csv('data/wb_educ.csv')
wb_df
Country Name Country Code Indicator Name Indicator Code 1970 1971 1972 1973 1974 1975 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
0 Arab World ARB Adjusted net enrolment rate, primary, both sex... SE.PRM.TENR 5.482212e+01 5.489414e+01 5.620944e+01 5.726711e+01 5.799114e+01 5.936554e+01 ... 8.248949e+01 8.268551e+01 8.328034e+01 8.401187e+01 8.419596e+01 8.521200e+01 8.524514e+01 8.610167e+01 8.551194e+01 8.532015e+01
1 Arab World ARB Adjusted net enrolment rate, primary, female (%) SE.PRM.TENR.FE 4.335110e+01 4.331815e+01 4.464070e+01 4.584572e+01 4.644950e+01 4.836389e+01 ... 7.947577e+01 7.960833e+01 8.058242e+01 8.147380e+01 8.169569e+01 8.287165e+01 8.286139e+01 8.440141e+01 8.391403e+01 8.382083e+01
2 Arab World ARB Adjusted net enrolment rate, primary, gender p... UIS.NERA.1.GPI 6.585700e-01 6.564000e-01 6.632900e-01 6.720400e-01 6.726100e-01 6.917600e-01 ... 9.308400e-01 9.296200e-01 9.384600e-01 9.425000e-01 9.434700e-01 9.476200e-01 9.467000e-01 9.620800e-01 9.640900e-01 9.662000e-01
3 Arab World ARB Adjusted net enrolment rate, primary, male (%) SE.PRM.TENR.MA 6.582623e+01 6.599358e+01 6.730186e+01 6.821908e+01 6.905901e+01 6.991455e+01 ... 8.538029e+01 8.563508e+01 8.586669e+01 8.644414e+01 8.659069e+01 8.745258e+01 8.752652e+01 8.772817e+01 8.703988e+01 8.675339e+01
4 Arab World ARB Adjusted net intake rate to Grade 1 of primary... UIS.NIRA.1 5.244892e+01 5.248975e+01 5.263559e+01 5.332700e+01 5.418465e+01 5.486563e+01 ... 7.613177e+01 7.522256e+01 7.559570e+01 7.489394e+01 7.481455e+01 7.619704e+01 7.647497e+01 7.731937e+01 7.656671e+01 7.662057e+01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3395 World WLD Survival rate to the last grade of primary edu... SE.PRM.PRSL.MA.ZS 6.707542e+01 6.657861e+01 6.629349e+01 6.593887e+01 6.625532e+01 6.682665e+01 ... 6.825568e+01 6.920012e+01 6.931534e+01 7.040000e+01 6.939960e+01 6.849098e+01 6.795392e+01 6.678090e+01 6.694540e+01 7.019475e+01
3396 World WLD Teachers in primary education, both sexes (num... SE.PRM.TCHR 1.428878e+07 1.459513e+07 1.529481e+07 1.582020e+07 1.627415e+07 1.674470e+07 ... 2.688926e+07 2.709888e+07 2.752526e+07 2.777079e+07 2.814018e+07 2.864318e+07 2.910380e+07 2.938328e+07 3.047234e+07 3.067875e+07
3397 World WLD Teachers in primary education, female (number) SE.PRM.TCHR.FE 7.874428e+06 8.001568e+06 8.262296e+06 8.498361e+06 8.663721e+06 8.810588e+06 ... 1.665298e+07 1.679720e+07 1.703848e+07 1.723032e+07 1.753421e+07 1.796218e+07 1.832742e+07 1.859735e+07 1.931428e+07 1.952334e+07
3398 World WLD Teachers in secondary education, both sexes (n... SE.SEC.TCHR 1.172561e+07 1.213650e+07 1.269944e+07 1.330712e+07 1.374928e+07 1.413025e+07 ... 2.843753e+07 2.910475e+07 3.010238e+07 3.105815e+07 3.162800e+07 3.221992e+07 3.249828e+07 3.269860e+07 3.235300e+07 3.286152e+07
3399 World WLD Teachers in tertiary education programmes, bot... SE.TER.TCHR 2.752715e+06 2.815567e+06 2.938241e+06 3.047319e+06 3.181207e+06 3.332724e+06 ... 9.191711e+06 9.480116e+06 1.000710e+07 1.054032e+07 1.069500e+07 1.113321e+07 1.158810e+07 1.190016e+07 1.216735e+07 1.264785e+07

3400 rows × 49 columns

Matplotlib and Seaborn#

Matplotlib and Seaborn are both plotting libraries useful for making graphs in Python. They both support a variety of different types of plots, and we shall give examples of various plots in each below. We have imported both the libraries at the top of the subchapter. Importantly, in order to view matplotlib plots in Jupyter notebooks, you have to include the %matplotlib inline command we used above (this is a line magic command used by Jupyter notebooks, read more here). As always, you are encouraged to read Matplotlib’s and Seaborn’s documentation.

Fun fact: seaborn is actually built on top of matplotlib to make creating basic plots simpler and require less code. All seaborn plots are by definition also matplotlib plots. For example, when you’re plotting two variables from the same DataFrame against each other:

  • In matplotlib, you have to pull both of them out as Series and pass them into the function call. For example: plt.scatter(x = df['x_name'], y = df['y_name']).

  • In seaborn, you can simply pass in the DataFrame name and then the names of the variables. For example: sns.scatterplot(data = df, x = 'x_name', y = 'y_name').

Small things like this can help save a lot of time when producing quick plots. However, seaborn is less customizable than matplotlib and cannot produce all the plots matplotlib can.
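To make the comparison concrete, here is a minimal sketch that draws the same scatter plot both ways. The DataFrame `df` and its column names `x_name` and `y_name` are made up for illustration and are not part of the World Bank data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data, used only to compare the two calling styles
df = pd.DataFrame({'x_name': [1, 2, 3, 4, 5],
                   'y_name': [2, 4, 5, 4, 6]})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# matplotlib: pull each column out as a Series and pass it in
ax_mpl.scatter(x=df['x_name'], y=df['y_name'])
ax_mpl.set_title('matplotlib')

# seaborn: pass the DataFrame and refer to the columns by name;
# seaborn also labels the axes with the column names
sns.scatterplot(data=df, x='x_name', y='y_name', ax=ax_sns)
ax_sns.set_title('seaborn');
```

Note that the seaborn panel gets its axis labels from the column names, while the matplotlib panel stays unlabeled until we set the labels ourselves.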

Matplotlib Plot Customizations#

Regardless of the type of plot and the library you choose to create the plot, you should always give it a title, set axis labels, etc. There are several miscellaneous function calls you can use to set these settings regardless of plot. As seaborn is built on top of matplotlib, these commands don’t differ between these two packages. We have listed a few of these commands below, but the list is not at all comprehensive. Read here for more details.
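As a minimal sketch, the cell below applies a handful of commonly used pyplot customization calls to a plot of made-up numbers. This is only a selection chosen for illustration, and the data and label text are placeholders.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))                       # set the figure size (in inches)
plt.plot([1, 2, 3], [2, 4, 8], label='example')  # made-up data to customize

plt.title('Plot title')       # title above the plot
plt.xlabel('x-axis label')    # label for the horizontal axis
plt.ylabel('y-axis label')    # label for the vertical axis
plt.xticks(rotation=45)       # rotate the x-axis tick labels
plt.xlim(0, 4)                # limits of the horizontal axis
plt.ylim(0, 10)               # limits of the vertical axis
plt.legend();                 # show a legend built from the label arguments
```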

Note: The list above assumes you have imported matplotlib.pyplot as plt. If not, you will need to replace plt with matplotlib.pyplot (or whatever other alias you choose to use).

Bar Charts#

Although our dataset has complete information for several metrics for each of the 25 regions, it doesn’t have data on the exact same metrics for each region. We have a lot more data from some regions than from others, as can be seen from the .value_counts() below.

wb_df['Country Name'].value_counts()
Latin America & Caribbean (excluding high income)     176
Latin America & Caribbean                             176
Lower middle income                                   174
World                                                 166
Euro area                                             166
Middle East & North Africa                            166
Middle East & North Africa (excluding high income)    165
South Asia                                            164
High income                                           163
Middle income                                         156
Arab World                                            143
Low & middle income                                   143
Heavily indebted poor countries (HIPC)                137
Least developed countries: UN classification          135
European Union                                        134
Europe & Central Asia                                 133
Low income                                            128
OECD members                                          128
Sub-Saharan Africa                                    124
Sub-Saharan Africa (excluding high income)            124
Upper middle income                                    89
East Asia & Pacific                                    86
North America                                          84
Europe & Central Asia (excluding high income)          73
East Asia & Pacific (excluding high income)            67
Name: Country Name, dtype: int64

Let us attempt to visualize this. As we have a qualitative nominal variable (the region) and a quantitative discrete variable (the count) here, a bar chart would be helpful. Bar charts display a bar for each category, and the length of each bar is proportional to the frequency of the corresponding category. While not necessary, most bar charts have equally spaced and equally wide bars.

We have made a sample bar chart using matplotlib below; as always, read the documentation for more information.

plt.bar(wb_df['Country Name'].value_counts().index, wb_df['Country Name'].value_counts())
plt.title('Number of Metrics for each Region')
plt.xlabel('Region Name')
plt.ylabel('Count');
../../_images/96b98ca0b0b4faef86d997969705bb4c6e14f03a43068414c8b078f827c67dc3.png

It is impossible to read any of the labels on the x-axis of the above bar chart as they all overlap. This is an example of overplotting, which is defined as data/labels on a plot overlapping, making it difficult to see what’s going on. To fix this, we could rotate the x-axis labels so they’re vertical, as done below.

plt.bar(wb_df['Country Name'].value_counts().index, wb_df['Country Name'].value_counts())
plt.xticks(rotation=90)
plt.title('Number of Metrics for each Region')
plt.xlabel('Region Name')
plt.ylabel('Count');
../../_images/b0b18eb6b965b216b571936c11ed255e5296695b288454e6f036b767670ead8f.png

Alternatively, you can easily make this plot in seaborn using the .countplot() command. As you can see, seaborn automatically gave the plot a color scheme and x and y-axis labels (matplotlib would not do this without us specifying the labels). However, we still have to use plt.xticks() to avoid overplotting.

sns.countplot(data = wb_df, x = 'Country Name');
plt.xticks(rotation=90);
../../_images/a165935cbb8b884f64f8619292192c8f5597be2d14e796248e2a220cae4806e2.png

Another way to avoid overplotting would be to make a horizontal bar chart instead. We have done so using seaborn below.

sns.countplot(data = wb_df, y = 'Country Name')
plt.title('Number of Metrics for each Region')
plt.xlabel('Count') # In a horizontal bar chart, the counts are on the x-axis
plt.ylabel('Region Name');
../../_images/2071a29ba1ae6c0b2b6b3de62c3590474887ee4847774b14cd572d334be09426.png
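A horizontal bar chart can also be made directly in matplotlib with plt.barh. The sketch below uses a small made-up Series of region labels rather than wb_df, so the names and counts are purely illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical region labels, standing in for wb_df['Country Name']
regions = pd.Series(['North', 'South', 'South', 'East', 'East', 'East'])

# Sort ascending: barh draws the first bar at the bottom,
# so the most frequent category ends up at the top
counts = regions.value_counts().sort_values()

plt.figure()
plt.barh(counts.index, counts.values)
plt.title('Number of Rows for each Region (toy data)')
plt.xlabel('Count')
plt.ylabel('Region Name');
```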

Finally, if you would like to also sort these values from highest to lowest (similar to how the .value_counts() command sorted the matplotlib bar chart in descending order), you can use the order parameter.

sns.countplot(data = wb_df, y = 'Country Name', order = wb_df['Country Name'].value_counts().index)
plt.title('Number of Metrics for each Region')
plt.xlabel('Count') # In a horizontal bar chart, the counts are on the x-axis
plt.ylabel('Region Name');
../../_images/2c6c6760c02518cbfb637aa5e15b92c0b6e1f06c5ad962dfd7de53d09fcbab79.png

Scatter Plots#

Scatter plots are used to visualize the relationship between two numerical variables. They help us infer the association between two variables. The association between two variables refers to how one variable changes with respect to the other. We can describe the association between two variables based on two factors:

  1. Magnitude: Is the association strong or weak? If the points on the scatter plot all line up along a straight line (in any direction), it means that the association between the variables is strong. On the other hand, if the points are all spread out and scattered (no pun intended), it means that the association is weak.

  2. Direction (or sign): Is the association positive or negative? If one variable increases as the other variable increases, the association between the two variables is positive. If one decreases as the other increases, the association is negative.
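Although this subchapter judges association by eye, one common way to put a number on both properties is the Pearson correlation coefficient, which we bring in here only as an illustration. Its sign gives the direction and its absolute value (between 0 and 1) gives the strength of a linear association. The arrays below are made up.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y_strong_pos = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])  # close to the line y = 2x + 1
y_strong_neg = -y_strong_pos                                # same points, flipped direction
y_weak = np.array([3, 1, 4, 1, 5, 2])                       # no clear pattern

def pearson_r(a, b):
    """Return the Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r_pos = pearson_r(x, y_strong_pos)   # near +1: strong, positive
r_neg = pearson_r(x, y_strong_neg)   # near -1: strong, negative
r_weak = pearson_r(x, y_weak)        # near 0: weak

print(r_pos, r_neg, r_weak)
```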

For example, let us say we’re interested in plotting the percent of males vs females enrolled in primary school each year across the entire world, using data from 1970 to 2014. We have done this using matplotlib below. As you can see, while it’s easy to extract columns as series, in order to plot row-wise data, we must extract individual rows and then turn them into NumPy arrays.

males = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, male (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
females = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, female (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
plt.scatter(x = males, y = females);
plt.title('Proportion enrolled in Primary School, using data from 1970 to 2014')
plt.xlabel('Percent of Males')
plt.ylabel('Percent of Females');
../../_images/9331ca79b48594a0ce462499619bd45076dae652a57acf0fd22ad46d53515dbb.png
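The row-extraction pattern used above can be wrapped in a small helper. In the sketch below, `get_row_values` is a hypothetical name, and `toy_df` is a made-up DataFrame that only mimics the layout of wb_df (four descriptive columns followed by one column per year).

```python
import pandas as pd

# Made-up data with the same layout as wb_df
toy_df = pd.DataFrame({
    'Country Name': ['World', 'World'],
    'Country Code': ['WLD', 'WLD'],
    'Indicator Name': ['Rate, male (%)', 'Rate, female (%)'],
    'Indicator Code': ['X.MA', 'X.FE'],
    '1970': [70.0, 60.0],
    '1971': [71.0, 62.0],
    '1972': [72.0, 64.0],
})

def get_row_values(df, country, indicator):
    """Return the yearly values of one (country, indicator) row as a NumPy array."""
    mask = (df['Country Name'] == country) & (df['Indicator Name'] == indicator)
    # Skip the 4 descriptive columns, then take the first (only) matching row
    return df.loc[mask].iloc[:, 4:].to_numpy()[0]

toy_males = get_row_values(toy_df, 'World', 'Rate, male (%)')
toy_females = get_row_values(toy_df, 'World', 'Rate, female (%)')
```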

We could’ve also made this plot with seaborn.

males = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, male (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
females = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, female (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
sns.scatterplot(x = males, y = females);
plt.title('Proportion enrolled in Primary School, using data from 1970 to 2014')
plt.xlabel('Percent of Males')
plt.ylabel('Percent of Females');
../../_images/58e0ec5889c0e544ace9033aa2bf8227783fb394fd4ef5f72433a801f207e0eb.png

Finally, to add some context to our visualization, let us also plot a y = x line, to represent where the data would be if there were gender equality in primary school enrollment. Notice how we have also added a legend here.

males = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, male (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
females = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, female (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:,4:])[0]
sns.scatterplot(x = males, y = females, label = 'Global Data from 1970 to 2014');
plt.title('Proportion enrolled in Primary School, using data from 1970 to 2014')
plt.xlabel('Percent of Males')
plt.ylabel('Percent of Females');
min_val = min(np.min(males), np.min(females))
max_val = max(np.max(males), np.max(females))
plt.plot([min_val, max_val], [min_val, max_val], color = 'red', label = 'y = x line');
plt.legend();
../../_images/67966c25830fbab2e6166b7615450b02bf583b7ee669f0f23f68a9be5275ea92.png

Based on the scatter plots above, it appears that there is a positive association between the percent of males and the percent of females enrolled in primary school. As the points roughly follow a straight line, we can say that the association appears to be strong.

As a bonus, we can use the more advanced Plotly library to make Python plots where you can hover over data points to get more information on the general trend. This has been done below as an example, and you are always encouraged to read Plotly’s documentation for more information.

males = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, male (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:, 4:])[0]
females = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, female (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:, 4:])[0]

# Create a Figure
fig = go.Figure()

# Add a trace with the scatter plot data points
fig.add_trace(go.Scatter(
    x=males,
    y=females,
    mode='markers',
    marker=dict(color='blue'),
    text=[str(year) for year in wb_df.iloc[:, 4:].columns], # hover text includes years
    hoverinfo='text',
    name = 'Global Data from 1970 to 2014'
))

# Add y = x line
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    line=dict(color='red', dash='dash'),
    name='y = x Line'
))

# Set title/axis labels
fig.update_layout(
    title='Proportion enrolled in Primary School, using data from 1970 to 2014',
    xaxis=dict(title='Percent of Males'),
    yaxis=dict(title='Percent of Females'),
    legend=dict(x=0.05, y=0.9)
)

fig # Show the plot

With the help of these hover points, we can see that we have trended closer to gender equality in primary school enrollment over time. Great news!
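A trend like this can also be checked numerically by looking at the gap between the male and female rates over time. The sketch below uses a few values copied from the Arab World rows in the preview of wb_df at the top of this subchapter (the World rows for these indicators are not visible in that preview), so it illustrates the idea for a different region rather than verifying the World plot.

```python
import numpy as np

# Adjusted net enrolment rate, primary (%), Arab World, selected years,
# copied from the wb_df preview
years_sample = np.array([1970, 1975, 2005, 2010, 2014])
female_rate = np.array([43.35110, 48.36389, 79.47577, 82.87165, 83.82083])
male_rate = np.array([65.82623, 69.91455, 85.38029, 87.45258, 86.75339])

# Percentage-point gap between males and females; 0 would mean equal rates
gap = male_rate - female_rate

for year, g in zip(years_sample, gap):
    print(year, round(g, 2))
```

In these sampled years the gap shrinks at every step, which is the same pattern as points moving toward the y = x line.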

Line Plots#

A line plot is used to display data as a series of points connected by a line. It’s generally used to visualize how a variable changes over time (also known as time series data), often with a time-related variable on the x-axis (minutes, days, months, years, etc.) and a numerical variable on the y-axis.

For example, let’s create a line plot to see how the total enrollment rate in primary school (across both sexes) changed from 1970 to 2014. We’ve done this below with matplotlib.

enrollment = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, both sexes (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:, 4:])[0]
years = np.arange(1970,2015)
plt.plot(years, enrollment)
plt.title('Global Primary School Enrollment Over Time')
plt.xlabel('Year')
plt.ylabel('Percent Enrolled in Primary School');
../../_images/4300ad5cf9f75843a00e9e66f6113749e8c9b16435c832fdc4c91d3e1d452fae.png

We can also do this with seaborn.

enrollment = np.array(wb_df[(wb_df['Indicator Name'] == 'Adjusted net enrolment rate, primary, both sexes (%)')  
     & (wb_df['Country Name'] == 'World')].iloc[:, 4:])[0]
years = np.arange(1970,2015)
sns.lineplot(x = years, y = enrollment)
plt.title('Global Primary School Enrollment Over Time')
plt.xlabel('Year')
plt.ylabel('Percent Enrolled in Primary School');
../../_images/4300ad5cf9f75843a00e9e66f6113749e8c9b16435c832fdc4c91d3e1d452fae.png

It appears that the share of children enrolled in primary school has been rising over time. Great news!
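The years array in the cells above is hard-coded with np.arange. As an alternative sketch, the years can be read off the column labels, which keeps the x-values in step with the data if the set of year columns ever changes. The DataFrame `toy_wide` is made up and only mimics the layout of wb_df.

```python
import pandas as pd

# Made-up single-row frame: 4 descriptive columns, then one column per year
toy_wide = pd.DataFrame({
    'Country Name': ['World'],
    'Country Code': ['WLD'],
    'Indicator Name': ['Rate, both sexes (%)'],
    'Indicator Code': ['X'],
    '1970': [65.0],
    '1971': [66.5],
    '1972': [68.0],
})

# Column labels are strings such as '1970', so convert them to integers
toy_years = toy_wide.columns[4:].astype(int).to_numpy()
toy_values = toy_wide.iloc[0, 4:].to_numpy(dtype=float)

# A Series indexed by year can be handed straight to plt.plot or sns.lineplot
toy_series = pd.Series(toy_values, index=toy_years)
```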

Histograms, Box Plots and Violin Plots#

Histograms, box plots and violin plots are all ways to visualize the distribution of a single quantitative variable. They help you understand the spread of values in the data. To help explain these, we will load another dataset on education from the World Bank. We have obtained this dataset from Data 100’s Course Notes Visualization I.

wb_df_2 = pd.read_csv('data/world_bank.csv').drop(columns=['Unnamed: 0'])
wb_df_2
Continent Country Primary completion rate: Male: % of relevant age group: 2015 Primary completion rate: Female: % of relevant age group: 2015 Lower secondary completion rate: Male: % of relevant age group: 2015 Lower secondary completion rate: Female: % of relevant age group: 2015 Youth literacy rate: Male: % of ages 15-24: 2005-14 Youth literacy rate: Female: % of ages 15-24: 2005-14 Adult literacy rate: Male: % ages 15 and older: 2005-14 Adult literacy rate: Female: % ages 15 and older: 2005-14 ... Access to improved sanitation facilities: % of population: 1990 Access to improved sanitation facilities: % of population: 2015 Child immunization rate: Measles: % of children ages 12-23 months: 2015 Child immunization rate: DTP3: % of children ages 12-23 months: 2015 Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 Children sleeping under treated bed nets: % of children under age 5: 2009-2016 Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 Tuberculosis: Treatment success rate: % of new cases: 2014 Tuberculosis: Cases detection rate: % of new estimated cases: 2015
0 Africa Algeria 106.0 105.0 68.0 85.0 96.0 92.0 83.0 68.0 ... 80.0 88.0 95.0 95.0 66.0 42.0 NaN NaN 88.0 80.0
1 Africa Angola NaN NaN NaN NaN 79.0 67.0 82.0 60.0 ... 22.0 52.0 55.0 64.0 NaN NaN 25.9 28.3 34.0 64.0
2 Africa Benin 83.0 73.0 50.0 37.0 55.0 31.0 41.0 18.0 ... 7.0 20.0 75.0 79.0 23.0 33.0 72.7 25.9 89.0 61.0
3 Africa Botswana 98.0 101.0 86.0 87.0 96.0 99.0 87.0 89.0 ... 39.0 63.0 97.0 95.0 NaN NaN NaN NaN 77.0 62.0
4 Africa Burundi 58.0 66.0 35.0 30.0 90.0 88.0 89.0 85.0 ... 42.0 48.0 93.0 94.0 55.0 43.0 53.8 25.4 91.0 51.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
161 S. America Guyana 87.0 81.0 NaN NaN 92.0 94.0 82.0 87.0 ... 76.0 84.0 99.0 95.0 84.0 29.0 7.4 7.4 69.0 80.0
162 S. America Paraguay 89.0 90.0 71.0 77.0 99.0 98.0 96.0 94.0 ... 52.0 89.0 83.0 93.0 NaN NaN NaN NaN 71.0 87.0
163 S. America Peru 99.0 100.0 84.0 87.0 99.0 99.0 97.0 90.0 ... 53.0 76.0 92.0 90.0 60.0 57.0 NaN NaN 87.0 80.0
164 S. America Suriname 90.0 99.0 36.0 65.0 98.0 99.0 95.0 94.0 ... NaN 79.0 94.0 89.0 76.0 61.0 43.4 0.0 77.0 80.0
165 S. America Uruguay 103.0 104.0 54.0 68.0 98.0 99.0 98.0 99.0 ... 92.0 96.0 96.0 95.0 NaN NaN NaN NaN 75.0 87.0

166 rows × 47 columns

Histograms#

Histograms look quite similar to bar charts, but have a few differences. Histograms follow the area principle and have two defining properties:

  1. As the values on the horizontal axis are numerical and therefore have fixed positions on the number line, the bins are drawn to scale and are contiguous (though some might be empty).

  2. The area of each bar is proportional to the number of entries (or percent of data values) in the corresponding bin. The histogram is said to be drawn on a density scale.
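To see what the density scale means in numbers, the sketch below computes bar heights by hand for a small made-up sample with unequal bin widths and compares them with np.histogram(..., density=True). Each height is the bin's count divided by (total count × bin width), so the bar areas add up to 1.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])  # made-up sample
bin_edges = np.array([0, 2, 4, 6, 10])             # the last bin is twice as wide

counts, _ = np.histogram(values, bins=bin_edges)
widths = np.diff(bin_edges)

# Density-scale height: proportion of the data in the bin, per unit of width
heights = counts / (counts.sum() * widths)
areas = heights * widths  # area of each bar = proportion of data in the bin

# NumPy computes the same heights when density=True
density, _ = np.histogram(values, bins=bin_edges, density=True)

print(counts, heights, areas.sum())
```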

For example, let us visualize the percent of females who completed their primary education across various countries in 2015 using a histogram.

plt.hist(wb_df_2['Primary completion rate: Female: % of relevant age group: 2015'],
         density=True, edgecolor="white"); 
# If you don't specify density = True, it'll plot the counts on the y-axis instead
plt.title('Percent of Females who Completed their Primary Education across Various Countries in 2015')
plt.xlabel('Percent Completing Primary School')
plt.ylabel('Density');
../../_images/3ffdc5486eca6f280090e35c7a167b890440b4a22befb5c70e2ae9551024f042.png

Box Plots and Violin Plots#

A box plot is similar to a histogram in that it also visualizes the distribution of a numerical variable, but it highlights specific summary statistics of the distribution: the lower quartile (the 25th percentile), the median (the 50th percentile) and the upper quartile (the 75th percentile), along with whiskers that reach toward the minimum and maximum. In the vertical box plots below, these values are all denoted by horizontal lines, and the interquartile range (the area between the 25th and 75th percentiles) is shaded.

Let us visualize the same data as above, but with a boxplot instead.

sns.boxplot(data = wb_df_2['Primary completion rate: Female: % of relevant age group: 2015']);
plt.title('Percent of Females who Completed their Primary Education across Various Countries in 2015')
plt.ylabel('Percent Completing Primary School');
../../_images/bd52899dffe9ac9ed9bc1e17c4f7db36ec508034cb3ccccb9c163a1d800ec95f.png

The two whiskers do not necessarily have the same length. By default, each whisker extends from the box to the most extreme data point that lies within 1.5 times the interquartile range of the corresponding quartile. Dots beyond the whiskers indicate outliers in the data.
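As a rough sketch of how the whiskers and outliers are placed under the default 1.5 × IQR rule used by matplotlib and seaborn, the cell below works through a small made-up sample with NumPy. The variable names are our own.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up sample with one extreme value

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# "Fences": the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker stops at the most extreme data point inside its fence
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

# Anything outside the fences is drawn as an individual dot
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the lower whisker spans from 3.25 down to 1 and the upper whisker from 7.75 up to 9, so the two whiskers have different lengths, and the value 30 is flagged as an outlier.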

A violin plot is very similar to a box plot. However, while a box plot uses horizontal lines to indicate the quartiles, a violin plot (as drawn by seaborn by default) marks them with a miniature box plot along a vertical line in the center. Additionally, while the width of the box does not encode any meaning in a box plot, the width of a violin plot at each value indicates the estimated density of the distribution at that value.

sns.violinplot(data=wb_df_2, y="Primary completion rate: Female: % of relevant age group: 2015");
plt.title('Percent of Females who Completed their Primary Education across Various Countries in 2015')
plt.ylabel('Percent Completing Primary School');
../../_images/9716c04999f426586d1399b16d60f7b278dde11acc02233c7f503bf7617419dd.png

Overlaid Histograms and Side-by-Side Box and Violin Plots#

To compare the distribution of a quantitative variable across the different categories of a categorical variable, we can overlay histograms or plot side-by-side box and violin plots. Examples of each are given below to visualize the difference in primary school completion rate for females vs males.

Overlaid Histograms#

data_female = wb_df_2['Primary completion rate: Female: % of relevant age group: 2015']
data_male = wb_df_2['Primary completion rate: Male: % of relevant age group: 2015']
bins = np.arange(25,131,5) # Using the same bins for both histograms

sns.histplot(data_female.dropna(), bins = bins, label='Female', edgecolor='black')
sns.histplot(data_male.dropna(), bins = bins, label='Male', edgecolor='black')

plt.title('Overlaid Histogram of Primary Completion Rate in 2015')
plt.xlabel('Percent Completing Primary School')
plt.ylabel('Frequency')

plt.legend();
../../_images/08f5f34dfeae841d5d05b0e3c3b98bb634dc39ff600a00adbecb603737c8f229.png

Side-by-Side Box Plots#

sns.boxplot(wb_df_2[['Primary completion rate: Female: % of relevant age group: 2015',
                    'Primary completion rate: Male: % of relevant age group: 2015']])
plt.title('Side-by-Side Box Plots of Primary Schooling Completion Rate in 2015')
plt.xlabel('Sex')
plt.xticks([0, 1], ['Female', 'Male']) # Manually setting the x-axis labels to avoid overplotting
plt.ylabel('Percent Completing Primary School');
../../_images/21a137c97e0d53bbe08e9fc4fbcc6bdaec51f86c44a36ca4629b1a07c47987d8.png

Side-by-Side Violin Plots#

sns.violinplot(wb_df_2[['Primary completion rate: Female: % of relevant age group: 2015',
                    'Primary completion rate: Male: % of relevant age group: 2015']])
plt.title('Side-by-Side Violin Plots of Primary Schooling Completion Rate in 2015')
plt.xlabel('Sex')
plt.xticks([0, 1], ['Female', 'Male']) # Manually setting the x-axis labels to avoid overplotting
plt.ylabel('Percent Completing Primary School');
../../_images/55d3110239132d5d53f62ba7ceca94c82a2ab1dbf1a59b2a70a0e6a0a669608e.png
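The side-by-side plots above pass two columns to seaborn and then relabel the ticks by hand. An alternative sketch is to first reshape the data into "long" form with pandas' melt, so that sex becomes its own categorical column. The short column names `Female`, `Male`, `Sex` and `Completion rate` are our own placeholders, and the three rows are copied from the preview of wb_df_2 shown earlier.

```python
import pandas as pd

# Three rows from the wb_df_2 preview, with shortened (hypothetical) column names
wide_df = pd.DataFrame({
    'Country': ['Algeria', 'Benin', 'Botswana'],
    'Female': [105.0, 73.0, 101.0],
    'Male': [106.0, 83.0, 98.0],
})

# One row per (country, sex) pair instead of one column per sex
long_df = wide_df.melt(id_vars='Country',
                       value_vars=['Female', 'Male'],
                       var_name='Sex',
                       value_name='Completion rate')
long_df
```

A long-form table like this can then be passed to seaborn as sns.boxplot(data = long_df, x = 'Sex', y = 'Completion rate') (or sns.violinplot with the same arguments), and the tick labels come from the Sex column automatically.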