import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

More Advanced Visualizations

This chapter discusses some more advanced visualization techniques. Let us begin by loading a World Bank dataset on education.

wb_df = pd.read_csv('data/world_bank.csv').drop(columns=['Unnamed: 0'])
wb_df
Continent Country Primary completion rate: Male: % of relevant age group: 2015 Primary completion rate: Female: % of relevant age group: 2015 Lower secondary completion rate: Male: % of relevant age group: 2015 Lower secondary completion rate: Female: % of relevant age group: 2015 Youth literacy rate: Male: % of ages 15-24: 2005-14 Youth literacy rate: Female: % of ages 15-24: 2005-14 Adult literacy rate: Male: % ages 15 and older: 2005-14 Adult literacy rate: Female: % ages 15 and older: 2005-14 ... Access to improved sanitation facilities: % of population: 1990 Access to improved sanitation facilities: % of population: 2015 Child immunization rate: Measles: % of children ages 12-23 months: 2015 Child immunization rate: DTP3: % of children ages 12-23 months: 2015 Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 Children sleeping under treated bed nets: % of children under age 5: 2009-2016 Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 Tuberculosis: Treatment success rate: % of new cases: 2014 Tuberculosis: Cases detection rate: % of new estimated cases: 2015
0 Africa Algeria 106.0 105.0 68.0 85.0 96.0 92.0 83.0 68.0 ... 80.0 88.0 95.0 95.0 66.0 42.0 NaN NaN 88.0 80.0
1 Africa Angola NaN NaN NaN NaN 79.0 67.0 82.0 60.0 ... 22.0 52.0 55.0 64.0 NaN NaN 25.9 28.3 34.0 64.0
2 Africa Benin 83.0 73.0 50.0 37.0 55.0 31.0 41.0 18.0 ... 7.0 20.0 75.0 79.0 23.0 33.0 72.7 25.9 89.0 61.0
3 Africa Botswana 98.0 101.0 86.0 87.0 96.0 99.0 87.0 89.0 ... 39.0 63.0 97.0 95.0 NaN NaN NaN NaN 77.0 62.0
4 Africa Burundi 58.0 66.0 35.0 30.0 90.0 88.0 89.0 85.0 ... 42.0 48.0 93.0 94.0 55.0 43.0 53.8 25.4 91.0 51.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
161 S. America Guyana 87.0 81.0 NaN NaN 92.0 94.0 82.0 87.0 ... 76.0 84.0 99.0 95.0 84.0 29.0 7.4 7.4 69.0 80.0
162 S. America Paraguay 89.0 90.0 71.0 77.0 99.0 98.0 96.0 94.0 ... 52.0 89.0 83.0 93.0 NaN NaN NaN NaN 71.0 87.0
163 S. America Peru 99.0 100.0 84.0 87.0 99.0 99.0 97.0 90.0 ... 53.0 76.0 92.0 90.0 60.0 57.0 NaN NaN 87.0 80.0
164 S. America Suriname 90.0 99.0 36.0 65.0 98.0 99.0 95.0 94.0 ... NaN 79.0 94.0 89.0 76.0 61.0 43.4 0.0 77.0 80.0
165 S. America Uruguay 103.0 104.0 54.0 68.0 98.0 99.0 98.0 99.0 ... 92.0 96.0 96.0 95.0 NaN NaN NaN NaN 75.0 87.0

166 rows × 47 columns

Transformations

It is very advantageous to work with data that follow a linear relationship, simply because straight lines are easy to interpret. Given a line \(y = mx + c\), you can interpret the slope as saying that ‘a 1-unit increase in the x-variable corresponds to an m-unit increase in the y-variable’, and you can interpret the intercept as saying ‘when the x-variable is 0, the y-variable is c’. However, we often have to work with data that doesn’t appear to be linear.
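As a small illustration of reading a slope and an intercept, the sketch below fits a straight line to a handful of made-up points with np.polyfit. The values are hypothetical and are not taken from the World Bank dataset.

```python
import numpy as np

# Hypothetical, exactly linear data (not from the World Bank dataset)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.5 * x + 10.0

# With degree 1, np.polyfit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print(f"A 1-unit increase in x corresponds to a {slope:.2f}-unit increase in y")
print(f"When x is 0, y is {intercept:.2f}")
```

Because the points lie exactly on a line, the fitted slope and intercept recover the values used to build the data. With real, noisy data they would only be estimates.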

Sometimes, when data does not appear to be linear, we can transform it so that the relationship becomes linear. For example, when we plot each country’s gross national income (GNI) per capita against its male adult literacy rate, the plot does not appear linear at all.

plt.scatter(wb_df['Gross national income per capita, Atlas method: $: 2016'],
            wb_df["Adult literacy rate: Male: % ages 15 and older: 2005-14"]);
plt.title('Male Adult Literacy vs GNI per capita for various countries')
plt.xlabel('GNI per capita (US dollars, 2016)')
plt.ylabel('Male Adult Literacy (%, 2005-2014)');
../../_images/2c054838f427e158d088494d648a3bb80afbb491a50effdc98d4b6287959d9c1.png

However, let us try taking the natural log of the GNI per capita and raising the male adult literacy rate to the 4th power.

plt.scatter(np.log(wb_df['Gross national income per capita, Atlas method: $: 2016']),
            (wb_df["Adult literacy rate: Male: % ages 15 and older: 2005-14"])**4);
plt.title('Male Adult Literacy to the fourth power vs Log GNI per capita for various countries')
plt.xlabel('Log GNI per capita (Log US dollars, 2016)')
plt.ylabel('Male Adult Literacy to the fourth power (%^4, 2005-2014)');
../../_images/777e05580e3befdeae3d00f9a0d24d90e2c2bf623cf22ddb361f6864bce07b46.png

This graph looks a lot more linear! We can even draw a line of best fit. Seaborn’s .lmplot() takes in two columns from a DataFrame and plots them, along with a line of best fit and the confidence interval for that line. So, in the code below, we add two columns holding the transformed variables to the original DataFrame and then plot them.

# Store the transformed variables as new columns so seaborn can refer to them by name
wb_df['log_gni'] = np.log(wb_df['Gross national income per capita, Atlas method: $: 2016'])
wb_df['male adult literacy ** 4'] = (wb_df["Adult literacy rate: Male: % ages 15 and older: 2005-14"])**4
# lmplot draws the scatter plot, the line of best fit, and its confidence band
sns.lmplot(x='log_gni', y='male adult literacy ** 4', data=wb_df)
plt.title('Male Adult Literacy to the fourth power vs Log GNI per capita for various countries')
plt.xlabel('Log GNI per capita (Log US dollars, 2016)')
plt.ylabel('Male Adult Literacy to the fourth power (%^4, 2005-2014)')
# Pad the y-axis slightly beyond the possible range of 0 to 100**4
plt.ylim(0 - (100**3), (100**4) + (100**3));
../../_images/033654a05cc7a8df3b2c04c7b760f249324b557a5ec368c35fba432d8610a33f.png

This looks great! However, we made a logical leap there by knowing to take the log of the x-variable and raise the y-variable to the fourth power. How did we know to do this?

Well, the honest answer is practice. Lots and lots of practice. However, there are guides that can help. For example, there is the Tukey-Mosteller Bulge Diagram, as shown below.

Tukey-Mosteller Bulge Diagram

In the diagram above, each curve represents a direction in which your data can bulge. For example, our original adult literacy vs GNI graph bulged towards the top left, corresponding to the top-left curve of the Tukey-Mosteller diagram. That curve is surrounded by several suggestions, namely \(\log{X}, \sqrt{X}, Y^2\) and \(Y^3\). The pattern is that we want to transform the x-axis so that it increases more slowly and/or transform the y-axis so that it increases faster. From there, we can try different transformations by trial and error until we find something that works! Note that we don’t have to use both an x-axis and a y-axis transformation; often a single transformation is enough!
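The trial-and-error step can be made a little more systematic. The sketch below is one possible approach, not a standard recipe: it builds synthetic data that bulges towards the top left, applies a few candidate transformations, and compares the Pearson correlation of each pair. The data, the candidate list, and the use of correlation as a rough linearity score are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y rises quickly at first, then levels off
x = rng.uniform(500, 50_000, size=200)
y = 3 * np.log(x) + rng.normal(0, 0.3, size=200)

# Candidate transformations suggested by the top-left curve of the diagram
x_transforms = {"x": lambda v: v, "sqrt(x)": np.sqrt, "log(x)": np.log}
y_transforms = {"y": lambda v: v, "y^2": lambda v: v ** 2}

# Pearson correlation for every (x transform, y transform) pair
results = {}
for x_name, fx in x_transforms.items():
    for y_name, fy in y_transforms.items():
        results[(x_name, y_name)] = np.corrcoef(fx(x), fy(y))[0, 1]

# Print the pairs from most to least linear according to |r|
for (x_name, y_name), r in sorted(results.items(), key=lambda kv: -abs(kv[1])):
    print(f"{y_name:>4} vs {x_name:<8} r = {r:.3f}")
```

A correlation close to 1 or -1 only hints that a transformation helps. It is still worth plotting the transformed data, since a high correlation can hide curvature or outliers.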

Kernel Density Estimation

A kernel density estimate (KDE) is a smooth, continuous function that gives us a general idea of how the data are distributed along a variable’s range. You can think of it as approximating a histogram with a single smooth curve. For example, we can add a KDE on top of the histogram made in the previous subchapter.

sns.histplot(wb_df['Primary completion rate: Female: % of relevant age group: 2015'],kde=True,stat='density'); 
plt.title('Percent of Females who Completed their Primary Education across Various Countries in 2015')
plt.xlabel('Percent Completing Primary School')
plt.ylabel('Density');
../../_images/a1ed5c251333d066c449abec95c1cd71b75dbfc2bea0aa7559716ce01d676da6.png

As you can see, the kernel density estimate does a fairly decent job of approximating the distribution of the data. Look here for a beautiful explanation of how KDEs are calculated.
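As a rough sketch of the idea, one common form of KDE places a small Gaussian bump on every data point and averages the bumps. The helper gaussian_kde_1d below is a hypothetical function written for this illustration, the data values are made up, and the bandwidth is fixed by hand, whereas seaborn chooses one automatically.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Average of Gaussian bumps centred on each data point (illustrative helper)."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every data point
    z = (grid[:, None] - data[None, :]) / bandwidth
    # Standard normal density evaluated at each distance
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Averaging the bumps and dividing by the bandwidth gives a valid density
    return kernels.sum(axis=1) / (len(data) * bandwidth)

# Hypothetical completion rates (not taken from the World Bank dataset)
data = np.array([62.0, 75.0, 88.0, 91.0, 95.0, 99.0, 101.0, 104.0])
grid = np.linspace(20, 150, 1301)
density = gaussian_kde_1d(data, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = np.sum(density) * (grid[1] - grid[0])
print(f"Area under the KDE: {area:.3f}")
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one gives a bumpier curve that follows individual points more closely.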

There are also multiple ways you can use KDEs to visualize 2D distributions of data. Some examples are shown below using seaborn's .jointplot() and .kdeplot(). We plot the primary school completion rate for males vs females across different countries in 2015.

sns.jointplot(data=wb_df, x='Primary completion rate: Female: % of relevant age group: 2015',
                y='Primary completion rate: Male: % of relevant age group: 2015'); 
plt.suptitle('Primary Schooling Completion Rate (2015)', y = 1.02) 
# suptitle titles the whole figure; y = 1.02 nudges it above the marginal plots
plt.xlabel('Females (%)')
plt.ylabel('Males (%)');
# No KDE here, just a scatter with histograms
../../_images/e657b5d96c1550931a3ddc514222ec386fbb28e0dec6eb7b736ac289781b51be.png
sns.jointplot(data=wb_df, x='Primary completion rate: Female: % of relevant age group: 2015',
                y='Primary completion rate: Male: % of relevant age group: 2015',kind = 'reg'); 
plt.suptitle('Primary Schooling Completion Rate (2015)', y = 1.02) 
plt.xlabel('Females (%)')
plt.ylabel('Males (%)');
# A KDE, histograms and a regression line here
../../_images/70e633fb027048e2aa56072b3c0f09d10ccb7357fd9fafcdafb125c5f5e8db42.png
sns.jointplot(data=wb_df, x='Primary completion rate: Female: % of relevant age group: 2015',
                y='Primary completion rate: Male: % of relevant age group: 2015',kind = 'kde'); 
plt.suptitle('Primary Schooling Completion Rate (2015)', y = 1.02) 
plt.xlabel('Females (%)')
plt.ylabel('Males (%)');
# Only KDEs here
../../_images/d061be1fe6967b41bd05a4dab38d7ccfbf1e393f6ca8b027b7b4a3d39d9a1d8b.png
sns.kdeplot(data=wb_df, x='Primary completion rate: Female: % of relevant age group: 2015',
                y='Primary completion rate: Male: % of relevant age group: 2015'); 
plt.suptitle('Primary Schooling Completion Rate (2015)', y = 1.02) 
plt.xlabel('Females (%)')
plt.ylabel('Males (%)');
# The same kde plot as with jointplot, but without fancy axes
../../_images/d96c8dddb5ce6aca32ecd93c0c06629ca86e89fac625bbfdceca140488fbdbac.png

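The two plots above are 2-dimensional KDE plots. Each contour line connects points of equal estimated density, much like elevation lines on a topographic map: the innermost contours enclose the regions where the data points are most concentrated, and the density falls off as you move outward. Closely spaced lines mean the density changes quickly in that region. We can also set the fill parameter to True to shade in our 2D KDE plot and get a density plot where the darker shades represent more data points.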

sns.jointplot(data=wb_df, x='Primary completion rate: Female: % of relevant age group: 2015',
                y='Primary completion rate: Male: % of relevant age group: 2015',kind = 'kde', 
              fill = True); 
plt.suptitle('Primary Schooling Completion Rate (2015)', y = 1.02) 
plt.xlabel('Females (%)')
plt.ylabel('Males (%)');
../../_images/7775bc2ac35d20ffcd2154db1a98050fcb0fb292f50c34759a3b435aa2331ab1.png
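To make the idea of density levels more concrete, the sketch below evaluates a simple 2D Gaussian KDE on a grid using only NumPy. The helper gaussian_kde_2d, the synthetic completion rates, and the single hand-picked bandwidth shared by both axes are assumptions for illustration. Seaborn's own estimate selects its bandwidth differently, so the numbers will not match its plots.

```python
import numpy as np

def gaussian_kde_2d(x, y, grid_x, grid_y, bandwidth):
    """Average of round 2D Gaussian bumps centred on each point (illustrative helper)."""
    gx, gy = np.meshgrid(grid_x, grid_y)  # rows follow grid_y, columns follow grid_x
    dx = (gx[..., None] - x) / bandwidth
    dy = (gy[..., None] - y) / bandwidth
    kernels = np.exp(-0.5 * (dx ** 2 + dy ** 2)) / (2 * np.pi * bandwidth ** 2)
    return kernels.mean(axis=-1)

rng = np.random.default_rng(1)

# Synthetic, correlated completion rates (hypothetical, not the World Bank data)
female = rng.normal(90, 10, size=150)
male = female + rng.normal(0, 4, size=150)

grid = np.arange(40, 141)  # 40, 41, ..., 140 on both axes
density = gaussian_kde_2d(female, male, grid, grid, bandwidth=5.0)

# Location of the highest estimated density
row, col = np.unravel_index(np.argmax(density), density.shape)
print(f"Peak density near female = {grid[col]}%, male = {grid[row]}%")

# The grid spacing is 1 on both axes, so the sum approximates the total mass
mass = density.sum()
print(f"Total mass on the grid: {mass:.3f}")
```

A contour plot draws lines at fixed values of this density array, which is why the innermost contours surround the peak.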

Geospatial Data

Geopandas and Plotly can be used to make some awesome geospatial maps from DataFrames. Read here to learn more about plotting with geopandas and read here to learn more about plotting with plotly's choropleth maps.