SQL

import pandas as pd
import numpy as np
import sqlalchemy
from pathlib import Path
%load_ext sql

This chapter will introduce Structured Query Language, commonly abbreviated as SQL (commonly pronounced “sequel,” though this is the subject of fierce debate), and go over some basic commands. SQL is a special programming language designed to communicate with databases.

It is important to note that SQL is an entirely different programming language to Python. However, Jupyter notebooks allow us to use magic commands to perform SQL queries. We will be using these magic commands for the purposes of this demonstration. However, if you ever need to access SQL databases at a company, you will likely have to use a relational database management sysmtem (RDMS) to access SQL databases instead.

Let us begin by importing in a dataset on California schools’ SAT performances and poverty data in the 2014-2015 academic year. We have gotten California School SAT Performance and Poverty Data from Stanford’s Public Affairs Data Journalism I, Fall 2016.

Connecting to a SQL Database in Jupyter¶

The cells below load in SQL cell magic and connect to our database. The important thing to understand is that our tmpdb Path specifies the path to our database, and then sqlite_conn and engine connect to the database. Finally, %sql engine allows us to use SQL magic commands to query our engine. After we have connected the engine, we can simply add %%sql at the top of a cell to convert it into a SQL cell!

# These lines specify the location of our database file
dbfile = 'cdeschools.sqlite'
tmpdb = Path('data') / dbfile

# Specify the database connection path (in this case, a sqlite database in a file)
sqlite_conn = 'sqlite:///' + str(tmpdb)

# Run this cell to connect to database
engine = sqlalchemy.create_engine(sqlite_conn)
connection = engine.connect()

%sql engine

%%sql
/* This is a SQL cell! This particular line is inside a SQL comment. 
The following line let us view all the tables inside our SQL database. */
SELECT * FROM sqlite_master WHERE type='table';

As you can see from the summary above, our database contains 3 tables schools, frpm (short for free or reduced price meals), and satscores. Each of these tables contains some important information for our analysis.

The column on the right tells you what SQL command was run to produce each table. Don’t worry about the specifics for now, but note that the column on the right includes all the column names and data types in the table. Read more about the various data types in SQL here.

Let us also quickly see the first 5 rows from each of the tables.

%%sql

SELECT * 
FROM schools 
LIMIT 5

%%sql

SELECT * 
FROM frpm 
LIMIT 5

%%sql

SELECT * 
FROM satscores 
LIMIT 5

Without further ado, let’s begin discussing SQL’s syntax!

Syntax¶

It is common practice to capitalize all keywords in SQL queries. We shall stick to this common practice throughout this section.

While pandas doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. You can read more about how Jupyter SQL cell magics handles quotes here. However, this may depend on the type of SQL you’re using - other types only use single quotes and no double quotes.

Furthermore, SQL queries have a certain order (i.e. they ought to start with the SELECT statement and end with the LIMIT clause if it exists). The order is as follows:

SELECT <columns>
FROM <table>
[INNER/LEFT/RIGHT/OUTER JOIN <table>]
[ON <columns> = <columns>]
[WHERE <condition>]
[GROUP BY <columns>]
[ORDER BY <columns>]
[LIMIT <rows>]

In the above order, the hard bracket []denote optional portions (note that everything except SELECT and FROM are optional). We have stuck to this order when making this chapter.

`SELECT`¶

Every single SQL query must include a SELECT statement with a FROM command. The SELECT statement tells SQL which columns you are interested in, and the FROM command tells SQL which tables those columns are in. To select all the columns from a table, we can use * in the select statement. For example, let us select all the columns from the table schools.

%%sql

SELECT * 
FROM schools

`AS`¶

We can rename columns using the AS keyword within the SELECT statement. For example, let us select just the school code column CDSCode from the table schools, and rename it to be County_District_Code.

%%sql

SELECT CDSCode AS County_District_Code
FROM schools

`CAST`¶

We can use the CAST function within the SELECT statement to convert the data types of columns if necessary. For example, we can see from the summary of all the tables that the District Code column in the frpm table is stored as an integer. However, since the district code is a qualitative ordinal variable rather than a quantative one (review the types of variables here), it would make more sense to store it as a varchar (review the types of data in SQL here). Let us make this change.

%%sql

SELECT CAST("District Code" AS varchar)
FROM frpm

If you like, you can rename this new column to be something easier to interpret.

`CASE`¶

The CASE expression helps us create new categorical columns based on other columns in SQL. The general syntax is:

CASE
    WHEN condition1 THEN output1
    WHEN condition2 THEN output2
    WHEN conditionN THEN outputN
    ELSE outputN+1
END AS colname

For example, let us generate a new column which classifies school sizes into small, medium or large based on the number of students they have enrolled in grade 12 (represented by the enroll12 column in the satscores table).

%%sql

SELECT sname, enroll12, CASE
        WHEN enroll12 <= 50 THEN "small"
        WHEN enroll12 > 2000 THEN "large"
        ELSE "medium"
    END AS school_size
FROM satscores

You will notice that the first 3 rows have no school name. That is because they correspond to state/county/district-wide data rather than data from a specific school. We can filter the dataset to only include schools using the WHERE clause.

`DISTINCT`¶

The DISTINCT keyword returns all the unique values in a column. For example, the code below returns all the unique values in the column rtype from the table satscores.

%%sql

SELECT DISTINCT(rtype)
FROM satscores

As an aside, the column rtype represents the type of region being covered by the row. S represents a single school, D represents a school district, C represents a county and X represents the entire state.

`JOIN`¶

Joins in SQL work very similarly to joins in pandas. The general syntax of a JOIN clause is as follows:

SELECT <columns>
FROM <table>
(INNER)/(LEFT)/(RIGHT)/(FULL OUTER) JOIN <table>
ON <columns> = <columns>

For example, let us inner join the schools and satscores tables on their shared CDS Code column.

%%sql

SELECT schools."County", schools."District", schools."School", 
    satscores."NumTstTakr", satscores."AvgScrRead", satscores."AvgScrMath", satscores."PctGE1500"
FROM schools
INNER JOIN satscores
ON schools."CDSCode" = satscores.cds

As you can see, when specifying columns in the query above, I specified which tables the columns came from. Technically this isn’t necessary if both tables don’t share the column name, but it is still good practice to do so.

`WHERE`¶

The WHERE clause helps us filter rows we’re interested in. For example, we can filter the satscores dataset to include all schools where over 50% of the school got over a 1500 on the SAT (as captured in the PctGE1500 column).

%%sql

SELECT *
FROM satscores
WHERE PctGE1500 >= 50

If we’re interested in filtering based on string data, it is recommended to use the LIKE operator. As a reminder, while pandas doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. When I find it easiest to just use double quotes when denoting (string) column names in Jupyter SQL cell magic, you can read more about the details here. For example, let us filter the dataset to include all the rows corresponding to a school (as indicated by the rtype column being S).

%%sql

SELECT *
FROM satscores
WHERE rtype LIKE "S"

When using the LIKE operator, the character % represents a wildcard which matches any string (similar to the .* pattern in RegEx). For example, let us say we’re interested in all high school names which end with the word ‘High’. As we’re only interested in high schools, we want to put filters on both the rtype and the sname columns. We extract those rows using the code below.

%%sql

SELECT *
FROM satscores
WHERE rtype LIKE "S" 
AND sname LIKE "%High"

% was just an example, read more about the different wildcards in SQL here.

`GROUP BY`¶

The SQL GROUP BY statement accomplishes the exact same task as the .groupby() in pandas (review the Pandas subchapter if necessary). To specify which column you want to group by, you simply add that into the query, following the syntax below:

SELECT <columns>
FROM <table>
GROUP BY <columns>

Aggregate¶

Specifying the aggregate function in SQL is not always intuitive for those who are familar with pandas. In SQL, you specify the aggregate function inside the SELECT statement. For example, let us say you are interested in the total number of test takers across all high schools in the different countries. In pandas, you could do:

satscores[satscores['rtype'] == 'S'].groupby('cname')['NumTstTakr'].sum()

However, in SQL, you must write the code below.

%%sql

SELECT cname, SUM(NumTstTakr)
FROM satscores
WHERE rtype = "S"
GROUP BY cname

An advantage of the SQL approach is that it’s quite easy to compute multiple aggregations at once.

%%sql

SELECT cname, SUM(NumTstTakr), AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY cname

SQL comes with multiple aggregate functions, including COUNT, MIN, MAX, SUM, AVG, etc. Feel free to google around for more examples of aggregate functions and creative ways to use the functions above!

`HAVING`¶

The HAVING clause in a SQL GROUP BY is equivalent to the filter() in a pandas groupby (review the pandas filter here if necessary). It goes directly after the GROUP BY statement. For example, suppose we want to identify school districts where the average percentage of students scoring above 1500 on the SAT is less than 10%. We could write the following SQL code to achieve this.

%%sql

SELECT dname, cname, AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY dname
HAVING AVG(PctGE1500) <= 10

Note how we easily performed both an aggregation and a filter in the code above. The ease of writing this is an advantage of SQL’s groupby approach.

`ORDER BY`¶

Orders the table by a given column, in ascending (using ASC) or descending (using DESC) order. For example, let us say we’re interested in which schools have the highest proportion of students scoring above a 1500 on the SAT. We could use the code below to accomplish this.

%%sql

SELECT *
FROM satscores
WHERE rtype = "S"
ORDER BY PctGE1500 DESC

`LIMIT`¶

Limits the number of rows to be returned. For example, the code below accomplishes the same thing as schools.head(5) in pandas.

%%sql

SELECT *
FROM schools
LIMIT 5

Note: SQL is designed for very fast and efficient use with extremely large databases. However, if you try to access all the rows at once, SQL won’t prevent them from being returned. This could crash your computer as you may not have enough space to accomodate such a large output. Please always LIMIT your SQL queries unless you are 100% sure the output will not be excessively large. We haven’t limited our queries in this chapter because we were 100% sure that we are working with small-ish datasets; otherwise we would definitely limit our queries.

Connecting to a SQL Database in Jupyter¶

Syntax¶

SELECT¶

AS¶

CAST¶

CASE¶

DISTINCT¶

JOIN¶

WHERE¶

GROUP BY¶