import pandas as pd
import numpy as np
import sqlalchemy
from pathlib import Path
%load_ext sql
SQL#
This chapter will introduce Structured Query Language, commonly abbreviated as SQL (commonly pronounced “sequel,” though this is the subject of fierce debate), and go over some basic commands. SQL is a special programming language designed to communicate with databases.
It is important to note that SQL is an entirely different programming language to Python. However, Jupyter notebooks allow us to use magic commands to perform SQL queries. We will be using these magic commands for the purposes of this demonstration. However, if you ever need to access SQL databases at a company, you will likely have to use a relational database management sysmtem (RDMS) to access SQL databases instead.
Let us begin by importing in a dataset on California schools’ SAT performances and poverty data in the 2014-2015 academic year. We have gotten California School SAT Performance and Poverty Data from Stanford’s Public Affairs Data Journalism I, Fall 2016.
Connecting to a SQL Database in Jupyter#
The cells below load in SQL cell magic and connect to our database. The important thing to understand is that our tmpdb
Path specifies the path to our database, and then sqlite_conn
and engine
connect to the database. Finally, %sql engine
allows us to use SQL magic commands to query our engine
. After we have connected the engine, we can simply add %%sql
at the top of a cell to convert it into a SQL cell!
# These lines specify the location of our database file
dbfile = 'cdeschools.sqlite'
tmpdb = Path('data') / dbfile
# Specify the database connection path (in this case, a sqlite database in a file)
sqlite_conn = 'sqlite:///' + str(tmpdb)
# Run this cell to connect to database
engine = sqlalchemy.create_engine(sqlite_conn)
connection = engine.connect()
%sql engine
%%sql
/* This is a SQL cell! This particular line is inside a SQL comment.
The following line let us view all the tables inside our SQL database. */
SELECT * FROM sqlite_master WHERE type='table';
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | schools | schools | 2 | CREATE TABLE schools ( "CDSCode" VARCHAR(14) NOT NULL, "NCESDist" VARCHAR(7), "NCESSchool" VARCHAR(5), "StatusType" VARCHAR(7) NOT NULL, "County" VARCHAR(15) NOT NULL, "District" VARCHAR(74) NOT NULL, "School" VARCHAR(89), "Street" VARCHAR(62), "StreetAbr" VARCHAR(59), "City" VARCHAR(25), "Zip" VARCHAR(10), "State" VARCHAR(4), "MailStreet" VARCHAR(70), "MailStrAbr" VARCHAR(70), "MailCity" VARCHAR(22), "MailZip" VARCHAR(10), "MailState" VARCHAR(4), "Phone" VARCHAR(14), "Ext" VARCHAR(6), "Website" VARCHAR(96), "OpenDate" DATE, "ClosedDate" DATE, "Charter" BOOLEAN, "CharterNum" VARCHAR(4), "FundingType" VARCHAR(23), "DOC" VARCHAR(2) NOT NULL, "DOCType" VARCHAR(42) NOT NULL, "SOC" VARCHAR(4), "SOCType" VARCHAR(41), "EdOpsCode" VARCHAR(7), "EdOpsName" VARCHAR(43), "EILCode" VARCHAR(8), "EILName" VARCHAR(31), "GSoffered" VARCHAR(16), "GSserved" VARCHAR(5), "Virtual" VARCHAR(4), "Magnet" BOOLEAN, "Latitude" FLOAT, "Longitude" FLOAT, "AdmFName1" VARCHAR(17), "AdmLName1" VARCHAR(22), "AdmEmail1" VARCHAR(49), "AdmFName2" VARCHAR(15), "AdmLName2" VARCHAR(18), "AdmEmail2" VARCHAR(40), "AdmFName3" VARCHAR(8), "AdmLName3" VARCHAR(15), "AdmEmail3" VARCHAR(39), "LastUpdate" DATE NOT NULL, CHECK ("Charter" IN (0, 1)), CHECK ("Magnet" IN (0, 1)) ) |
table | frpm | frpm | 9041 | CREATE TABLE frpm ( "Academic Year" VARCHAR(9), "County Code" VARCHAR(4), "District Code" INTEGER, "School Code" VARCHAR(7), "County Name" VARCHAR(15), "District Name" VARCHAR(75), "School Name" VARCHAR(85), "District Type" VARCHAR(32), "School Type" VARCHAR(41), "Educational Option Type" VARCHAR(43), "NSLP Provision Status" VARCHAR(24), "Charter School (Y/N)" BOOLEAN, "Charter School Number" VARCHAR(4), "Charter Funding Type" VARCHAR(23), "IRC" BOOLEAN, "Low Grade" VARCHAR(5), "High Grade" VARCHAR(14), "Enrollment (K-12)" FLOAT, "Free Meal Count (K-12)" FLOAT, "Percent (%) Eligible Free (K-12)" FLOAT, "FRPM Count (K-12)" FLOAT, "Percent (%) Eligible FRPM (K-12)" FLOAT, "Enrollment (Ages 5-17)" FLOAT, "Free Meal Count (Ages 5-17)" FLOAT, "Percent (%) Eligible Free (Ages 5-17)" FLOAT, "FRPM Count (Ages 5-17)" FLOAT, "Percent (%) Eligible FRPM (Ages 5-17)" FLOAT, "2013-14 CALPADS Fall 1 Certification Status" BOOLEAN, CHECK ("Charter School (Y/N)" IN (0, 1)), CHECK ("IRC" IN (0, 1)), CHECK ("2013-14 CALPADS Fall 1 Certification Status" IN (0, 1)) ) |
table | satscores | satscores | 8814 | CREATE TABLE satscores ( cds VARCHAR(14) NOT NULL, rtype VARCHAR(1) NOT NULL, sname VARCHAR(50), dname VARCHAR(50), cname VARCHAR(15), enroll12 INTEGER NOT NULL, "NumTstTakr" INTEGER NOT NULL, "AvgScrRead" INTEGER, "AvgScrMath" INTEGER, "AvgScrWrite" INTEGER, "NumGE1500" INTEGER, "PctGE1500" FLOAT ) |
As you can see from the summary above, our database contains 3 tables schools
, frpm
(short for free or reduced price meals), and satscores
. Each of these tables contains some important information for our analysis.
The column on the right tells you what SQL command was run to produce each table. Don’t worry about the specifics for now, but note that the column on the right includes all the column names and data types in the table. Read more about the various data types in SQL here.
Let us also quickly see the first 5 rows from each of the tables.
%%sql
SELECT *
FROM schools
LIMIT 5
CDSCode | NCESDist | NCESSchool | StatusType | County | District | School | Street | StreetAbr | City | Zip | State | MailStreet | MailStrAbr | MailCity | MailZip | MailState | Phone | Ext | Website | OpenDate | ClosedDate | Charter | CharterNum | FundingType | DOC | DOCType | SOC | SOCType | EdOpsCode | EdOpsName | EILCode | EILName | GSoffered | GSserved | Virtual | Magnet | Latitude | Longitude | AdmFName1 | AdmLName1 | AdmEmail1 | AdmFName2 | AdmLName2 | AdmEmail2 | AdmFName3 | AdmLName3 | AdmEmail3 | LastUpdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01100170000000 | 0691051 | None | Active | Alameda | Alameda County Office of Education | None | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | (510) 887-0152 | None | www.acoe.org | None | None | None | None | None | 00 | County Office of Education (COE) | None | None | None | None | None | None | None | None | None | None | 37.658212 | -122.09713 | L Karen | Monroe | lkmonroe@acoe.org | None | None | None | None | None | None | 2015-06-23 |
01100170109835 | 0691051 | 10546 | Closed | Alameda | Alameda County Office of Education | FAME Public Charter | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | None | None | None | 2005-08-29 | 2015-07-31 | 1 | 0728 | Directly funded | 00 | County Office of Education (COE) | 65 | K-12 Schools (Public) | TRAD | Traditional | ELEMHIGH | Elementary-High Combination | K-12 | K-12 | P | 0 | 37.521436 | -121.99391 | None | None | None | None | None | None | None | None | None | 2015-09-01 |
01100170112607 | 0691051 | 10947 | Active | Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | 1515 Webster Street | 1515 Webster St. | Oakland | 94612-3355 | CA | 1515 Webster Street | 1515 Webster St. | Oakland | 94612 | CA | (510) 596-8901 | None | www.envisionacademy.org/ | 2006-08-28 | None | 1 | 0811 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.80452 | -122.26815 | Laura | Robell | laura@envisionacademy.org | None | None | None | None | None | None | 2015-06-18 |
01100170118489 | 0691051 | 12283 | Closed | Alameda | Alameda County Office of Education | Aspire California College Preparatory Academy | 2125 Jefferson Avenue | 2125 Jefferson Ave. | Berkeley | 94703-1414 | CA | 1001 22nd Avenue, Suite 100 | 1001 22nd Ave., Ste. 100 | Oakland | 94606 | CA | None | None | www.aspirepublicschools.org | 2008-08-21 | 2015-06-30 | 1 | 1049 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.868991 | -122.27844 | None | None | None | None | None | None | None | None | None | 2015-07-01 |
01100170123968 | 0691051 | 12844 | Active | Alameda | Alameda County Office of Education | Community School for Creative Education | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | (510) 686-4131 | None | www.communityschoolforcreativeeducation.org | 2011-08-22 | None | 1 | 1284 | Directly funded | 00 | County Office of Education (COE) | 60 | Elementary Schools (Public) | TRAD | Traditional | ELEM | Elementary | K-8 | K-7 | N | 0 | 37.784648 | -122.23863 | Clifford | Thompson | cliffordt@communityschoolforcreativeeducation.org | None | None | None | None | None | None | 2016-07-18 |
%%sql
SELECT *
FROM frpm
LIMIT 5
Academic Year | County Code | District Code | School Code | County Name | District Name | School Name | District Type | School Type | Educational Option Type | NSLP Provision Status | Charter School (Y/N) | Charter School Number | Charter Funding Type | IRC | Low Grade | High Grade | Enrollment (K-12) | Free Meal Count (K-12) | Percent (%) Eligible Free (K-12) | FRPM Count (K-12) | Percent (%) Eligible FRPM (K-12) | Enrollment (Ages 5-17) | Free Meal Count (Ages 5-17) | Percent (%) Eligible Free (Ages 5-17) | FRPM Count (Ages 5-17) | Percent (%) Eligible FRPM (Ages 5-17) | 2013-14 CALPADS Fall 1 Certification Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-2015 | 01 | 10017 | 0109835 | Alameda | Alameda County Office of Education | FAME Public Charter | County Office of Education (COE) | K-12 Schools (Public) | Traditional | None | 1 | 0728 | Directly funded | 1 | K | 12 | 1087.0 | 565.0 | 0.5197792088316467 | 715.0 | 0.6577736890524379 | 1070.0 | 553.0 | 0.516822429906542 | 702.0 | 0.6560747663551402 | 1 |
2014-2015 | 01 | 10017 | 0112607 | Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | County Office of Education (COE) | High Schools (Public) | Traditional | None | 1 | 0811 | Directly funded | 1 | 9 | 12 | 395.0 | 186.0 | 0.4708860759493671 | 186.0 | 0.4708860759493671 | 376.0 | 182.0 | 0.48404255319148937 | 182.0 | 0.48404255319148937 | 1 |
2014-2015 | 01 | 10017 | 0118489 | Alameda | Alameda County Office of Education | Aspire California College Preparatory Academy | County Office of Education (COE) | High Schools (Public) | Traditional | None | 1 | 1049 | Directly funded | 1 | 9 | 12 | 244.0 | 134.0 | 0.5491803278688525 | 175.0 | 0.7172131147540983 | 230.0 | 128.0 | 0.5565217391304348 | 168.0 | 0.7304347826086957 | 1 |
2014-2015 | 01 | 10017 | 0123968 | Alameda | Alameda County Office of Education | Community School for Creative Education | County Office of Education (COE) | Elementary Schools (Public) | Traditional | Breakfast Provision 2 | 1 | 1284 | Directly funded | 1 | K | 8 | 191.0 | 113.0 | 0.5916230366492147 | 139.0 | 0.7277486910994765 | 190.0 | 113.0 | 0.5947368421052631 | 139.0 | 0.7315789473684211 | 1 |
2014-2015 | 01 | 10017 | 0124172 | Alameda | Alameda County Office of Education | Yu Ming Charter | County Office of Education (COE) | Elementary Schools (Public) | Traditional | None | 1 | 1296 | Directly funded | 1 | K | 8 | 257.0 | 14.0 | 0.054474708171206226 | 21.0 | 0.08171206225680934 | 257.0 | 14.0 | 0.054474708171206226 | 21.0 | 0.08171206225680934 | 1 |
%%sql
SELECT *
FROM satscores
LIMIT 5
cds | rtype | sname | dname | cname | enroll12 | NumTstTakr | AvgScrRead | AvgScrMath | AvgScrWrite | NumGE1500 | PctGE1500 |
---|---|---|---|---|---|---|---|---|---|---|---|
00000000000000 | X | None | None | None | 496901 | 210706 | 489 | 500 | 484 | 93334 | 44.3 |
01000000000000 | C | None | None | Alameda | 16978 | 8855 | 516 | 536 | 517 | 4900 | 55.34 |
01100170000000 | D | None | Alameda County Office of Education | Alameda | 398 | 88 | 418 | 418 | 417 | 14 | 15.91 |
01100170109835 | S | FAME Public Charter | Alameda County Office of Education | Alameda | 62 | 17 | 503 | 546 | 505 | 9 | 52.94 |
01100170112607 | S | Envision Academy for Arts & Technology | Alameda County Office of Education | Alameda | 75 | 71 | 397 | 387 | 395 | 5 | 7.04 |
Without further ado, let’s begin discussing SQL’s syntax!
Syntax#
It is common practice to capitalize all keywords in SQL queries. We shall stick to this common practice throughout this section.
While pandas
doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. You can read more about how Jupyter SQL cell magics handles quotes here. However, this may depend on the type of SQL you’re using - other types only use single quotes and no double quotes.
Furthermore, SQL queries have a certain order (i.e. they ought to start with the SELECT
statement and end with the LIMIT
clause if it exists). The order is as follows:
SELECT <columns>
FROM <table>
[INNER/LEFT/RIGHT/OUTER JOIN <table>]
[ON <columns> = <columns>]
[WHERE <condition>]
[GROUP BY <columns>]
[ORDER BY <columns>]
[LIMIT <rows>]
In the above order, the hard bracket []
denote optional portions (note that everything except SELECT
and FROM
are optional). We have stuck to this order when making this chapter.
SELECT
#
Every single SQL query must include a SELECT
statement with a FROM
command. The SELECT
statement tells SQL which columns you are interested in, and the FROM
command tells SQL which tables those columns are in. To select all the columns from a table, we can use *
in the select statement. For example, let us select all the columns from the table schools
.
%%sql
SELECT *
FROM schools
CDSCode | NCESDist | NCESSchool | StatusType | County | District | School | Street | StreetAbr | City | Zip | State | MailStreet | MailStrAbr | MailCity | MailZip | MailState | Phone | Ext | Website | OpenDate | ClosedDate | Charter | CharterNum | FundingType | DOC | DOCType | SOC | SOCType | EdOpsCode | EdOpsName | EILCode | EILName | GSoffered | GSserved | Virtual | Magnet | Latitude | Longitude | AdmFName1 | AdmLName1 | AdmEmail1 | AdmFName2 | AdmLName2 | AdmEmail2 | AdmFName3 | AdmLName3 | AdmEmail3 | LastUpdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01100170000000 | 0691051 | None | Active | Alameda | Alameda County Office of Education | None | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | (510) 887-0152 | None | www.acoe.org | None | None | None | None | None | 00 | County Office of Education (COE) | None | None | None | None | None | None | None | None | None | None | 37.658212 | -122.09713 | L Karen | Monroe | lkmonroe@acoe.org | None | None | None | None | None | None | 2015-06-23 |
01100170109835 | 0691051 | 10546 | Closed | Alameda | Alameda County Office of Education | FAME Public Charter | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | None | None | None | 2005-08-29 | 2015-07-31 | 1 | 0728 | Directly funded | 00 | County Office of Education (COE) | 65 | K-12 Schools (Public) | TRAD | Traditional | ELEMHIGH | Elementary-High Combination | K-12 | K-12 | P | 0 | 37.521436 | -121.99391 | None | None | None | None | None | None | None | None | None | 2015-09-01 |
01100170112607 | 0691051 | 10947 | Active | Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | 1515 Webster Street | 1515 Webster St. | Oakland | 94612-3355 | CA | 1515 Webster Street | 1515 Webster St. | Oakland | 94612 | CA | (510) 596-8901 | None | www.envisionacademy.org/ | 2006-08-28 | None | 1 | 0811 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.80452 | -122.26815 | Laura | Robell | laura@envisionacademy.org | None | None | None | None | None | None | 2015-06-18 |
01100170118489 | 0691051 | 12283 | Closed | Alameda | Alameda County Office of Education | Aspire California College Preparatory Academy | 2125 Jefferson Avenue | 2125 Jefferson Ave. | Berkeley | 94703-1414 | CA | 1001 22nd Avenue, Suite 100 | 1001 22nd Ave., Ste. 100 | Oakland | 94606 | CA | None | None | www.aspirepublicschools.org | 2008-08-21 | 2015-06-30 | 1 | 1049 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.868991 | -122.27844 | None | None | None | None | None | None | None | None | None | 2015-07-01 |
01100170123968 | 0691051 | 12844 | Active | Alameda | Alameda County Office of Education | Community School for Creative Education | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | (510) 686-4131 | None | www.communityschoolforcreativeeducation.org | 2011-08-22 | None | 1 | 1284 | Directly funded | 00 | County Office of Education (COE) | 60 | Elementary Schools (Public) | TRAD | Traditional | ELEM | Elementary | K-8 | K-7 | N | 0 | 37.784648 | -122.23863 | Clifford | Thompson | cliffordt@communityschoolforcreativeeducation.org | None | None | None | None | None | None | 2016-07-18 |
01100170124172 | 0691051 | 12901 | Active | Alameda | Alameda County Office of Education | Yu Ming Charter | 1086 Alcatraz Avenue | 1086 Alcatraz Ave. | Oakland | 94608-1265 | CA | 1086 Alcatraz Avenue | 1086 Alcatraz Ave. | Oakland | 94608-1265 | CA | (510) 452-2063 | None | www.yumingschool.org | 2011-08-09 | None | 1 | 1296 | Directly funded | 00 | County Office of Education (COE) | 60 | Elementary Schools (Public) | TRAD | Traditional | ELEM | Elementary | K-8 | K-5 | N | 0 | 37.847375 | -122.28356 | Sue | Park | spark@yumingschool.org | None | None | None | None | None | None | 2016-05-10 |
01100170125567 | 0691051 | 13008 | Active | Alameda | Alameda County Office of Education | Urban Montessori Charter | 5328 Brann Street | 5328 Brann St. | Oakland | 94619-3312 | CA | 5328 Brann Street | 5328 Brann St. | Oakland | 94619-3312 | CA | (510) 842-1181 | None | www.urbanmontessori.org | 2012-08-27 | None | 1 | 1383 | Directly funded | 00 | County Office of Education (COE) | 60 | Elementary Schools (Public) | TRAD | Traditional | ELEM | Elementary | K-8 | K-5 | N | 0 | 37.778352 | -122.1895 | David | Castillo | davidc@urbanmontessori.org | None | None | None | None | None | None | 2015-10-09 |
01100170130302 | 0691051 | None | Closed | Alameda | Alameda County Office of Education | Technical, Agricultural & Nat. | 597 C Street | 597 C St. | Hayward | 94541 | CA | 597 C Street | 597 C St. | Hayward | 94541 | CA | None | None | None | 1980-07-01 | 1989-06-30 | 0 | None | None | 00 | County Office of Education (COE) | 10 | County Community | None | None | UG | Ungraded | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | 1999-06-24 |
01100170130401 | 0691051 | 09264 | Active | Alameda | Alameda County Office of Education | Alameda County Juvenile Hall/Court | 2500 Fairmont Avenue | 2500 Fairmont Ave. | San Leandro | 94578-1005 | CA | 2500 Fairmont Avenue | 2500 Fairmont Ave. | San Leandro | 94578-1005 | CA | (510) 667-7435 | None | www.acoe.org | 1980-07-01 | None | 0 | None | None | 00 | County Office of Education (COE) | 14 | Juvenile Court Schools | JUV | Juvenile Court School | HS | High School | K-12 | 8-12 | N | 0 | 37.712878 | -122.11173 | Michael | Hermosillo | mhermosillo@acoe.org | None | None | None | None | None | None | 2016-05-10 |
01100170130419 | 0691051 | 06830 | Active | Alameda | Alameda County Office of Education | Alameda County Community | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | (510) 670-4590 | None | www.acoe.org | 1980-07-01 | None | 0 | None | None | 00 | County Office of Education (COE) | 10 | County Community | COMM | County Community School | HS | High School | K-12 | 7-12 | P | 0 | 37.658212 | -122.09713 | Carolyn | Hobbs | chobbs@acoe.org | None | None | None | None | None | None | 2016-10-06 |
AS
#
We can rename columns using the AS
keyword within the SELECT
statement. For example, let us select just the school code column CDSCode
from the table schools
, and rename it to be County_District_Code
.
%%sql
SELECT CDSCode AS County_District_Code
FROM schools
County_District_Code |
---|
01100170000000 |
01100170109835 |
01100170112607 |
01100170118489 |
01100170123968 |
01100170124172 |
01100170125567 |
01100170130302 |
01100170130401 |
01100170130419 |
CAST
#
We can use the CAST
function within the SELECT
statement to convert the data types of columns if necessary. For example, we can see from the summary of all the tables that the District Code
column in the frpm
table is stored as an integer. However, since the district code is a qualitative ordinal variable rather than a quantative one (review the types of variables here), it would make more sense to store it as a varchar (review the types of data in SQL here). Let us make this change.
%%sql
SELECT CAST("District Code" AS varchar)
FROM frpm
CAST("District Code" AS varchar) |
---|
None |
None |
10017 |
10017 |
10017 |
10017 |
10017 |
10017 |
10017 |
10017 |
If you like, you can rename this new column to be something easier to interpret.
CASE
#
The CASE
expression helps us create new categorical columns based on other columns in SQL. The general syntax is:
CASE
WHEN condition1 THEN output1
WHEN condition2 THEN output2
WHEN conditionN THEN outputN
ELSE outputN+1
END AS colname
For example, let us generate a new column which classifies school sizes into small, medium or large based on the number of students they have enrolled in grade 12 (represented by the enroll12
column in the satscores
table).
%%sql
SELECT sname, enroll12, CASE
WHEN enroll12 <= 50 THEN "small"
WHEN enroll12 > 2000 THEN "large"
ELSE "medium"
END AS school_size
FROM satscores
sname | enroll12 | school_size |
---|---|---|
None | 496901 | large |
None | 16978 | large |
None | 398 | medium |
FAME Public Charter | 62 | medium |
Envision Academy for Arts & Technology | 75 | medium |
Aspire California College Preparatory Academy | 61 | medium |
None | 922 | medium |
Alameda Science and Technology Institute | 36 | small |
Nea Community Learning Center | 12 | small |
Alameda High | 465 | medium |
You will notice that the first 3 rows have no school name. That is because they correspond to state/county/district-wide data rather than data from a specific school. We can filter the dataset to only include schools using the WHERE
clause.
DISTINCT
#
The DISTINCT
keyword returns all the unique values in a column. For example, the code below returns all the unique values in the column rtype
from the table satscores
.
%%sql
SELECT DISTINCT(rtype)
FROM satscores
rtype |
---|
X |
C |
D |
S |
As an aside, the column rtype
represents the type of region being covered by the row. S
represents a single school, D
represents a school district, C
represents a county and X
represents the entire state.
JOIN
#
Joins in SQL work very similarly to joins in pandas
. The general syntax of a JOIN
clause is as follows:
SELECT <columns>
FROM <table>
(INNER)/(LEFT)/(RIGHT)/(FULL OUTER) JOIN <table>
ON <columns> = <columns>
For example, let us inner join the schools
and satscores
tables on their shared CDS Code column.
%%sql
SELECT schools."County", schools."District", schools."School",
satscores."NumTstTakr", satscores."AvgScrRead", satscores."AvgScrMath", satscores."PctGE1500"
FROM schools
INNER JOIN satscores
ON schools."CDSCode" = satscores.cds
County | District | School | NumTstTakr | AvgScrRead | AvgScrMath | PctGE1500 |
---|---|---|---|---|---|---|
Alameda | Alameda County Office of Education | None | 88 | 418 | 418 | 15.91 |
Alameda | Alameda County Office of Education | FAME Public Charter | 17 | 503 | 546 | 52.94 |
Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | 71 | 397 | 387 | 7.04 |
Alameda | Alameda County Office of Education | Aspire California College Preparatory Academy | 0 | None | None | None |
Alameda | Alameda Unified | None | 544 | 521 | 546 | 61.21 |
Alameda | Alameda Unified | Alameda Science and Technology Institute | 36 | 562 | 590 | 80.56 |
Alameda | Alameda Unified | Nea Community Learning Center | 6 | None | None | None |
Alameda | Alameda Unified | Alameda High | 325 | 543 | 573 | 70.46 |
Alameda | Alameda Unified | Alameda Community Learning Center | 20 | 575 | 593 | 80.0 |
Alameda | Alameda Unified | Alternatives in Action | 23 | 352 | 350 | 4.35 |
As you can see, when specifying columns in the query above, I specified which tables the columns came from. Technically this isn’t necessary if both tables don’t share the column name, but it is still good practice to do so.
WHERE
#
The WHERE
clause helps us filter rows we’re interested in. For example, we can filter the satscores
dataset to include all schools where over 50% of the school got over a 1500 on the SAT (as captured in the PctGE1500
column).
%%sql
SELECT *
FROM satscores
WHERE PctGE1500 >= 50
cds | rtype | sname | dname | cname | enroll12 | NumTstTakr | AvgScrRead | AvgScrMath | AvgScrWrite | NumGE1500 | PctGE1500 |
---|---|---|---|---|---|---|---|---|---|---|---|
01000000000000 | C | None | None | Alameda | 16978 | 8855 | 516 | 536 | 517 | 4900 | 55.34 |
01100170109835 | S | FAME Public Charter | Alameda County Office of Education | Alameda | 62 | 17 | 503 | 546 | 505 | 9 | 52.94 |
01611190000000 | D | None | Alameda Unified | Alameda | 922 | 544 | 521 | 546 | 519 | 333 | 61.21 |
01611190106401 | S | Alameda Science and Technology Institute | Alameda Unified | Alameda | 36 | 36 | 562 | 590 | 555 | 29 | 80.56 |
01611190130229 | S | Alameda High | Alameda Unified | Alameda | 465 | 325 | 543 | 573 | 543 | 229 | 70.46 |
01611190130609 | S | Alameda Community Learning Center | Alameda Unified | Alameda | 26 | 20 | 575 | 593 | 581 | 16 | 80.0 |
01611270000000 | D | None | Albany City Unified | Alameda | 289 | 178 | 572 | 599 | 574 | 136 | 76.4 |
01611270130450 | S | Albany High | Albany City Unified | Alameda | 273 | 178 | 572 | 599 | 574 | 136 | 76.4 |
01611430000000 | D | None | Berkeley Unified | Alameda | 885 | 490 | 566 | 562 | 559 | 337 | 68.78 |
01611430131177 | S | Berkeley High | Berkeley Unified | Alameda | 741 | 450 | 583 | 580 | 574 | 335 | 74.44 |
If we’re interested in filtering based on string data, it is recommended to use the LIKE
operator. As a reminder, while pandas
doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. When I find it easiest to just use double quotes when denoting (string) column names in Jupyter SQL cell magic, you can read more about the details here. For example, let us filter the dataset to include all the rows corresponding to a school (as indicated by the rtype
column being S
).
%%sql
SELECT *
FROM satscores
WHERE rtype LIKE "S"
cds | rtype | sname | dname | cname | enroll12 | NumTstTakr | AvgScrRead | AvgScrMath | AvgScrWrite | NumGE1500 | PctGE1500 |
---|---|---|---|---|---|---|---|---|---|---|---|
01100170109835 | S | FAME Public Charter | Alameda County Office of Education | Alameda | 62 | 17 | 503 | 546 | 505 | 9 | 52.94 |
01100170112607 | S | Envision Academy for Arts & Technology | Alameda County Office of Education | Alameda | 75 | 71 | 397 | 387 | 395 | 5 | 7.04 |
01100170118489 | S | Aspire California College Preparatory Academy | Alameda County Office of Education | Alameda | 61 | 0 | None | None | None | None | None |
01611190106401 | S | Alameda Science and Technology Institute | Alameda Unified | Alameda | 36 | 36 | 562 | 590 | 555 | 29 | 80.56 |
01611190119222 | S | Nea Community Learning Center | Alameda Unified | Alameda | 12 | 6 | None | None | None | None | None |
01611190130229 | S | Alameda High | Alameda Unified | Alameda | 465 | 325 | 543 | 573 | 543 | 229 | 70.46 |
01611190130609 | S | Alameda Community Learning Center | Alameda Unified | Alameda | 26 | 20 | 575 | 593 | 581 | 16 | 80.0 |
01611190130625 | S | Alternatives in Action | Alameda Unified | Alameda | 41 | 23 | 352 | 350 | 343 | 1 | 4.35 |
01611190132878 | S | Encinal High | Alameda Unified | Alameda | 238 | 132 | 483 | 504 | 476 | 56 | 42.42 |
01611190134304 | S | Island High (Continuation) | Alameda Unified | Alameda | 100 | 2 | None | None | None | None | None |
When using the LIKE
operator, the character %
represents a wildcard which matches any string (similar to the .*
pattern in RegEx). For example, let us say we’re interested in all high school names which end with the word ‘High’. As we’re only interested in high schools, we want to put filters on both the rtype
and the sname
columns. We extract those rows using the code below.
%%sql
SELECT *
FROM satscores
WHERE rtype LIKE "S"
AND sname LIKE "%High"
cds | rtype | sname | dname | cname | enroll12 | NumTstTakr | AvgScrRead | AvgScrMath | AvgScrWrite | NumGE1500 | PctGE1500 |
---|---|---|---|---|---|---|---|---|---|---|---|
01611190130229 | S | Alameda High | Alameda Unified | Alameda | 465 | 325 | 543 | 573 | 543 | 229 | 70.46 |
01611190132878 | S | Encinal High | Alameda Unified | Alameda | 238 | 132 | 483 | 504 | 476 | 56 | 42.42 |
01611270130450 | S | Albany High | Albany City Unified | Alameda | 273 | 178 | 572 | 599 | 574 | 136 | 76.4 |
01611430122697 | S | REALM Charter High | Berkeley Unified | Alameda | 76 | 36 | 384 | 360 | 386 | 2 | 5.56 |
01611430131177 | S | Berkeley High | Berkeley Unified | Alameda | 741 | 450 | 583 | 580 | 574 | 335 | 74.44 |
01611500132225 | S | Castro Valley High | Castro Valley Unified | Alameda | 688 | 438 | 527 | 556 | 524 | 273 | 62.33 |
01611760130062 | S | American High | Fremont Unified | Alameda | 457 | 309 | 573 | 608 | 592 | 243 | 78.64 |
01611760134270 | S | Irvington High | Fremont Unified | Alameda | 500 | 364 | 590 | 634 | 606 | 309 | 84.89 |
01611760134452 | S | John F. Kennedy High | Fremont Unified | Alameda | 368 | 179 | 485 | 509 | 485 | 77 | 43.02 |
01611760135244 | S | Mission San Jose High | Fremont Unified | Alameda | 518 | 453 | 653 | 699 | 671 | 421 | 92.94 |
%
was just an example, read more about the different wildcards in SQL here.
GROUP BY
#
The SQL GROUP BY
statement accomplishes the exact same task as the .groupby()
in pandas
(review the Pandas
subchapter if necessary). To specify which column you want to group by, you simply add that into the query, following the syntax below:
SELECT <columns>
FROM <table>
GROUP BY <columns>
Aggregate#
Specifying the aggregate function in SQL is not always intuitive for those who are familar with pandas
. In SQL, you specify the aggregate function inside the SELECT
statement. For example, let us say you are interested in the total number of test takers across all high schools in the different countries. In pandas
, you could do:
satscores[satscores['rtype'] == 'S'].groupby('cname')['NumTstTakr'].sum()
However, in SQL
, you must write the code below.
%%sql
SELECT cname, SUM(NumTstTakr)
FROM satscores
WHERE rtype = "S"
GROUP BY cname
cname | SUM(NumTstTakr) |
---|---|
Alameda | 8855 |
Amador | 97 |
Butte | 636 |
Calaveras | 125 |
Colusa | 104 |
Contra Costa | 7027 |
Del Norte | 81 |
El Dorado | 938 |
Fresno | 5907 |
Glenn | 107 |
An advantage of the SQL approach is that it’s quite easy to compute multiple aggregations at once.
%%sql
SELECT cname, SUM(NumTstTakr), AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY cname
cname | SUM(NumTstTakr) | AVG(PctGE1500) |
---|---|---|
Alameda | 8855 | 40.14071428571429 |
Amador | 97 | 59.925000000000004 |
Butte | 636 | 51.753 |
Calaveras | 125 | 62.885000000000005 |
Colusa | 104 | 30.33 |
Contra Costa | 7027 | 43.969722222222224 |
Del Norte | 81 | 45.45 |
El Dorado | 938 | 66.09428571428572 |
Fresno | 5907 | 25.784848484848485 |
Glenn | 107 | 33.626666666666665 |
SQL comes with multiple aggregate functions, including COUNT
, MIN
, MAX
, SUM
, AVG
, etc. Feel free to google around for more examples of aggregate functions and creative ways to use the functions above!
HAVING
#
The HAVING
clause in a SQL GROUP BY
is equivalent to the filter()
in a pandas
groupby (review the pandas
filter here if necessary). It goes directly after the GROUP BY
statement. For example, suppose we want to identify school districts where the average percentage of students scoring above 1500 on the SAT is less than 10%. We could write the following SQL code to achieve this.
%%sql
SELECT dname, cname, AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY dname
HAVING AVG(PctGE1500) <= 10
dname | cname | AVG(PctGE1500) |
---|---|---|
Calistoga Joint Unified | Napa | 7.69 |
Compton Unified | Los Angeles | 7.016666666666667 |
Delhi Unified | Merced | 9.09 |
Farmersville Unified | Tulare | 7.59 |
Firebaugh-Las Deltas Unified | Fresno | 4.76 |
Golden Plains Unified | Fresno | 7.27 |
Le Grand Union High | Merced | 8.47 |
Lucerne Valley Unified | San Bernardino | 0.0 |
Lynwood Unified | Los Angeles | 9.934999999999999 |
Mendota Unified | Fresno | 6.45 |
Note how we easily performed both an aggregation and a filter in the code above. The ease of writing this is an advantage of SQL’s groupby approach.
ORDER BY
#
Orders the table by a given column, in ascending (using ASC
) or descending (using DESC
) order. For example, let us say we’re interested in which schools have the highest proportion of students scoring above a 1500 on the SAT. We could use the code below to accomplish this.
%%sql
SELECT *
FROM satscores
WHERE rtype = "S"
ORDER BY PctGE1500 DESC
cds | rtype | sname | dname | cname | enroll12 | NumTstTakr | AvgScrRead | AvgScrMath | AvgScrWrite | NumGE1500 | PctGE1500 |
---|---|---|---|---|---|---|---|---|---|---|---|
30664313030616 | S | Oxford Academy | Anaheim Union High | Orange | 176 | 167 | 627 | 669 | 634 | 164 | 98.2 |
19642121931880 | S | Whitney (Gretchen) High | ABC Unified | Los Angeles | 164 | 154 | 639 | 687 | 661 | 151 | 98.05 |
43694684334421 | S | Lynbrook High | Fremont Union High | Santa Clara | 472 | 398 | 639 | 698 | 664 | 382 | 95.98 |
43694684334462 | S | Monta Vista High | Fremont Union High | Santa Clara | 608 | 513 | 638 | 691 | 657 | 487 | 94.93 |
43695344337762 | S | Saratoga High | Los Gatos-Saratoga Joint Union High | Santa Clara | 362 | 319 | 639 | 674 | 650 | 302 | 94.67 |
43696414332904 | S | Henry M. Gunn High | Palo Alto Unified | Santa Clara | 447 | 330 | 642 | 686 | 646 | 310 | 93.94 |
44104474430252 | S | Pacific Collegiate Charter | Santa Cruz County Office of Education | Santa Cruz | 71 | 66 | 638 | 638 | 630 | 62 | 93.94 |
19642611995299 | S | Rancho Learning Center (Alternative) | Arcadia Unified | Los Angeles | 38 | 32 | 605 | 641 | 601 | 30 | 93.75 |
07616300734244 | S | Miramonte High | Acalanes Union High | Contra Costa | 253 | 218 | 628 | 633 | 633 | 203 | 93.12 |
01611760135244 | S | Mission San Jose High | Fremont Unified | Alameda | 518 | 453 | 653 | 699 | 671 | 421 | 92.94 |
LIMIT
#
Limits the number of rows to be returned. For example, the code below accomplishes the same thing as schools.head(5)
in pandas
.
%%sql
SELECT *
FROM schools
LIMIT 5
CDSCode | NCESDist | NCESSchool | StatusType | County | District | School | Street | StreetAbr | City | Zip | State | MailStreet | MailStrAbr | MailCity | MailZip | MailState | Phone | Ext | Website | OpenDate | ClosedDate | Charter | CharterNum | FundingType | DOC | DOCType | SOC | SOCType | EdOpsCode | EdOpsName | EILCode | EILName | GSoffered | GSserved | Virtual | Magnet | Latitude | Longitude | AdmFName1 | AdmLName1 | AdmEmail1 | AdmFName2 | AdmLName2 | AdmEmail2 | AdmFName3 | AdmLName3 | AdmEmail3 | LastUpdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01100170000000 | 0691051 | None | Active | Alameda | Alameda County Office of Education | None | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | 313 West Winton Avenue | 313 West Winton Ave. | Hayward | 94544-1136 | CA | (510) 887-0152 | None | www.acoe.org | None | None | None | None | None | 00 | County Office of Education (COE) | None | None | None | None | None | None | None | None | None | None | 37.658212 | -122.09713 | L Karen | Monroe | lkmonroe@acoe.org | None | None | None | None | None | None | 2015-06-23 |
01100170109835 | 0691051 | 10546 | Closed | Alameda | Alameda County Office of Education | FAME Public Charter | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | 39899 Balentine Drive, Suite 335 | 39899 Balentine Dr., Ste. 335 | Newark | 94560-5359 | CA | None | None | None | 2005-08-29 | 2015-07-31 | 1 | 0728 | Directly funded | 00 | County Office of Education (COE) | 65 | K-12 Schools (Public) | TRAD | Traditional | ELEMHIGH | Elementary-High Combination | K-12 | K-12 | P | 0 | 37.521436 | -121.99391 | None | None | None | None | None | None | None | None | None | 2015-09-01 |
01100170112607 | 0691051 | 10947 | Active | Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | 1515 Webster Street | 1515 Webster St. | Oakland | 94612-3355 | CA | 1515 Webster Street | 1515 Webster St. | Oakland | 94612 | CA | (510) 596-8901 | None | www.envisionacademy.org/ | 2006-08-28 | None | 1 | 0811 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.80452 | -122.26815 | Laura | Robell | laura@envisionacademy.org | None | None | None | None | None | None | 2015-06-18 |
01100170118489 | 0691051 | 12283 | Closed | Alameda | Alameda County Office of Education | Aspire California College Preparatory Academy | 2125 Jefferson Avenue | 2125 Jefferson Ave. | Berkeley | 94703-1414 | CA | 1001 22nd Avenue, Suite 100 | 1001 22nd Ave., Ste. 100 | Oakland | 94606 | CA | None | None | www.aspirepublicschools.org | 2008-08-21 | 2015-06-30 | 1 | 1049 | Directly funded | 00 | County Office of Education (COE) | 66 | High Schools (Public) | TRAD | Traditional | HS | High School | 9-12 | 9-12 | N | 0 | 37.868991 | -122.27844 | None | None | None | None | None | None | None | None | None | 2015-07-01 |
01100170123968 | 0691051 | 12844 | Active | Alameda | Alameda County Office of Education | Community School for Creative Education | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | 2111 International Boulevard | 2111 International Blvd. | Oakland | 94606-4903 | CA | (510) 686-4131 | None | www.communityschoolforcreativeeducation.org | 2011-08-22 | None | 1 | 1284 | Directly funded | 00 | County Office of Education (COE) | 60 | Elementary Schools (Public) | TRAD | Traditional | ELEM | Elementary | K-8 | K-7 | N | 0 | 37.784648 | -122.23863 | Clifford | Thompson | cliffordt@communityschoolforcreativeeducation.org | None | None | None | None | None | None | 2016-07-18 |
Note: SQL is designed for very fast and efficient use with extremely large databases. However, if you try to access all the rows at once, SQL won’t prevent them from being returned. This could crash your computer as you may not have enough space to accomodate such a large output. Please always LIMIT
your SQL queries unless you are 100% sure the output will not be excessively large. We haven’t limited our queries in this chapter because we were 100% sure that we are working with small-ish datasets; otherwise we would definitely limit our queries.