import pandas as pd
import numpy as np
import sqlalchemy
from pathlib import Path
%load_ext sql

SQL#

This chapter will introduce Structured Query Language, commonly abbreviated as SQL (commonly pronounced “sequel,” though this is the subject of fierce debate), and go over some basic commands. SQL is a special programming language designed to communicate with databases.

It is important to note that SQL is an entirely different programming language to Python. However, Jupyter notebooks allow us to use magic commands to perform SQL queries. We will be using these magic commands for the purposes of this demonstration. However, if you ever need to access SQL databases at a company, you will likely have to use a relational database management sysmtem (RDMS) to access SQL databases instead.

Let us begin by importing in a dataset on California schools’ SAT performances and poverty data in the 2014-2015 academic year. We have gotten California School SAT Performance and Poverty Data from Stanford’s Public Affairs Data Journalism I, Fall 2016.

Connecting to a SQL Database in Jupyter#

The cells below load in SQL cell magic and connect to our database. The important thing to understand is that our tmpdb Path specifies the path to our database, and then sqlite_conn and engine connect to the database. Finally, %sql engine allows us to use SQL magic commands to query our engine. After we have connected the engine, we can simply add %%sql at the top of a cell to convert it into a SQL cell!

# These lines specify the location of our database file
dbfile = 'cdeschools.sqlite'
tmpdb = Path('data') / dbfile

# Specify the database connection path (in this case, a sqlite database in a file)
sqlite_conn = 'sqlite:///' + str(tmpdb)
# Run this cell to connect to database
engine = sqlalchemy.create_engine(sqlite_conn)
connection = engine.connect()
%sql engine
%%sql
/* This is a SQL cell! This particular line is inside a SQL comment. 
The following line let us view all the tables inside our SQL database. */
SELECT * FROM sqlite_master WHERE type='table';
Running query in 'sqlite:///data/cdeschools.sqlite'
type name tbl_name rootpage sql
table schools schools 2 CREATE TABLE schools (
"CDSCode" VARCHAR(14) NOT NULL,
"NCESDist" VARCHAR(7),
"NCESSchool" VARCHAR(5),
"StatusType" VARCHAR(7) NOT NULL,
"County" VARCHAR(15) NOT NULL,
"District" VARCHAR(74) NOT NULL,
"School" VARCHAR(89),
"Street" VARCHAR(62),
"StreetAbr" VARCHAR(59),
"City" VARCHAR(25),
"Zip" VARCHAR(10),
"State" VARCHAR(4),
"MailStreet" VARCHAR(70),
"MailStrAbr" VARCHAR(70),
"MailCity" VARCHAR(22),
"MailZip" VARCHAR(10),
"MailState" VARCHAR(4),
"Phone" VARCHAR(14),
"Ext" VARCHAR(6),
"Website" VARCHAR(96),
"OpenDate" DATE,
"ClosedDate" DATE,
"Charter" BOOLEAN,
"CharterNum" VARCHAR(4),
"FundingType" VARCHAR(23),
"DOC" VARCHAR(2) NOT NULL,
"DOCType" VARCHAR(42) NOT NULL,
"SOC" VARCHAR(4),
"SOCType" VARCHAR(41),
"EdOpsCode" VARCHAR(7),
"EdOpsName" VARCHAR(43),
"EILCode" VARCHAR(8),
"EILName" VARCHAR(31),
"GSoffered" VARCHAR(16),
"GSserved" VARCHAR(5),
"Virtual" VARCHAR(4),
"Magnet" BOOLEAN,
"Latitude" FLOAT,
"Longitude" FLOAT,
"AdmFName1" VARCHAR(17),
"AdmLName1" VARCHAR(22),
"AdmEmail1" VARCHAR(49),
"AdmFName2" VARCHAR(15),
"AdmLName2" VARCHAR(18),
"AdmEmail2" VARCHAR(40),
"AdmFName3" VARCHAR(8),
"AdmLName3" VARCHAR(15),
"AdmEmail3" VARCHAR(39),
"LastUpdate" DATE NOT NULL,
CHECK ("Charter" IN (0, 1)),
CHECK ("Magnet" IN (0, 1))
)
table frpm frpm 9041 CREATE TABLE frpm (
"Academic Year" VARCHAR(9),
"County Code" VARCHAR(4),
"District Code" INTEGER,
"School Code" VARCHAR(7),
"County Name" VARCHAR(15),
"District Name" VARCHAR(75),
"School Name" VARCHAR(85),
"District Type" VARCHAR(32),
"School Type" VARCHAR(41),
"Educational Option Type" VARCHAR(43),
"NSLP Provision Status" VARCHAR(24),
"Charter School (Y/N)" BOOLEAN,
"Charter School Number" VARCHAR(4),
"Charter Funding Type" VARCHAR(23),
"IRC" BOOLEAN,
"Low Grade" VARCHAR(5),
"High Grade" VARCHAR(14),
"Enrollment (K-12)" FLOAT,
"Free Meal Count (K-12)" FLOAT,
"Percent (%) Eligible Free (K-12)" FLOAT,
"FRPM Count (K-12)" FLOAT,
"Percent (%) Eligible FRPM (K-12)" FLOAT,
"Enrollment (Ages 5-17)" FLOAT,
"Free Meal Count (Ages 5-17)" FLOAT,
"Percent (%) Eligible Free (Ages 5-17)" FLOAT,
"FRPM Count (Ages 5-17)" FLOAT,
"Percent (%) Eligible FRPM (Ages 5-17)" FLOAT,
"2013-14 CALPADS Fall 1 Certification Status" BOOLEAN,
CHECK ("Charter School (Y/N)" IN (0, 1)),
CHECK ("IRC" IN (0, 1)),
CHECK ("2013-14 CALPADS Fall 1 Certification Status" IN (0, 1))
)
table satscores satscores 8814 CREATE TABLE satscores (
cds VARCHAR(14) NOT NULL,
rtype VARCHAR(1) NOT NULL,
sname VARCHAR(50),
dname VARCHAR(50),
cname VARCHAR(15),
enroll12 INTEGER NOT NULL,
"NumTstTakr" INTEGER NOT NULL,
"AvgScrRead" INTEGER,
"AvgScrMath" INTEGER,
"AvgScrWrite" INTEGER,
"NumGE1500" INTEGER,
"PctGE1500" FLOAT
)

As you can see from the summary above, our database contains 3 tables schools, frpm (short for free or reduced price meals), and satscores. Each of these tables contains some important information for our analysis.

The column on the right tells you what SQL command was run to produce each table. Don’t worry about the specifics for now, but note that the column on the right includes all the column names and data types in the table. Read more about the various data types in SQL here.

Let us also quickly see the first 5 rows from each of the tables.

%%sql

SELECT * 
FROM schools 
LIMIT 5
Running query in 'sqlite:///data/cdeschools.sqlite'
CDSCode NCESDist NCESSchool StatusType County District School Street StreetAbr City Zip State MailStreet MailStrAbr MailCity MailZip MailState Phone Ext Website OpenDate ClosedDate Charter CharterNum FundingType DOC DOCType SOC SOCType EdOpsCode EdOpsName EILCode EILName GSoffered GSserved Virtual Magnet Latitude Longitude AdmFName1 AdmLName1 AdmEmail1 AdmFName2 AdmLName2 AdmEmail2 AdmFName3 AdmLName3 AdmEmail3 LastUpdate
01100170000000 0691051 None Active Alameda Alameda County Office of Education None 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA (510) 887-0152 None www.acoe.org None None None None None 00 County Office of Education (COE) None None None None None None None None None None 37.658212 -122.09713 L Karen Monroe lkmonroe@acoe.org None None None None None None 2015-06-23
01100170109835 0691051 10546 Closed Alameda Alameda County Office of Education FAME Public Charter 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA None None None 2005-08-29 2015-07-31 1 0728 Directly funded 00 County Office of Education (COE) 65 K-12 Schools (Public) TRAD Traditional ELEMHIGH Elementary-High Combination K-12 K-12 P 0 37.521436 -121.99391 None None None None None None None None None 2015-09-01
01100170112607 0691051 10947 Active Alameda Alameda County Office of Education Envision Academy for Arts & Technology 1515 Webster Street 1515 Webster St. Oakland 94612-3355 CA 1515 Webster Street 1515 Webster St. Oakland 94612 CA (510) 596-8901 None www.envisionacademy.org/ 2006-08-28 None 1 0811 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.80452 -122.26815 Laura Robell laura@envisionacademy.org None None None None None None 2015-06-18
01100170118489 0691051 12283 Closed Alameda Alameda County Office of Education Aspire California College Preparatory Academy 2125 Jefferson Avenue 2125 Jefferson Ave. Berkeley 94703-1414 CA 1001 22nd Avenue, Suite 100 1001 22nd Ave., Ste. 100 Oakland 94606 CA None None www.aspirepublicschools.org 2008-08-21 2015-06-30 1 1049 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.868991 -122.27844 None None None None None None None None None 2015-07-01
01100170123968 0691051 12844 Active Alameda Alameda County Office of Education Community School for Creative Education 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA (510) 686-4131 None www.communityschoolforcreativeeducation.org 2011-08-22 None 1 1284 Directly funded 00 County Office of Education (COE) 60 Elementary Schools (Public) TRAD Traditional ELEM Elementary K-8 K-7 N 0 37.784648 -122.23863 Clifford Thompson cliffordt@communityschoolforcreativeeducation.org None None None None None None 2016-07-18
%%sql

SELECT * 
FROM frpm 
LIMIT 5
Running query in 'sqlite:///data/cdeschools.sqlite'
Academic Year County Code District Code School Code County Name District Name School Name District Type School Type Educational Option Type NSLP Provision Status Charter School (Y/N) Charter School Number Charter Funding Type IRC Low Grade High Grade Enrollment (K-12) Free Meal Count (K-12) Percent (%) Eligible Free (K-12) FRPM Count (K-12) Percent (%) Eligible FRPM (K-12) Enrollment (Ages 5-17) Free Meal Count (Ages 5-17) Percent (%) Eligible Free (Ages 5-17) FRPM Count (Ages 5-17) Percent (%) Eligible FRPM (Ages 5-17) 2013-14 CALPADS Fall 1 Certification Status
2014-2015 01 10017 0109835 Alameda Alameda County Office of Education FAME Public Charter County Office of Education (COE) K-12 Schools (Public) Traditional None 1 0728 Directly funded 1 K 12 1087.0 565.0 0.5197792088316467 715.0 0.6577736890524379 1070.0 553.0 0.516822429906542 702.0 0.6560747663551402 1
2014-2015 01 10017 0112607 Alameda Alameda County Office of Education Envision Academy for Arts & Technology County Office of Education (COE) High Schools (Public) Traditional None 1 0811 Directly funded 1 9 12 395.0 186.0 0.4708860759493671 186.0 0.4708860759493671 376.0 182.0 0.48404255319148937 182.0 0.48404255319148937 1
2014-2015 01 10017 0118489 Alameda Alameda County Office of Education Aspire California College Preparatory Academy County Office of Education (COE) High Schools (Public) Traditional None 1 1049 Directly funded 1 9 12 244.0 134.0 0.5491803278688525 175.0 0.7172131147540983 230.0 128.0 0.5565217391304348 168.0 0.7304347826086957 1
2014-2015 01 10017 0123968 Alameda Alameda County Office of Education Community School for Creative Education County Office of Education (COE) Elementary Schools (Public) Traditional Breakfast Provision 2 1 1284 Directly funded 1 K 8 191.0 113.0 0.5916230366492147 139.0 0.7277486910994765 190.0 113.0 0.5947368421052631 139.0 0.7315789473684211 1
2014-2015 01 10017 0124172 Alameda Alameda County Office of Education Yu Ming Charter County Office of Education (COE) Elementary Schools (Public) Traditional None 1 1296 Directly funded 1 K 8 257.0 14.0 0.054474708171206226 21.0 0.08171206225680934 257.0 14.0 0.054474708171206226 21.0 0.08171206225680934 1
%%sql

SELECT * 
FROM satscores 
LIMIT 5
Running query in 'sqlite:///data/cdeschools.sqlite'
cds rtype sname dname cname enroll12 NumTstTakr AvgScrRead AvgScrMath AvgScrWrite NumGE1500 PctGE1500
00000000000000 X None None None 496901 210706 489 500 484 93334 44.3
01000000000000 C None None Alameda 16978 8855 516 536 517 4900 55.34
01100170000000 D None Alameda County Office of Education Alameda 398 88 418 418 417 14 15.91
01100170109835 S FAME Public Charter Alameda County Office of Education Alameda 62 17 503 546 505 9 52.94
01100170112607 S Envision Academy for Arts & Technology Alameda County Office of Education Alameda 75 71 397 387 395 5 7.04

Without further ado, let’s begin discussing SQL’s syntax!

Syntax#

It is common practice to capitalize all keywords in SQL queries. We shall stick to this common practice throughout this section.

While pandas doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. You can read more about how Jupyter SQL cell magics handles quotes here. However, this may depend on the type of SQL you’re using - other types only use single quotes and no double quotes.

Furthermore, SQL queries have a certain order (i.e. they ought to start with the SELECT statement and end with the LIMIT clause if it exists). The order is as follows:

SELECT <columns>
FROM <table>
[INNER/LEFT/RIGHT/OUTER JOIN <table>]
[ON <columns> = <columns>]
[WHERE <condition>]
[GROUP BY <columns>]
[ORDER BY <columns>]
[LIMIT <rows>]

In the above order, the hard bracket []denote optional portions (note that everything except SELECT and FROM are optional). We have stuck to this order when making this chapter.

SELECT#

Every single SQL query must include a SELECT statement with a FROM command. The SELECT statement tells SQL which columns you are interested in, and the FROM command tells SQL which tables those columns are in. To select all the columns from a table, we can use * in the select statement. For example, let us select all the columns from the table schools.

%%sql

SELECT * 
FROM schools 
Running query in 'sqlite:///data/cdeschools.sqlite'
CDSCode NCESDist NCESSchool StatusType County District School Street StreetAbr City Zip State MailStreet MailStrAbr MailCity MailZip MailState Phone Ext Website OpenDate ClosedDate Charter CharterNum FundingType DOC DOCType SOC SOCType EdOpsCode EdOpsName EILCode EILName GSoffered GSserved Virtual Magnet Latitude Longitude AdmFName1 AdmLName1 AdmEmail1 AdmFName2 AdmLName2 AdmEmail2 AdmFName3 AdmLName3 AdmEmail3 LastUpdate
01100170000000 0691051 None Active Alameda Alameda County Office of Education None 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA (510) 887-0152 None www.acoe.org None None None None None 00 County Office of Education (COE) None None None None None None None None None None 37.658212 -122.09713 L Karen Monroe lkmonroe@acoe.org None None None None None None 2015-06-23
01100170109835 0691051 10546 Closed Alameda Alameda County Office of Education FAME Public Charter 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA None None None 2005-08-29 2015-07-31 1 0728 Directly funded 00 County Office of Education (COE) 65 K-12 Schools (Public) TRAD Traditional ELEMHIGH Elementary-High Combination K-12 K-12 P 0 37.521436 -121.99391 None None None None None None None None None 2015-09-01
01100170112607 0691051 10947 Active Alameda Alameda County Office of Education Envision Academy for Arts & Technology 1515 Webster Street 1515 Webster St. Oakland 94612-3355 CA 1515 Webster Street 1515 Webster St. Oakland 94612 CA (510) 596-8901 None www.envisionacademy.org/ 2006-08-28 None 1 0811 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.80452 -122.26815 Laura Robell laura@envisionacademy.org None None None None None None 2015-06-18
01100170118489 0691051 12283 Closed Alameda Alameda County Office of Education Aspire California College Preparatory Academy 2125 Jefferson Avenue 2125 Jefferson Ave. Berkeley 94703-1414 CA 1001 22nd Avenue, Suite 100 1001 22nd Ave., Ste. 100 Oakland 94606 CA None None www.aspirepublicschools.org 2008-08-21 2015-06-30 1 1049 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.868991 -122.27844 None None None None None None None None None 2015-07-01
01100170123968 0691051 12844 Active Alameda Alameda County Office of Education Community School for Creative Education 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA (510) 686-4131 None www.communityschoolforcreativeeducation.org 2011-08-22 None 1 1284 Directly funded 00 County Office of Education (COE) 60 Elementary Schools (Public) TRAD Traditional ELEM Elementary K-8 K-7 N 0 37.784648 -122.23863 Clifford Thompson cliffordt@communityschoolforcreativeeducation.org None None None None None None 2016-07-18
01100170124172 0691051 12901 Active Alameda Alameda County Office of Education Yu Ming Charter 1086 Alcatraz Avenue 1086 Alcatraz Ave. Oakland 94608-1265 CA 1086 Alcatraz Avenue 1086 Alcatraz Ave. Oakland 94608-1265 CA (510) 452-2063 None www.yumingschool.org 2011-08-09 None 1 1296 Directly funded 00 County Office of Education (COE) 60 Elementary Schools (Public) TRAD Traditional ELEM Elementary K-8 K-5 N 0 37.847375 -122.28356 Sue Park spark@yumingschool.org None None None None None None 2016-05-10
01100170125567 0691051 13008 Active Alameda Alameda County Office of Education Urban Montessori Charter 5328 Brann Street 5328 Brann St. Oakland 94619-3312 CA 5328 Brann Street 5328 Brann St. Oakland 94619-3312 CA (510) 842-1181 None www.urbanmontessori.org 2012-08-27 None 1 1383 Directly funded 00 County Office of Education (COE) 60 Elementary Schools (Public) TRAD Traditional ELEM Elementary K-8 K-5 N 0 37.778352 -122.1895 David Castillo davidc@urbanmontessori.org None None None None None None 2015-10-09
01100170130302 0691051 None Closed Alameda Alameda County Office of Education Technical, Agricultural & Nat. 597 C Street 597 C St. Hayward 94541 CA 597 C Street 597 C St. Hayward 94541 CA None None None 1980-07-01 1989-06-30 0 None None 00 County Office of Education (COE) 10 County Community None None UG Ungraded None None None None None None None None None None None None None None None 1999-06-24
01100170130401 0691051 09264 Active Alameda Alameda County Office of Education Alameda County Juvenile Hall/Court 2500 Fairmont Avenue 2500 Fairmont Ave. San Leandro 94578-1005 CA 2500 Fairmont Avenue 2500 Fairmont Ave. San Leandro 94578-1005 CA (510) 667-7435 None www.acoe.org 1980-07-01 None 0 None None 00 County Office of Education (COE) 14 Juvenile Court Schools JUV Juvenile Court School HS High School K-12 8-12 N 0 37.712878 -122.11173 Michael Hermosillo mhermosillo@acoe.org None None None None None None 2016-05-10
01100170130419 0691051 06830 Active Alameda Alameda County Office of Education Alameda County Community 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA (510) 670-4590 None www.acoe.org 1980-07-01 None 0 None None 00 County Office of Education (COE) 10 County Community COMM County Community School HS High School K-12 7-12 P 0 37.658212 -122.09713 Carolyn Hobbs chobbs@acoe.org None None None None None None 2016-10-06
Truncated to displaylimit of 10.

AS#

We can rename columns using the AS keyword within the SELECT statement. For example, let us select just the school code column CDSCode from the table schools, and rename it to be County_District_Code.

%%sql

SELECT CDSCode AS County_District_Code
FROM schools 
Running query in 'sqlite:///data/cdeschools.sqlite'
County_District_Code
01100170000000
01100170109835
01100170112607
01100170118489
01100170123968
01100170124172
01100170125567
01100170130302
01100170130401
01100170130419
Truncated to displaylimit of 10.

CAST#

We can use the CAST function within the SELECT statement to convert the data types of columns if necessary. For example, we can see from the summary of all the tables that the District Code column in the frpm table is stored as an integer. However, since the district code is a qualitative ordinal variable rather than a quantative one (review the types of variables here), it would make more sense to store it as a varchar (review the types of data in SQL here). Let us make this change.

%%sql

SELECT CAST("District Code" AS varchar)
FROM frpm
Running query in 'sqlite:///data/cdeschools.sqlite'
CAST("District Code" AS varchar)
None
None
10017
10017
10017
10017
10017
10017
10017
10017
Truncated to displaylimit of 10.

If you like, you can rename this new column to be something easier to interpret.

CASE#

The CASE expression helps us create new categorical columns based on other columns in SQL. The general syntax is:

CASE
    WHEN condition1 THEN output1
    WHEN condition2 THEN output2
    WHEN conditionN THEN outputN
    ELSE outputN+1
END AS colname

For example, let us generate a new column which classifies school sizes into small, medium or large based on the number of students they have enrolled in grade 12 (represented by the enroll12 column in the satscores table).

%%sql

SELECT sname, enroll12, CASE
        WHEN enroll12 <= 50 THEN "small"
        WHEN enroll12 > 2000 THEN "large"
        ELSE "medium"
    END AS school_size
FROM satscores
Running query in 'sqlite:///data/cdeschools.sqlite'
sname enroll12 school_size
None 496901 large
None 16978 large
None 398 medium
FAME Public Charter 62 medium
Envision Academy for Arts & Technology 75 medium
Aspire California College Preparatory Academy 61 medium
None 922 medium
Alameda Science and Technology Institute 36 small
Nea Community Learning Center 12 small
Alameda High 465 medium
Truncated to displaylimit of 10.

You will notice that the first 3 rows have no school name. That is because they correspond to state/county/district-wide data rather than data from a specific school. We can filter the dataset to only include schools using the WHERE clause.

DISTINCT#

The DISTINCT keyword returns all the unique values in a column. For example, the code below returns all the unique values in the column rtype from the table satscores.

%%sql

SELECT DISTINCT(rtype)
FROM satscores
Running query in 'sqlite:///data/cdeschools.sqlite'
rtype
X
C
D
S

As an aside, the column rtype represents the type of region being covered by the row. S represents a single school, D represents a school district, C represents a county and X represents the entire state.

JOIN#

Joins in SQL work very similarly to joins in pandas. The general syntax of a JOIN clause is as follows:

SELECT <columns>
FROM <table>
(INNER)/(LEFT)/(RIGHT)/(FULL OUTER) JOIN <table>
ON <columns> = <columns>

For example, let us inner join the schools and satscores tables on their shared CDS Code column.

%%sql

SELECT schools."County", schools."District", schools."School", 
    satscores."NumTstTakr", satscores."AvgScrRead", satscores."AvgScrMath", satscores."PctGE1500"
FROM schools
INNER JOIN satscores
ON schools."CDSCode" = satscores.cds
Running query in 'sqlite:///data/cdeschools.sqlite'
County District School NumTstTakr AvgScrRead AvgScrMath PctGE1500
Alameda Alameda County Office of Education None 88 418 418 15.91
Alameda Alameda County Office of Education FAME Public Charter 17 503 546 52.94
Alameda Alameda County Office of Education Envision Academy for Arts & Technology 71 397 387 7.04
Alameda Alameda County Office of Education Aspire California College Preparatory Academy 0 None None None
Alameda Alameda Unified None 544 521 546 61.21
Alameda Alameda Unified Alameda Science and Technology Institute 36 562 590 80.56
Alameda Alameda Unified Nea Community Learning Center 6 None None None
Alameda Alameda Unified Alameda High 325 543 573 70.46
Alameda Alameda Unified Alameda Community Learning Center 20 575 593 80.0
Alameda Alameda Unified Alternatives in Action 23 352 350 4.35
Truncated to displaylimit of 10.

As you can see, when specifying columns in the query above, I specified which tables the columns came from. Technically this isn’t necessary if both tables don’t share the column name, but it is still good practice to do so.

WHERE#

The WHERE clause helps us filter rows we’re interested in. For example, we can filter the satscores dataset to include all schools where over 50% of the school got over a 1500 on the SAT (as captured in the PctGE1500 column).

%%sql

SELECT *
FROM satscores
WHERE PctGE1500 >= 50
Running query in 'sqlite:///data/cdeschools.sqlite'
cds rtype sname dname cname enroll12 NumTstTakr AvgScrRead AvgScrMath AvgScrWrite NumGE1500 PctGE1500
01000000000000 C None None Alameda 16978 8855 516 536 517 4900 55.34
01100170109835 S FAME Public Charter Alameda County Office of Education Alameda 62 17 503 546 505 9 52.94
01611190000000 D None Alameda Unified Alameda 922 544 521 546 519 333 61.21
01611190106401 S Alameda Science and Technology Institute Alameda Unified Alameda 36 36 562 590 555 29 80.56
01611190130229 S Alameda High Alameda Unified Alameda 465 325 543 573 543 229 70.46
01611190130609 S Alameda Community Learning Center Alameda Unified Alameda 26 20 575 593 581 16 80.0
01611270000000 D None Albany City Unified Alameda 289 178 572 599 574 136 76.4
01611270130450 S Albany High Albany City Unified Alameda 273 178 572 599 574 136 76.4
01611430000000 D None Berkeley Unified Alameda 885 490 566 562 559 337 68.78
01611430131177 S Berkeley High Berkeley Unified Alameda 741 450 583 580 574 335 74.44
Truncated to displaylimit of 10.

If we’re interested in filtering based on string data, it is recommended to use the LIKE operator. As a reminder, while pandas doesn’t differentiate between single quotes (‘’) and double quotes (“”), SQL does. When I find it easiest to just use double quotes when denoting (string) column names in Jupyter SQL cell magic, you can read more about the details here. For example, let us filter the dataset to include all the rows corresponding to a school (as indicated by the rtype column being S).

%%sql

SELECT *
FROM satscores
WHERE rtype LIKE "S" 
Running query in 'sqlite:///data/cdeschools.sqlite'
cds rtype sname dname cname enroll12 NumTstTakr AvgScrRead AvgScrMath AvgScrWrite NumGE1500 PctGE1500
01100170109835 S FAME Public Charter Alameda County Office of Education Alameda 62 17 503 546 505 9 52.94
01100170112607 S Envision Academy for Arts & Technology Alameda County Office of Education Alameda 75 71 397 387 395 5 7.04
01100170118489 S Aspire California College Preparatory Academy Alameda County Office of Education Alameda 61 0 None None None None None
01611190106401 S Alameda Science and Technology Institute Alameda Unified Alameda 36 36 562 590 555 29 80.56
01611190119222 S Nea Community Learning Center Alameda Unified Alameda 12 6 None None None None None
01611190130229 S Alameda High Alameda Unified Alameda 465 325 543 573 543 229 70.46
01611190130609 S Alameda Community Learning Center Alameda Unified Alameda 26 20 575 593 581 16 80.0
01611190130625 S Alternatives in Action Alameda Unified Alameda 41 23 352 350 343 1 4.35
01611190132878 S Encinal High Alameda Unified Alameda 238 132 483 504 476 56 42.42
01611190134304 S Island High (Continuation) Alameda Unified Alameda 100 2 None None None None None
Truncated to displaylimit of 10.

When using the LIKE operator, the character % represents a wildcard which matches any string (similar to the .* pattern in RegEx). For example, let us say we’re interested in all high school names which end with the word ‘High’. As we’re only interested in high schools, we want to put filters on both the rtype and the sname columns. We extract those rows using the code below.

%%sql

SELECT *
FROM satscores
WHERE rtype LIKE "S" 
AND sname LIKE "%High" 
Running query in 'sqlite:///data/cdeschools.sqlite'
cds rtype sname dname cname enroll12 NumTstTakr AvgScrRead AvgScrMath AvgScrWrite NumGE1500 PctGE1500
01611190130229 S Alameda High Alameda Unified Alameda 465 325 543 573 543 229 70.46
01611190132878 S Encinal High Alameda Unified Alameda 238 132 483 504 476 56 42.42
01611270130450 S Albany High Albany City Unified Alameda 273 178 572 599 574 136 76.4
01611430122697 S REALM Charter High Berkeley Unified Alameda 76 36 384 360 386 2 5.56
01611430131177 S Berkeley High Berkeley Unified Alameda 741 450 583 580 574 335 74.44
01611500132225 S Castro Valley High Castro Valley Unified Alameda 688 438 527 556 524 273 62.33
01611760130062 S American High Fremont Unified Alameda 457 309 573 608 592 243 78.64
01611760134270 S Irvington High Fremont Unified Alameda 500 364 590 634 606 309 84.89
01611760134452 S John F. Kennedy High Fremont Unified Alameda 368 179 485 509 485 77 43.02
01611760135244 S Mission San Jose High Fremont Unified Alameda 518 453 653 699 671 421 92.94
Truncated to displaylimit of 10.

% was just an example, read more about the different wildcards in SQL here.

GROUP BY#

The SQL GROUP BY statement accomplishes the exact same task as the .groupby() in pandas (review the Pandas subchapter if necessary). To specify which column you want to group by, you simply add that into the query, following the syntax below:

SELECT <columns>
FROM <table>
GROUP BY <columns>

Aggregate#

Specifying the aggregate function in SQL is not always intuitive for those who are familar with pandas. In SQL, you specify the aggregate function inside the SELECT statement. For example, let us say you are interested in the total number of test takers across all high schools in the different countries. In pandas, you could do:

satscores[satscores['rtype'] == 'S'].groupby('cname')['NumTstTakr'].sum()

However, in SQL, you must write the code below.

%%sql

SELECT cname, SUM(NumTstTakr)
FROM satscores
WHERE rtype = "S"
GROUP BY cname
Running query in 'sqlite:///data/cdeschools.sqlite'
cname SUM(NumTstTakr)
Alameda 8855
Amador 97
Butte 636
Calaveras 125
Colusa 104
Contra Costa 7027
Del Norte 81
El Dorado 938
Fresno 5907
Glenn 107
Truncated to displaylimit of 10.

An advantage of the SQL approach is that it’s quite easy to compute multiple aggregations at once.

%%sql

SELECT cname, SUM(NumTstTakr), AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY cname
Running query in 'sqlite:///data/cdeschools.sqlite'
cname SUM(NumTstTakr) AVG(PctGE1500)
Alameda 8855 40.14071428571429
Amador 97 59.925000000000004
Butte 636 51.753
Calaveras 125 62.885000000000005
Colusa 104 30.33
Contra Costa 7027 43.969722222222224
Del Norte 81 45.45
El Dorado 938 66.09428571428572
Fresno 5907 25.784848484848485
Glenn 107 33.626666666666665
Truncated to displaylimit of 10.

SQL comes with multiple aggregate functions, including COUNT, MIN, MAX, SUM, AVG, etc. Feel free to google around for more examples of aggregate functions and creative ways to use the functions above!

HAVING#

The HAVING clause in a SQL GROUP BY is equivalent to the filter() in a pandas groupby (review the pandas filter here if necessary). It goes directly after the GROUP BY statement. For example, suppose we want to identify school districts where the average percentage of students scoring above 1500 on the SAT is less than 10%. We could write the following SQL code to achieve this.

%%sql

SELECT dname, cname, AVG(PctGE1500)
FROM satscores
WHERE rtype = "S"
GROUP BY dname
HAVING AVG(PctGE1500) <= 10
Running query in 'sqlite:///data/cdeschools.sqlite'
dname cname AVG(PctGE1500)
Calistoga Joint Unified Napa 7.69
Compton Unified Los Angeles 7.016666666666667
Delhi Unified Merced 9.09
Farmersville Unified Tulare 7.59
Firebaugh-Las Deltas Unified Fresno 4.76
Golden Plains Unified Fresno 7.27
Le Grand Union High Merced 8.47
Lucerne Valley Unified San Bernardino 0.0
Lynwood Unified Los Angeles 9.934999999999999
Mendota Unified Fresno 6.45
Truncated to displaylimit of 10.

Note how we easily performed both an aggregation and a filter in the code above. The ease of writing this is an advantage of SQL’s groupby approach.

ORDER BY#

Orders the table by a given column, in ascending (using ASC) or descending (using DESC) order. For example, let us say we’re interested in which schools have the highest proportion of students scoring above a 1500 on the SAT. We could use the code below to accomplish this.

%%sql

SELECT *
FROM satscores
WHERE rtype = "S"
ORDER BY PctGE1500 DESC
Running query in 'sqlite:///data/cdeschools.sqlite'
cds rtype sname dname cname enroll12 NumTstTakr AvgScrRead AvgScrMath AvgScrWrite NumGE1500 PctGE1500
30664313030616 S Oxford Academy Anaheim Union High Orange 176 167 627 669 634 164 98.2
19642121931880 S Whitney (Gretchen) High ABC Unified Los Angeles 164 154 639 687 661 151 98.05
43694684334421 S Lynbrook High Fremont Union High Santa Clara 472 398 639 698 664 382 95.98
43694684334462 S Monta Vista High Fremont Union High Santa Clara 608 513 638 691 657 487 94.93
43695344337762 S Saratoga High Los Gatos-Saratoga Joint Union High Santa Clara 362 319 639 674 650 302 94.67
43696414332904 S Henry M. Gunn High Palo Alto Unified Santa Clara 447 330 642 686 646 310 93.94
44104474430252 S Pacific Collegiate Charter Santa Cruz County Office of Education Santa Cruz 71 66 638 638 630 62 93.94
19642611995299 S Rancho Learning Center (Alternative) Arcadia Unified Los Angeles 38 32 605 641 601 30 93.75
07616300734244 S Miramonte High Acalanes Union High Contra Costa 253 218 628 633 633 203 93.12
01611760135244 S Mission San Jose High Fremont Unified Alameda 518 453 653 699 671 421 92.94
Truncated to displaylimit of 10.

LIMIT#

Limits the number of rows to be returned. For example, the code below accomplishes the same thing as schools.head(5) in pandas.

%%sql

SELECT *
FROM schools
LIMIT 5
Running query in 'sqlite:///data/cdeschools.sqlite'
CDSCode NCESDist NCESSchool StatusType County District School Street StreetAbr City Zip State MailStreet MailStrAbr MailCity MailZip MailState Phone Ext Website OpenDate ClosedDate Charter CharterNum FundingType DOC DOCType SOC SOCType EdOpsCode EdOpsName EILCode EILName GSoffered GSserved Virtual Magnet Latitude Longitude AdmFName1 AdmLName1 AdmEmail1 AdmFName2 AdmLName2 AdmEmail2 AdmFName3 AdmLName3 AdmEmail3 LastUpdate
01100170000000 0691051 None Active Alameda Alameda County Office of Education None 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA 313 West Winton Avenue 313 West Winton Ave. Hayward 94544-1136 CA (510) 887-0152 None www.acoe.org None None None None None 00 County Office of Education (COE) None None None None None None None None None None 37.658212 -122.09713 L Karen Monroe lkmonroe@acoe.org None None None None None None 2015-06-23
01100170109835 0691051 10546 Closed Alameda Alameda County Office of Education FAME Public Charter 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA 39899 Balentine Drive, Suite 335 39899 Balentine Dr., Ste. 335 Newark 94560-5359 CA None None None 2005-08-29 2015-07-31 1 0728 Directly funded 00 County Office of Education (COE) 65 K-12 Schools (Public) TRAD Traditional ELEMHIGH Elementary-High Combination K-12 K-12 P 0 37.521436 -121.99391 None None None None None None None None None 2015-09-01
01100170112607 0691051 10947 Active Alameda Alameda County Office of Education Envision Academy for Arts & Technology 1515 Webster Street 1515 Webster St. Oakland 94612-3355 CA 1515 Webster Street 1515 Webster St. Oakland 94612 CA (510) 596-8901 None www.envisionacademy.org/ 2006-08-28 None 1 0811 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.80452 -122.26815 Laura Robell laura@envisionacademy.org None None None None None None 2015-06-18
01100170118489 0691051 12283 Closed Alameda Alameda County Office of Education Aspire California College Preparatory Academy 2125 Jefferson Avenue 2125 Jefferson Ave. Berkeley 94703-1414 CA 1001 22nd Avenue, Suite 100 1001 22nd Ave., Ste. 100 Oakland 94606 CA None None www.aspirepublicschools.org 2008-08-21 2015-06-30 1 1049 Directly funded 00 County Office of Education (COE) 66 High Schools (Public) TRAD Traditional HS High School 9-12 9-12 N 0 37.868991 -122.27844 None None None None None None None None None 2015-07-01
01100170123968 0691051 12844 Active Alameda Alameda County Office of Education Community School for Creative Education 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA 2111 International Boulevard 2111 International Blvd. Oakland 94606-4903 CA (510) 686-4131 None www.communityschoolforcreativeeducation.org 2011-08-22 None 1 1284 Directly funded 00 County Office of Education (COE) 60 Elementary Schools (Public) TRAD Traditional ELEM Elementary K-8 K-7 N 0 37.784648 -122.23863 Clifford Thompson cliffordt@communityschoolforcreativeeducation.org None None None None None None 2016-07-18

Note: SQL is designed for very fast and efficient use with extremely large databases. However, if you try to access all the rows at once, SQL won’t prevent them from being returned. This could crash your computer as you may not have enough space to accomodate such a large output. Please always LIMIT your SQL queries unless you are 100% sure the output will not be excessively large. We haven’t limited our queries in this chapter because we were 100% sure that we are working with small-ish datasets; otherwise we would definitely limit our queries.