John Snow’s 1854 map of the Soho cholera outbreak is a foundational case study in data visualization 🗺️. By plotting deaths as individual black bars at specific addresses, Snow provided spatial evidence that challenged the prevailing “miasma” theory (the belief that disease spread through “bad air”) and identified the Broad Street pump as the source of the contagion 🧪.
To get started with your lab, we will use a digitized version of the 1854 Soho data that includes modern GPS coordinates.
The most reliable source for this exercise is the Robin Wilson dataset, which has been formatted into CSVs. You can read these directly into Pandas using the URLs below:
https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/deaths.csv
https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/pumps.csv
This script handles the heavy lifting: it loads the data, centers a map on the historic Broad Street pump, and layers on the density.
Folium is a powerful Python library used to create interactive maps 🗺️. It acts as a bridge between Python’s data manipulation capabilities and Leaflet.js, a popular JavaScript library for mobile-friendly interactive maps.
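With Folium, you can place markers and popups on a map, add layers such as heatmaps, and view the result in a notebook or save it as an HTML file.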
To get a map running, you only need a few lines. This example centers a map on the Broad Street Pump coordinates and adds a simple marker.
import folium
# 1. Create a Map object
# 'location' takes [latitude, longitude]
# 'tiles' changes the background style
study_area = folium.Map(location=[51.5132, -0.1367], zoom_start=17, tiles="OpenStreetMap")
# 2. Add a simple Marker
folium.Marker(
    location=[51.5132, -0.1367],
    popup="Broad Street Pump",
    tooltip="Click for info",
    icon=folium.Icon(color="red", icon="info-sign")
).add_to(study_area)
# 3. Display the map
study_area
import numpy as np
import pandas as pd
import folium
from folium.plugins import HeatMap
# Optional: load the digitized data instead of generating synthetic points.
# These two lines are left commented out, so the rest of this script uses the synthetic data below.
# deaths = pd.read_csv("https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/deaths.csv")
# pumps = pd.read_csv("https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/pumps.csv")
# 1. Generate Synthetic Data
# Define the main Broad Street Pump location
broad_st_pump = [51.5132, -0.1367]
# Create 50 deaths clustered tightly around the Broad Street pump
# np.random.normal adds a small 'jitter' to the coordinates
lat_cluster = np.random.normal(51.5132, 0.0005, 50)
lon_cluster = np.random.normal(-0.1367, 0.0005, 50)
# Create a small DataFrame for these synthetic deaths
deaths = pd.DataFrame({'Lat': lat_cluster, 'Lon': lon_cluster})
# Create a simple DataFrame for 2 pumps
pumps = pd.DataFrame({
    'Pump_Name': ['Broad Street Pump', 'Oxford Street Pump'],
    'Lat': [51.5132, 51.5150],
    'Lon': [-0.1367, -0.1350]
})
# 🛠️ Now plot! Without looking at the code below!
# 2. Initialize the map (Centered on Soho, London)
# Coordinates: 51.5132, -0.1367
m = folium.Map(location=[51.5132, -0.1367], zoom_start=17, tiles="cartodbpositron")
# 3. Add Pumps as markers
for _, pump in pumps.iterrows():
    folium.Marker(
        location=[pump['Lat'], pump['Lon']],
        popup=pump['Pump_Name'],
        icon=folium.Icon(color='blue', icon='tint')
    ).add_to(m)
# 4. Create the HeatMap
# Because we have individual records, we just need a list of [lat, lon] pairs.
heat_data = deaths[['Lat', 'Lon']].values.tolist()
# The 'radius' and 'blur' determine how the "heat" spreads between points
HeatMap(heat_data, radius=15, blur=20).add_to(m)
# 5. Display the map
m
In this setup, we didn’t specify a “weight” for the points. Folium’s HeatMap simply looks at the coordinate list and says, “There is 1 death at this exact spot.” When ten rows have nearly identical coordinates, the color turns from cool blue to a “hot” red.
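If several deaths share one address, you can make the weight explicit instead of relying on repeated rows. The sketch below is one way to do that, not part of the original script: `to_weighted_points` is a hypothetical helper, the sample coordinates are made up for illustration, and the rounding precision is an assumption you may want to change.

```python
from collections import Counter


def to_weighted_points(points, decimals=5):
    """Collapse repeated coordinates into [lat, lon, weight] triples.

    Coordinates are rounded to `decimals` places first, so points that
    differ only by tiny digitization noise count as the same address.
    """
    counts = Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )
    return [[lat, lon, float(n)] for (lat, lon), n in counts.items()]


# Hypothetical example: three deaths at one address, one at another
points = [
    [51.51320, -0.13670],
    [51.51320, -0.13670],
    [51.51320, -0.13670],
    [51.51400, -0.13500],
]
weighted = to_weighted_points(points)
print(weighted)
```

Folium's HeatMap accepts `[lat, lon, weight]` triples as well as `[lat, lon]` pairs, so `weighted` can be passed in place of `heat_data`, for example `HeatMap(weighted, radius=15, blur=20).add_to(m)`.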
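Since we are trying to make the case for a link between a pump and the deaths, the visual contrast is key. A heatmap shows spatial association rather than causation, so treat it as evidence to discuss, not as proof.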
Looking at the code above, the radius and blur parameters in HeatMap are essentially your “statistical tuning knobs.” If you set the radius too high, the whole map becomes a red blob; too low, and it looks like a scattered rash.
How do you think changing the radius might affect your students’ ability to identify the specific pump responsible for the outbreak? 🧐
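One way to check the visual impression against numbers is to assign each death to its nearest pump and count the results. The sketch below assumes straight-line (haversine) distance, which ignores the street layout, and uses a handful of made-up death coordinates. `haversine_m` and `nearest_pump_counts` are hypothetical helper names.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_pump_counts(deaths, pumps):
    """Count how many deaths lie closest to each pump."""
    counts = {name: 0 for name in pumps}
    for lat, lon in deaths:
        nearest = min(pumps, key=lambda name: haversine_m(lat, lon, *pumps[name]))
        counts[nearest] += 1
    return counts


# Pump coordinates match the synthetic example; the deaths are illustrative only
pumps = {
    "Broad Street Pump": (51.5132, -0.1367),
    "Oxford Street Pump": (51.5150, -0.1350),
}
deaths = [
    (51.5133, -0.1366),
    (51.5130, -0.1369),
    (51.5134, -0.1370),
    (51.5149, -0.1351),
]
tally = nearest_pump_counts(deaths, pumps)
print(tally)  # {'Broad Street Pump': 3, 'Oxford Street Pump': 1}
```

Unlike the heatmap, this tally does not change with `radius` or `blur`, which makes it a useful cross-check when students disagree about what the colors show.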
Visualizing Uncertainty: How can we use Python to show where the data might be “fuzzy” because of how it was digitized from a paper map? 📜
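One possible approach, sketched under assumptions: replace each recorded point with several randomly jittered copies whose weights sum to one, so the heatmap smears each death over the area where it might really have been. The positional error used here (`sigma_deg=0.0001`, roughly 11 m of latitude) is a guess for illustration, not a measured digitization error, and `fuzz_points` is a hypothetical helper.

```python
import random


def fuzz_points(points, sigma_deg=0.0001, copies=20, seed=0):
    """Spread each point into `copies` jittered points of weight 1/copies.

    The total weight per original point stays 1, so the overall
    intensity of the heatmap is preserved while its edges soften.
    """
    rng = random.Random(seed)  # seeded so every student gets the same map
    fuzzy = []
    for lat, lon in points:
        for _ in range(copies):
            fuzzy.append([
                rng.gauss(lat, sigma_deg),
                rng.gauss(lon, sigma_deg),
                1.0 / copies,
            ])
    return fuzzy


# Two illustrative points near the Broad Street pump
points = [[51.5132, -0.1367], [51.5140, -0.1350]]
fuzzy = fuzz_points(points)
total_weight = sum(w for _, _, w in fuzzy)
print(len(fuzzy), round(total_weight, 6))  # 40 2.0
```

The resulting list has the `[lat, lon, weight]` shape that HeatMap accepts, so students can compare a map built from the raw points with one built from `fuzzy`.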
Scavenger hunt exercise: generate new data and determine the source of a COVID-19 outbreak.
This is a way to bridge 19th-century epidemiological methods with a 21st-century context. We are going to move the “detective work” to Wuhan, China, focusing on the early days of the COVID-19 pandemic.
In this scenario, your students are “Digital Epidemiologists.” They have been handed a “noisy” dataset of hospital admissions and must determine if there is a single point of origin or if the spread is truly random.
The Backstory: It’s early January 2020. Hospitals across Wuhan are reporting a “pneumonia of unknown cause.” Your task is to map the first 500 reported cases. If John Snow was right, the “Pump” (the source) will be at the heart of the highest density cluster.
Students will run this block first to create their “Evidence Files” (wuhan_cases.csv and wuhan_pois.csv).
import pandas as pd
import numpy as np
# 1. Set the "Hidden" Source: Huanan Seafood Market
# Coordinates: 30.6195, 114.2577
market_lat, market_lon = 30.6195, 114.2577
# 2. Generate 500 Synthetic Cases
# 70% of cases are tightly clustered around the market (The Source)
cluster_count = 350
cluster_lats = np.random.normal(market_lat, 0.005, cluster_count)
cluster_lons = np.random.normal(market_lon, 0.005, cluster_count)
# 30% are scattered randomly across the city (Community spread/noise)
noise_count = 150
noise_lats = np.random.uniform(30.50, 30.70, noise_count)
noise_lons = np.random.uniform(114.20, 114.40, noise_count)
# Combine into a DataFrame
df_cases = pd.DataFrame({
    'case_id': range(500),
    'lat': np.concatenate([cluster_lats, noise_lats]),
    'lon': np.concatenate([cluster_lons, noise_lons])
})
# 3. List of Potential "Sources" (The Scavenger Hunt Targets)
df_pois = pd.DataFrame({
    'name': ['Wuhan International Plaza', 'Huanan Seafood Market', 'Hankou Railway Station', 'Wuhan CDC'],
    'lat': [30.584, 30.6195, 30.618, 30.612],
    'lon': [114.271, 114.2577, 114.250, 114.265]
})
df_cases.to_csv('wuhan_cases.csv', index=False)
df_pois.to_csv('wuhan_pois.csv', index=False)
print("Data Generated! You now have 'wuhan_cases.csv' and 'wuhan_pois.csv'.")
cases = df_cases
pois = df_pois
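The spread of 0.005 passed to `np.random.normal` above is in degrees, which is hard to picture. The short sketch below converts it to meters at the market's latitude, assuming a spherical Earth with a mean radius of 6,371 km. `degrees_to_meters` is a hypothetical helper, not part of the exercise files.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean radius, spherical approximation


def degrees_to_meters(delta_deg, at_lat_deg):
    """Convert an angular offset to (north-south, east-west) meters."""
    meters_per_deg = math.pi * EARTH_RADIUS_M / 180.0
    north_south = delta_deg * meters_per_deg
    # Lines of longitude converge toward the poles, so scale by cos(latitude)
    east_west = north_south * math.cos(math.radians(at_lat_deg))
    return north_south, east_west


ns_m, ew_m = degrees_to_meters(0.005, 30.6195)
print(round(ns_m), round(ew_m))
```

Under these assumptions one standard deviation comes out at roughly 556 m north-south and 478 m east-west, so the synthetic cluster is slightly narrower in the east-west direction.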
Students can now start from this boilerplate. Their goal is to visualize the data and answer the Investigation Questions below.
import pandas as pd
import folium
from folium.plugins import HeatMap
# LOAD DATA
cases = pd.read_csv('wuhan_cases.csv')
pois = pd.read_csv('wuhan_pois.csv')
# INITIALIZE MAP
# Center on Wuhan
m = folium.Map(location=[30.6, 114.3], zoom_start=13, tiles='cartodbpositron')
# TASK 1: Create a Heatmap of the 'cases'
heat_data = cases[['lat', 'lon']].values.tolist()
HeatMap(heat_data, radius=12, blur=15).add_to(m)
# TASK 2: Add markers for the POIs (Points of Interest)
# Use a different color to distinguish them from the 'heat'
for _, poi in pois.iterrows():
    folium.Marker(
        location=[poi['lat'], poi['lon']],
        popup=poi['name'],
        icon=folium.Icon(color='black', icon='question-sign')
    ).add_to(m)
m.save('wuhan_investigation.html')
m
Encourage students to tweak the radius and blur in the HeatMap function. If they set the radius to 2, the map will look like a scattered mess; if they set it to 50, the entire city will look like it’s on fire. Finding the “just right” visualization is part of the data viz craft!
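Because the picture depends on `radius` and `blur`, it can help to confirm the visual answer with a simple count. The sketch below bins cases into a grid, takes the centroid of the fullest cell, and reports the closest point of interest. The cell size of 0.01 degrees and the five sample cases are assumptions for illustration, and the helper names are hypothetical.

```python
import math
from collections import defaultdict


def densest_cell_centroid(points, cell_deg=0.01):
    """Return the mean position and size of the most populated grid cell."""
    cells = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append((lat, lon))
    members = max(cells.values(), key=len)
    lat_c = sum(p[0] for p in members) / len(members)
    lon_c = sum(p[1] for p in members) / len(members)
    return (lat_c, lon_c), len(members)


def closest_poi(center, pois):
    """Name of the POI nearest to `center` (flat-map approximation)."""
    lat0, lon0 = center
    scale = math.cos(math.radians(lat0))  # shrink longitude differences

    def dist_sq(item):
        _, (lat, lon) = item
        return (lat - lat0) ** 2 + ((lon - lon0) * scale) ** 2

    return min(pois.items(), key=dist_sq)[0]


# POI coordinates from the exercise; the cases here are a tiny made-up sample
pois = {
    "Wuhan International Plaza": (30.584, 114.271),
    "Huanan Seafood Market": (30.6195, 114.2577),
    "Hankou Railway Station": (30.618, 114.250),
    "Wuhan CDC": (30.612, 114.265),
}
cases = [
    (30.6195, 114.2577),
    (30.6181, 114.2570),
    (30.6188, 114.2583),
    (30.6172, 114.2590),
    (30.5500, 114.3500),
]
center, n_cases = densest_cell_centroid(cases)
suspect = closest_poi(center, pois)
print(suspect, n_cases)  # Huanan Seafood Market 4
```

With the generated files, the same functions could be fed `list(zip(cases['lat'], cases['lon']))` from the DataFrame. A fixed grid is a crude density estimate, so students should try more than one cell size before trusting the result.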