visualization_lecture

John Snow visualizations (Broad Street pump London)

John Snow’s 1854 map of the Soho cholera outbreak is a foundational case study in data visualization 🗺️. By plotting deaths as individual black bars at specific addresses, Snow provided spatial evidence that challenged the prevailing “miasma” theory (the belief that disease spread through “bad air”) and identified the Broad Street pump as the source of the contagion 🧪.

  1. Visual Encoding & Design Analysis: This track focuses on the “how.” You can teach students about Snow’s choice of marks (the bars) and channels (spatial position) ✒️. We can also look at the Voronoi diagram added to later versions, which used geometry to show which houses were mathematically closest to the Broad Street pump.
  2. Evidence-Based Storytelling: This track focuses on the “why.” It explores how Snow used data to pivot public health policy. It’s an excellent way to discuss the ethics of data ⚖️ and how a visualization can be a tool for advocacy rather than just a neutral report.
  3. Modern Technical Re-creation: This is a hands-on track where students use modern datasets to recreate Snow’s analysis. We can develop a lab guide for using tools like R (ggplot2), Python (Folium/GeoPandas), or GIS software to create heat maps and spatial joins 💻.

John Snow’s map reimagined

This video provides a modern geospatial walkthrough of how Snow’s data is visualized today, which can help your students see the connection between 19th-century methods and current technology.

Exercise

To get started with your lab, we will use a digitized version of the 1854 Soho data that includes modern GPS coordinates.

🗺️ The Data

The most reliable source for this exercise is the Robin Wilson dataset, which has been formatted into CSVs. You can read these directly into Pandas using the URLs below:

Data from Robin’s Blog


🐍 Boilerplate Python Code

This script handles the heavy lifting: it loads the data, centers a map on the historic Broad Street pump, and layers on the density.

Folium is a powerful Python library used to create interactive maps 🗺️. It acts as a bridge between Python’s data manipulation capabilities and Leaflet.js, a popular JavaScript library for mobile-friendly interactive maps.

With Folium, you can:

🛠️ Basic Intro Code

To get a map running, you only need a few lines. This example centers a map on the Broad Street Pump coordinates and adds a simple marker.

import folium

# 1. Create a Map object 
# 'location' takes [latitude, longitude]
# 'tiles' changes the background style
study_area = folium.Map(location=[51.5132, -0.1367], zoom_start=17, tiles="OpenStreetMap")

# 2. Add a simple Marker
folium.Marker(
    location=[51.5132, -0.1367],
    popup="Broad Street Pump",
    tooltip="Click for info",
    icon=folium.Icon(color="red", icon="info-sign")
).add_to(study_area)

# 3. Display the map
study_area

import pandas as pd
import folium
from folium.plugins import HeatMap

# 1. Load the data
#deaths = pd.read_csv("https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/deaths.csv")
#pumps = pd.read_csv("https://raw.githubusercontent.com/JimGrum/JohnSnow/master/data/pumps.csv")


import numpy as np

# 1. Generate Synthetic Data
# Define the main Broad Street Pump location
broad_st_pump = [51.5132, -0.1367]

# Create 50 deaths clustered tightly around the Broad Street pump
# np.random.normal adds a small 'jitter' to the coordinates
lat_cluster = np.random.normal(51.5132, 0.0005, 50)
lon_cluster = np.random.normal(-0.1367, 0.0005, 50)

# Create a small DataFrame for these synthetic deaths
deaths = pd.DataFrame({'Lat': lat_cluster, 'Lon': lon_cluster})

# Create a simple DataFrame for 2 pumps
pumps = pd.DataFrame({
    'Pump_Name': ['Broad Street Pump', 'Oxford Street Pump'],
    'Lat': [51.5132, 51.5150],
    'Lon': [-0.1367, -0.1350]
})

# 🛠️ Now plot! Without looking at the code below!

# 2. Initialize the map (Centered on Soho, London)
# Coordinates: 51.5132, -0.1367
m = folium.Map(location=[51.5132, -0.1367], zoom_start=17, tiles="cartodbpositron")

# 3. Add Pumps as markers
for _, pump in pumps.iterrows():
    folium.Marker(
        location=[pump['Lat'], pump['Lon']],
        popup=pump['Pump_Name'],
        icon=folium.Icon(color='blue', icon='tint')
    ).add_to(m)

# 4. Create the HeatMap
# Because we have individual records, we just need a list of [lat, lon] pairs.
heat_data = deaths[['Lat', 'Lon']].values.tolist()

# The 'radius' and 'blur' determine how the "heat" spreads between points
HeatMap(heat_data, radius=15, blur=20).add_to(m)

# 5. Display the map
m

🧪 Understanding the “Heat”

In this setup, we didn’t specify a “weight” for the points. Folium’s HeatMap simply looks at the coordinate list and says, “There is 1 death at this exact spot.” When ten rows have nearly identical coordinates, the color turns from cool blue to a “hot” red.

Since we are trying to prove a causal link between the pumps and the deaths, the visual contrast is key.

Looking at the code above, the radius and blur parameters in HeatMap are essentially your “statistical tuning knobs.” If you set the radius too high, the whole map becomes a red blob; too low, and it looks like a scattered rash.

How do you think changing the radius might affect your students’ ability to identify the specific pump responsible for the outbreak? 🧐


Interactive John Snow Map Tutorial This video demonstrates how to take raw CSV data and transform it into a dynamic Folium map, which is exactly what we are doing with the cholera records.

Questions

Visualizing Uncertainty: How we can use Python to show where the data might be “fuzzy” because of how it was digitized from a paper map 📜?

🎮🛠️ Advanced Exercise: Hunt for Epidemic Center in COVID-19

In this scenario, your students are “Digital Epidemiologists.” They have been handed a “noisy” dataset of hospital admissions and must determine if there is a single point of origin or if the spread is truly random.


🕵️‍♂️ The Mission: The Wuhan “Patient Zero” Hunt

The Backstory: It’s early January 2020. Hospitals across Wuhan are reporting a “pneumonia of unknown cause.” Your task is to map the first 500 reported cases. If John Snow was right, the “Pump” (the source) will be at the heart of the highest density cluster.

Step 1: Generate the Evidence (Synthetic Data)

Students will run this block first to create their “Evidence Files” (cases.csv and points_of_interest.csv).

import pandas as pd
import numpy as np

# 1. Set the "Hidden" Source: Huanan Seafood Market
# Coordinates: 30.6195, 114.2577
market_lat, market_lon = 30.6195, 114.2577

# 2. Generate 500 Synthetic Cases
# 70% of cases are tightly clustered around the market (The Source)
cluster_count = 350
cluster_lats = np.random.normal(market_lat, 0.005, cluster_count)
cluster_lons = np.random.normal(market_lon, 0.005, cluster_count)

# 30% are scattered randomly across the city (Community spread/noise)
noise_count = 150
noise_lats = np.random.uniform(30.50, 30.70, noise_count)
noise_lons = np.random.uniform(114.20, 114.40, noise_count)

# Combine into a DataFrame
df_cases = pd.DataFrame({
    'case_id': range(500),
    'lat': np.concatenate([cluster_lats, noise_lats]),
    'lon': np.concatenate([cluster_lons, noise_lons])
})

# 3. List of Potential "Sources" (The Scavenger Hunt Targets)
df_pois = pd.DataFrame({
    'name': ['Wuhan International Plaza', 'Huanan Seafood Market', 'Hankou Railway Station', 'Wuhan CDC'],
    'lat': [30.584, 30.6195, 30.618, 30.612],
    'lon': [114.271, 114.2577, 114.250, 114.265]
})

df_cases.to_csv('wuhan_cases.csv', index=False)
df_pois.to_csv('wuhan_pois.csv', index=False)
print("Data Generated! You now have 'wuhan_cases.csv' and 'wuhan_pois.csv'.")

cases = df_cases
pois = df_pois


Step 2: The Scavenger Hunt Challenge

Now, students can take this boilerplate. Their goal is to visualize the data and answer the Investigation Questions below.

import pandas as pd
import folium
from folium.plugins import HeatMap

# LOAD DATA
cases = pd.read_csv('wuhan_cases.csv')
pois = pd.read_csv('wuhan_pois.csv')

# INITIALIZE MAP
# Center on Wuhan
m = folium.Map(location=[30.6, 114.3], zoom_start=13, tiles='cartodbpositron')

# TASK 1: Create a Heatmap of the 'cases'
heat_data = cases[['lat', 'lon']].values.tolist()
HeatMap(heat_data, radius=12, blur=15).add_to(m)

# TASK 2: Add markers for the POIs (Points of Interest)
# Use a different color to distinguish them from the 'heat'
for _, poi in pois.iterrows():
    folium.Marker(
        location=[poi['lat'], poi['lon']],
        popup=poi['name'],
        icon=folium.Icon(color='black', icon='question-sign')
    ).add_to(m)

m.save('wuhan_investigation.html')
m


🔍 Investigation Questions for Students

  1. The “Hot Zone”: Looking at the heatmap, which of the four black markers sits directly in the center of the “red” zone?
  2. The Red Herring: One marker is near a major transportation hub (Hankou Station). Why might an epidemiologist mistake a transportation hub for a source?
  3. Data Noise: You see cases scattered far away from the center. Does this disprove the “Market Theory,” or does it represent a different stage of an outbreak? (Think: Secondary Transmission).
  4. The “Broad Street” Moment: In 1854, Snow removed the pump handle. If you were the health official in Wuhan based only on this map, what would be your first “emergency” recommendation?

💡 Pro-Tip for the Lab

Encourage students to tweak the radius and blur in the HeatMap function. If they set the radius to 2, the map will look like a scattered mess; if they set it to 50, the entire city will look like it’s on fire. Finding the “just right” visualization is part of the data viz craft!