This module focuses on the principles and practices of effective data storytelling and communication, with an emphasis on multivariate visualisation, ethical considerations, and practical coding skills.
Reading: The Ethics of Data Visualization by Alberto Cairo
Data storytelling is the bridge between raw data analysis 📊 and meaningful action. While exploratory data analysis is about finding the signal in the noise, explanatory storytelling is about presenting that signal to stakeholders in a way that is clear, persuasive, and memorable.
Think of your data as the “facts” of a case. Without a narrative 📖, those facts are just a list. Storytelling provides the “argument” that tells the stakeholders why those facts matter to their specific business goals.
Narrative structure transforms a series of charts into a compelling argument. Instead of just showing data, we use a story arc to lead stakeholders through a journey of discovery. A classic framework for this is the Context-Complication-Resolution model.
This exercise is designed to shift students from “making charts” to “building a case.” By framing data points as characters, they learn to highlight the tension (the problem) and the resolution (the recommendation).
In this scenario, students act as Lead Data Analysts for Stream-It, a fictional video streaming service. Recent reports show a dip in revenue, and it’s their job to find the “Villain” causing the loss and the “Hero” that will save the quarter.
Your stakeholders are the Marketing and Product teams. They don’t want a 50-page technical report; they want to know:
Python code to generate a synthetic dataset with a hidden narrative:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate synthetic data
np.random.seed(42)
n_users = 1000
data = {
    'User_ID': range(n_users),
    'Subscription_Type': np.random.choice(['Basic', 'Premium', 'Family'], n_users),
    'Monthly_Charges': np.random.uniform(10, 30, n_users),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_users),
    'Churned': np.random.choice([0, 1], n_users, p=[0.7, 0.3]),
    'Customer_Support_Calls': np.random.poisson(2, n_users),
    'App_Engagement_Score': np.random.normal(50, 15, n_users)
}
df = pd.DataFrame(data)
# Inject the 'Villain': Higher churn for Basic users with high support calls
df.loc[(df['Subscription_Type'] == 'Basic') & (df['Customer_Support_Calls'] > 3), 'Churned'] = 1
# Inject the 'Hero': Users with high App_Engagement_Score almost never churn
df.loc[df['App_Engagement_Score'] > 70, 'Churned'] = 0
print(df.head())
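Before handing the dataset to students, it can help to confirm that the hidden narrative is actually recoverable. The sketch below rebuilds the same data with the same steps and compares churn rates across the two injected segments. The helper name `build_stream_it_data` and the two flag columns are assumptions made for illustration; they are not part of the original exercise.

```python
import numpy as np
import pandas as pd


def build_stream_it_data(n_users=1000, seed=42):
    """Rebuild the synthetic Stream-It dataset using the same steps as above."""
    np.random.seed(seed)
    df = pd.DataFrame({
        'User_ID': range(n_users),
        'Subscription_Type': np.random.choice(['Basic', 'Premium', 'Family'], n_users),
        'Monthly_Charges': np.random.uniform(10, 30, n_users),
        'Region': np.random.choice(['North', 'South', 'East', 'West'], n_users),
        'Churned': np.random.choice([0, 1], n_users, p=[0.7, 0.3]),
        'Customer_Support_Calls': np.random.poisson(2, n_users),
        'App_Engagement_Score': np.random.normal(50, 15, n_users),
    })
    # Villain: Basic users with more than 3 support calls are marked as churned.
    villain = (df['Subscription_Type'] == 'Basic') & (df['Customer_Support_Calls'] > 3)
    df.loc[villain, 'Churned'] = 1
    # Hero: highly engaged users are marked as retained (applied last).
    df.loc[df['App_Engagement_Score'] > 70, 'Churned'] = 0
    return df


df = build_stream_it_data()

# Hypothetical flag columns so the two segments can be compared directly.
df['Villain_Segment'] = (df['Subscription_Type'] == 'Basic') & (df['Customer_Support_Calls'] > 3)
df['Hero_Segment'] = df['App_Engagement_Score'] > 70

overall_churn = df['Churned'].mean()
villain_churn = df.loc[df['Villain_Segment'], 'Churned'].mean()
hero_churn = df.loc[df['Hero_Segment'], 'Churned'].mean()

print(f"Overall churn:         {overall_churn:.1%}")
print(f"Villain segment churn: {villain_churn:.1%}")
print(f"Hero segment churn:    {hero_churn:.1%}")
```

Because the Hero rule is applied after the Villain rule, any high-engagement users inside the Villain segment are reset to retained, so the Villain churn rate can fall slightly short of 100%.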
Students must create three specific visualizations that tell the story:
Goal: Use a bar chart or heatmap to show that churn isn’t happening everywhere—it’s concentrated.
Key variables: Subscription_Type and Customer_Support_Calls.
Goal: Translate the data into business impact.
Goal: Find a segment that is succeeding and turn that insight into a recommendation.
Key variables: App_Engagement_Score and Churned.
Students should be graded not just on the code, but on their annotations.
This model solution focuses on Explanatory Data Viz. Instead of just showing the data, we are going to use “Active Titles” and annotations to guide the stakeholder’s eye.
Below is the Python code using Seaborn and Matplotlib. You can share this with your students as the “Goal” they should strive for.
First, we ensure the environment is set up and the “Villain” and “Hero” are baked into the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set the storytelling theme
sns.set_theme(style="white")
plt.rcParams['font.family'] = 'sans-serif'
# 1. Setup (Data Generation)
np.random.seed(42)
n_users = 1000
data = {
    'Subscription_Type': np.random.choice(['Basic', 'Premium', 'Family'], n_users),
    'Monthly_Charges': np.random.uniform(10, 30, n_users),
    'Customer_Support_Calls': np.random.poisson(2, n_users),
    'App_Engagement_Score': np.random.normal(50, 15, n_users),
    'Churned': np.random.choice([0, 1], n_users, p=[0.7, 0.3])
}
df = pd.DataFrame(data)
# Inject the 'Villain': High churn for Basic users with >3 support calls
df.loc[(df['Subscription_Type'] == 'Basic') & (df['Customer_Support_Calls'] > 3), 'Churned'] = 1
# Inject the 'Hero': High engagement prevents churn
df.loc[df['App_Engagement_Score'] > 75, 'Churned'] = 0
The Story: We aren’t losing everyone; we are specifically failing our Basic tier users who need help.
# Create a pivot table for the heatmap
heatmap_data = df.groupby(['Subscription_Type', 'Customer_Support_Calls'])['Churned'].mean().unstack()
plt.figure(figsize=(10, 5))
sns.heatmap(heatmap_data, annot=True, cmap='Reds', fmt=".1f", cbar=False)
# Storytelling elements
plt.title("THE VILLAIN: Support Friction is Killing the 'Basic' Tier", fontsize=16, loc='left', pad=20)
plt.xlabel("Number of Customer Support Calls")
plt.ylabel("Subscription Plan")
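# Note: the 'Hero' rule is applied after the 'Villain' rule, so a few highly
# engaged Basic users are reset to retained and the rate may sit just under 100%.
plt.annotate('CRITICAL ZONE:\nBasic users with 4+ calls\nchurn at close to 100%.',
             xy=(5, 0.5), xytext=(7, 0.5),
             arrowprops=dict(facecolor='black', shrink=0.05))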
plt.show()
The Story: This isn’t just a “metric”—it is a direct hit to our monthly revenue.
# Calculate lost revenue
lost_revenue = df[df['Churned'] == 1].groupby('Subscription_Type')['Monthly_Charges'].sum()
plt.figure(figsize=(8, 6))
ax = sns.barplot(x=lost_revenue.index, y=lost_revenue.values, palette=['#ff9999', '#cccccc', '#cccccc'])
# Storytelling elements
plt.title("THE STAKES: We are losing $1,800+ Monthly in 'Basic' alone", fontsize=16, loc='left', pad=20)
plt.ylabel("Potential Monthly Revenue Lost ($)")
plt.xlabel("Subscription Tier")
sns.despine()
# Add data labels
for p in ax.patches:
    ax.annotate(f'${p.get_height():.0f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 9), textcoords='offset points', fontweight='bold')
plt.show()
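An "Active Title" is easier to trust when its headline number comes from the data rather than being typed in by hand. The sketch below builds such a title from a lost-revenue Series shaped like `lost_revenue` above. The function name `active_title` and the example totals are assumptions for illustration; the totals are made up and are not results from the dataset.

```python
import pandas as pd


def active_title(lost_revenue: pd.Series, tier: str) -> str:
    """Build a headline that states the finding, using the computed figure."""
    amount = lost_revenue[tier]
    share = amount / lost_revenue.sum()
    return (f"THE STAKES: '{tier}' churn costs ${amount:,.0f} a month "
            f"({share:.0%} of all churned revenue)")


# Hypothetical totals, for illustration only (not taken from the dataset above).
example_lost_revenue = pd.Series(
    {'Basic': 2500.0, 'Family': 1500.0, 'Premium': 1000.0},
    name='Monthly_Charges',
)

title = active_title(example_lost_revenue, 'Basic')
print(title)
```

In the chart code, the result could be passed straight to `plt.title(active_title(lost_revenue, 'Basic'), fontsize=16, loc='left', pad=20)` so the headline stays in step with the data if the seed or the injection rules change.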
The Story: High app engagement is our “shield.” If we can move users into the app, the “Villain” (support friction) loses its power.
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df[df['Churned'] == 0], x='App_Engagement_Score', fill=True, label='Retained', color='teal')
sns.kdeplot(data=df[df['Churned'] == 1], x='App_Engagement_Score', fill=True, label='Churned', color='red')
# Storytelling elements
plt.title("THE HERO: High App Engagement is a Churn Vaccine", fontsize=16, loc='left', pad=20)
plt.axvline(75, color='green', linestyle='--')
plt.text(76, 0.02, "THE HERO ZONE:\nScores >75 = Zero Churn", color='green', fontweight='bold')
plt.legend()
sns.despine()
plt.show()
We stripped the top and right chart borders (sns.despine()) and removed the color bar from the heatmap to keep the focus on the data.
Video by Scott Klemmer on storyboards
🤔 comic strip: show flow, how does user figure in this?
star people: how to draw people
Sequence: what steps are involved?
Helps get stakeholders on the same page.
Here is an example of a storyboard

Paper prototypes, transparencies and sticky notes
Digital mockups
High fidelity mockups (controlled experiments)
Storyboarding for data visualization is like writing a script 📽️ before filming a movie. It helps us map out the Sequence—the logical flow of insights—so stakeholders don’t get lost between charts. It moves the focus from “how do I code this?” to “what am I trying to say?”
In Python, we can simulate this “sketching” phase by having students create a Story Skeleton. Instead of rendering complex charts immediately, they define the “Panels” of their story using a data structure. This ensures the narrative holds up before they spend hours on formatting.
Here are three ways we could structure a Python-based storyboarding exercise:
Students define a StoryFrame class. They must “instantiate” 4-5 frames of their story, specifying the Sequence, the Persona (the “Star Person” 👤 viewing the data), and the Key Takeaway.
Students draw empty placeholder panels and use plt.text() to describe what the chart will show and where the annotations will go. This mimics the Paper Prototype 📝 approach.
The Narrative Audit 📋: Students take an existing set of charts and write a Python “wrapper” or function that prints out the transition logic between them (e.g., “Because we see [X] in Frame 1, we must investigate [Y] in Frame 2”).
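As a rough illustration of the StoryFrame idea, here is a minimal sketch using a dataclass. The field names (`sequence`, `persona`, `key_takeaway`, `chart_idea`), the helper `read_storyboard`, and the example frames are all assumptions; the exercise could define them differently.

```python
from dataclasses import dataclass


@dataclass
class StoryFrame:
    """One panel of a storyboard, drafted before any chart is coded."""
    sequence: int           # position of the panel in the story
    persona: str            # the "Star Person" viewing the data
    key_takeaway: str       # the single point the panel has to land
    chart_idea: str = ""    # optional note on the planned visual


def read_storyboard(frames):
    """Return one line per frame, ordered by sequence."""
    ordered = sorted(frames, key=lambda frame: frame.sequence)
    return [
        f"Frame {f.sequence} [{f.persona}] {f.key_takeaway} ({f.chart_idea})"
        for f in ordered
    ]


# Example frames based on the Stream-It scenario; the wording is illustrative.
storyboard = [
    StoryFrame(1, "Marketing lead", "Revenue dipped this quarter.", "line chart"),
    StoryFrame(3, "Product lead", "Highly engaged users rarely churn.", "density plot"),
    StoryFrame(2, "Product lead",
               "Churn is concentrated among Basic users with many support calls.", "heatmap"),
    StoryFrame(4, "Marketing lead",
               "Recommendation: move Basic users into the app.", "annotated summary"),
]

lines = read_storyboard(storyboard)
print("\n".join(lines))
```

Reading the printed lines aloud in order is a quick test of the Sequence: if one takeaway does not follow from the previous one, the gap shows up before any plotting code is written.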
A Narrative Audit focuses on the “connective tissue” between your data visualizations. In storyboarding, this ensures that the transition from one chart to the next feels like a logical progression rather than a random jump.
Think of it like a comic strip 🎞️: if Panel A shows a character at home and Panel B shows them on Mars, the reader needs a “transition” panel (the rocket ship 🚀) to understand how they got there. In data, this means explaining why a specific insight in Chart 1 leads us to investigate the metric in Chart 2.
In this exercise, students are given a Python script that generates three correct but disconnected charts. Their job is to perform an “audit” and write the narrative bridge that connects them.
Provide students with this “broken” narrative. The charts are technically fine, but the story is missing.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Sample Data: Website Traffic and Sales
data = pd.DataFrame({
    'Day': range(1, 8),
    'Visitors': [1000, 1100, 1050, 1200, 1500, 1600, 1550],
    'Bounce_Rate': [40, 42, 41, 39, 65, 68, 70],
    'Conversion_Rate': [5, 5, 4.8, 5.2, 2.1, 1.8, 1.5]
})


def plot_narrative_gap():
    # Chart 1: Traffic is growing
    plt.figure(figsize=(5, 3))
    sns.lineplot(data=data, x='Day', y='Visitors', marker='o')
    plt.title("Total Website Visitors")
    plt.show()

    # Chart 2: Bounce rate spiked
    plt.figure(figsize=(5, 3))
    sns.lineplot(data=data, x='Day', y='Bounce_Rate', color='red')
    plt.title("Bounce Rate Percentage")
    plt.show()

    # Chart 3: Conversion dropped
    plt.figure(figsize=(5, 3))
    sns.barplot(data=data, x='Day', y='Conversion_Rate')
    plt.title("Sales Conversion Rate")
    plt.show()


plot_narrative_gap()
Students must create a Python dictionary called narrative_audit. For each transition, they must identify:
narrative_audit = {
    "Transition_1_to_2": {
        "Observation": "Traffic is hitting record highs in the second half of the week.",
        "The Question": "Is this high-volume traffic actually high-quality traffic?",
        "Bridge": "To find out, we need to look at the **Bounce Rate** to see if people are sticking around."
    },
    "Transition_2_to_3": {
        "Observation": "Bounce rates nearly doubled as traffic increased.",
        "The Question": "How did this inability to retain users impact our bottom line?",
        "Bridge": "We will now examine **Conversion Rates** to quantify the cost of this technical friction."
    }
}
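One way to make the audit checkable is a small function that flags transitions with a missing part and prints the complete ones as connected sentences. This is a sketch only: the helper names `find_gaps` and `narrate` are assumptions, and the second transition below is deliberately left incomplete to show the check firing.

```python
REQUIRED_KEYS = ("Observation", "The Question", "Bridge")


def find_gaps(audit):
    """Return {transition: [missing keys]} for transitions that are incomplete."""
    gaps = {}
    for name, parts in audit.items():
        missing = [key for key in REQUIRED_KEYS if not str(parts.get(key, "")).strip()]
        if missing:
            gaps[name] = missing
    return gaps


def narrate(audit):
    """Join Observation, Question, and Bridge into one line per transition."""
    return [
        f"{name}: " + " ".join(parts[key] for key in REQUIRED_KEYS)
        for name, parts in audit.items()
    ]


narrative_audit = {
    "Transition_1_to_2": {
        "Observation": "Traffic is hitting record highs in the second half of the week.",
        "The Question": "Is this high-volume traffic actually high-quality traffic?",
        "Bridge": "To find out, we need to look at the Bounce Rate.",
    },
    "Transition_2_to_3": {
        "Observation": "Bounce rates rose sharply as traffic increased.",
        "The Question": "",  # deliberately blank to demonstrate the gap check
        "Bridge": "We will now examine Conversion Rates.",
    },
}

gaps = find_gaps(narrative_audit)
complete = {name: parts for name, parts in narrative_audit.items() if name not in gaps}

for line in narrate(complete):
    print(line)
print("Incomplete transitions:", gaps)
```

A check like this only confirms that each part has been written; whether the bridge is actually causal still needs a human reader.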
Instead of checking if the code runs, you are checking for Causality.

🥳 2 experts might figure it out, but the rest of the 8 billion people?
As shown in the figure below, overly complex visuals can fail to communicate outside a small expert audience.
Social media platforms have become central to modern communication, politics, and culture. However, they have also been associated with a range of potential harms, including misinformation, political polarization, mental health effects, and economic disruption.
In this assignment, you will analyze a synthetic dataset on social media harms across countries and use data visualization techniques to investigate patterns in the data. Based on your analysis, you will write a policy brief recommending whether a country should regulate, restrict, or ban social media platforms.
The goal of the assignment is not only to produce good visualizations, but also to interpret data critically and communicate policy implications clearly.
You are provided with a synthetic dataset containing indicators related to social media use and potential harms in different countries.
The dataset includes variables such as:
The dataset is synthetic, meaning it was generated artificially for the purpose of analysis and teaching. Treat it as if it were real data, but remember that conclusions are illustrative rather than factual.
Perform an initial exploration of the dataset.
You should:
Produce at least two visualizations showing patterns in the data.
Examples include:
Create at least three visualizations that illustrate different types of social media harms.
Examples of research questions you might explore include:
Your visualizations should:
Select one country from the dataset and conduct a deeper analysis.
You should:
Use visualizations to support your argument.
Write a short policy brief (800–1200 words) recommending one of the following actions for your chosen country:
Your policy brief should include:
A short paragraph summarizing your recommendation.
Use visualizations and data analysis to justify your position.
Discuss at least two possible policy approaches.
Explain which policy you recommend and why.
Discuss limitations of the dataset and your analysis.
Submit the following:
Code to generate synthetic data is here
# Generate the synthetic dataset with a fixed random seed.
import numpy as np
import pandas as pd
np.random.seed(42)
countries = [
    "United States", "United Kingdom", "India", "China", "Brazil", "Nigeria", "Russia", "Germany",
    "Australia", "Japan", "Sweden", "Mexico", "South Africa", "Turkey", "Egypt", "Saudi Arabia",
    "Indonesia", "Argentina", "Poland", "Vietnam"
]
regime_map = {
    "United States": "democracy",
    "United Kingdom": "democracy",
    "India": "democracy",
    "China": "authoritarian",
    "Brazil": "democracy",
    "Nigeria": "hybrid",
    "Russia": "authoritarian",
    "Germany": "democracy",
    "Australia": "democracy",
    "Japan": "democracy",
    "Sweden": "democracy",
    "Mexico": "hybrid",
    "South Africa": "hybrid",
    "Turkey": "hybrid",
    "Egypt": "authoritarian",
    "Saudi Arabia": "authoritarian",
    "Indonesia": "democracy",
    "Argentina": "democracy",
    "Poland": "democracy",
    "Vietnam": "authoritarian"
}
n = len(countries)
population_m = np.random.uniform(5, 330, size=n).round(1)
internet_penetration = np.clip(np.random.normal(70, 15, n), 20, 98).round(1)
social_media_penetration = np.clip(internet_penetration * np.random.uniform(0.6, 0.95, n), 10, 98).round(1)
avg_daily_time = np.clip(np.random.normal(95, 35, n), 10, 400).round(1)
misinformation_index = np.clip(np.random.beta(2,5,n)*100 + (np.array([1 if regime_map[c]!="democracy" else 0 for c in countries])*10) + np.random.normal(0,6,n), 0, 100).round(1)
content_moderation_score = np.clip(np.random.normal(60, 18, n) - (np.array([1 if regime_map[c]=="authoritarian" else 0 for c in countries])*12), 5, 98).round(1)
censorship_level = np.clip(np.random.normal(25, 20, n) + (np.array([1 if regime_map[c]=="authoritarian" else 0 for c in countries])*45), 0, 100).round(1)
regulatory_strength = np.clip(np.random.beta(2,3,n) - (np.array([0.2 if regime_map[c]=="authoritarian" else 0 for c in countries])) + np.random.normal(0,0.05,n), 0, 1).round(2)
reported_harm_incidents_per_100k = np.clip((misinformation_index/100)*np.random.uniform(40,200,n) + (avg_daily_time/120)*np.random.uniform(5,50,n) + np.random.normal(0,10,n), 0, None).round(1)
youth_mental_health_decline_pct = np.clip((avg_daily_time/240)*np.random.uniform(5,35,n) + (misinformation_index/100)*np.random.uniform(2,15,n) + np.random.normal(0,2,n), 0, 50).round(2)
political_polarization_index = np.clip(np.random.normal(45,18,n) + (misinformation_index*0.15) - (content_moderation_score*0.1), 0, 100).round(1)
economic_dependency_pct = np.clip(np.random.normal(0.8,0.6,n) + (social_media_penetration/100)*np.random.uniform(0.1,1.5,n), 0, 8).round(2)
public_health_harm_score = np.clip(0.6*youth_mental_health_decline_pct + 0.2*(misinformation_index) + 0.2*(reported_harm_incidents_per_100k/10), 0, 100).round(1)
political_harm_score = np.clip(0.5*political_polarization_index + 0.4*misinformation_index + 0.1*censorship_level, 0, 100).round(1)
economic_harm_score = np.clip(0.5*economic_dependency_pct*10 + 0.3*(reported_harm_incidents_per_100k/20) + 0.2*(100-content_moderation_score)/10, 0, 100).round(1)
ban_risk_score_arr = np.clip(0.35*public_health_harm_score + 0.35*political_harm_score + 0.2*reported_harm_incidents_per_100k/10 + 0.1*(100* (1-regulatory_strength)), 0, 100).round(1)
df = pd.DataFrame({
    "country": countries,
    "population_m": population_m,
    "regime": [regime_map[c] for c in countries],
    "internet_penetration_pct": internet_penetration,
    "social_media_penetration_pct": social_media_penetration,
    "avg_daily_time_min": avg_daily_time,
    "misinformation_index_0_100": misinformation_index,
    "content_moderation_score_0_100": content_moderation_score,
    "censorship_level_0_100": censorship_level,
    "regulatory_strength_0_1": regulatory_strength,
    "reported_harm_incidents_per_100k": reported_harm_incidents_per_100k,
    "youth_mental_health_decline_pct": youth_mental_health_decline_pct,
    "political_polarization_index_0_100": political_polarization_index,
    "economic_dependency_pct_of_gdp": economic_dependency_pct,
    "public_health_harm_score_0_100": public_health_harm_score,
    "political_harm_score_0_100": political_harm_score,
    "economic_harm_score_0_100": economic_harm_score,
    "ban_risk_score_0_100": ban_risk_score_arr
})
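The analysis code further down refers to a `suggested_policy_action` column, but the generation code shown here never creates one, so those steps would raise a `KeyError` as written. The sketch below shows one possible way to derive it, assuming the three actions named in the assignment (regulate, restrict, ban) are assigned by terciles of `ban_risk_score_0_100`. The tercile rule is an assumption, not the original definition, and the demo scores are made up.

```python
import pandas as pd

# Small stand-in for the synthetic dataset; these scores are made up.
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "ban_risk_score_0_100": [27.3, 12.5, 41.9, 20.1, 35.2, 30.8],
})

# Assumption: lowest third -> regulate, middle third -> restrict, top third -> ban.
labels = ["regulate", "restrict", "ban"]
demo["suggested_policy_action"] = pd.qcut(
    demo["ban_risk_score_0_100"], q=3, labels=labels
).astype(str)

print(demo)
```

Applying the same `pd.qcut` call to `df` before running the analysis would let the grouping, cluster table, and top-5 steps run. Because terciles are relative, a third of the countries is always labelled "ban" no matter how low the scores are, which is worth raising with students as a limitation.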
# Analysis "solution" for the classroom exercise.
# Works on the synthetic DataFrame `df` built above and produces:
# 1) Descriptive statistics
# 2) Correlation matrix
# 3) Scatter plot with linear fit: daily time vs youth mental-health decline
# 4) Scatter plot with linear fit: misinformation vs political polarization
# 5) Counts of suggested_policy_action by regime (table)
# 6) K-means clustering (k=3) on harm scores and cluster centers
# 7) Top 5 countries by ban_risk_score
# The CSV-loading, table-display, and file-saving lines are left commented out.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
#from caas_jupyter_tools import display_dataframe_to_user
# Load dataset
#path = "/mnt/data/synthetic_social_media_harms.csv"
#df = pd.read_csv(path)
# 1) Descriptive statistics (selected columns)
desc_cols = [
    "internet_penetration_pct", "social_media_penetration_pct", "avg_daily_time_min",
    "misinformation_index_0_100", "reported_harm_incidents_per_100k", "youth_mental_health_decline_pct",
    "political_polarization_index_0_100", "public_health_harm_score_0_100", "political_harm_score_0_100",
    "economic_harm_score_0_100", "ban_risk_score_0_100"
]
desc = df[desc_cols].describe().round(2)
#display_dataframe_to_user("Descriptive statistics (selected columns)", desc.reset_index())
# 2) Correlation matrix
corr = df[desc_cols].corr().round(2)
#display_dataframe_to_user("Correlation matrix (selected harm & exposure variables)", corr.reset_index())
# 3) Scatter: avg_daily_time_min vs youth_mental_health_decline_pct with linear fit
x = df["avg_daily_time_min"].values
y = df["youth_mental_health_decline_pct"].values
coef = np.polyfit(x, y, 1)
poly1d = np.poly1d(coef)
plt.figure(figsize=(7,5))
plt.scatter(x, y)
plt.plot(np.sort(x), poly1d(np.sort(x)))
plt.xlabel("avg_daily_time_min")
plt.ylabel("youth_mental_health_decline_pct")
plt.title("Scatter: avg daily social media time vs youth mental-health decline")
plt.tight_layout()
#plt.savefig("/mnt/data/plot_time_vs_mental_health.png")
plt.show()
# Linear fit stats (R^2)
y_pred = poly1d(x)
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2_time = 1 - ss_res/ss_tot
# 4) Scatter: misinformation_index vs political_polarization_index with fit
x2 = df["misinformation_index_0_100"].values
y2 = df["political_polarization_index_0_100"].values
coef2 = np.polyfit(x2, y2, 1)
poly2 = np.poly1d(coef2)
y2_pred = poly2(x2)
ss_res2 = np.sum((y2 - y2_pred)**2)
ss_tot2 = np.sum((y2 - np.mean(y2))**2)
r2_misinfo = 1 - ss_res2/ss_tot2
plt.figure(figsize=(7,5))
plt.scatter(x2, y2)
plt.plot(np.sort(x2), poly2(np.sort(x2)))
plt.xlabel("misinformation_index_0_100")
plt.ylabel("political_polarization_index_0_100")
plt.title("Scatter: misinformation vs political polarization")
plt.tight_layout()
#plt.savefig("/mnt/data/plot_misinfo_vs_polarization.png")
plt.show()
# 5) Counts of suggested_policy_action by regime
# Note: this column must already exist in df; the generation code above does not create it.
counts = df.groupby(["regime","suggested_policy_action"]).size().unstack(fill_value=0)
#display_dataframe_to_user("Suggested policy action counts by regime", counts.reset_index())
# 6) K-means clustering on harm scores (public, political, economic)
harm_features = df[["public_health_harm_score_0_100","political_harm_score_0_100","economic_harm_score_0_100"]].copy()
scaler = StandardScaler()
harm_scaled = scaler.fit_transform(harm_features)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(harm_scaled)
df["harm_cluster"] = clusters
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_).round(2)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=harm_features.columns)
cluster_centers_df["cluster"] = cluster_centers_df.index
#display_dataframe_to_user("K-means cluster centers (k=3) on harm scores (original scale)", cluster_centers_df)
# Show cluster membership table (country -> cluster)
cluster_table = df[["country","regime","ban_risk_score_0_100","suggested_policy_action","harm_cluster"]].sort_values("ban_risk_score_0_100", ascending=False)
#display_dataframe_to_user("Countries with cluster membership and key metrics", cluster_table.reset_index(drop=True))
# 7) Top 5 countries by ban_risk_score
top5 = df.nlargest(5, "ban_risk_score_0_100")[["country","ban_risk_score_0_100","regime","suggested_policy_action"]]
#display_dataframe_to_user("Top 5 countries by ban risk score", top5.reset_index(drop=True))
# Save CSV report
report_csv = "/mnt/data/synthetic_solution_report.csv"
#df.to_csv(report_csv, index=False)
# Collect summary statistics for the write-up
summary = {
    "r2_time_vs_mental_health": round(r2_time, 3),
    "coef_time_vs_mental_health": coef.round(3).tolist(),
    "r2_misinfo_vs_polarization": round(r2_misinfo, 3),
    "coef_misinfo_vs_polarization": coef2.round(3).tolist(),
    "cluster_centers": cluster_centers_df.to_dict(orient="records"),
    "top5_list": top5.to_dict(orient="records"),
    "report_csv": report_csv,
    "plot_time_vs_mental_health": "/mnt/data/plot_time_vs_mental_health.png",
    "plot_misinfo_vs_polarization": "/mnt/data/plot_misinfo_vs_polarization.png"
}
print(summary)
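The R² values above are computed by hand as 1 - SS_res / SS_tot. For a straight-line least-squares fit with an intercept, this should equal the squared Pearson correlation, which gives a quick cross-check. The sketch below demonstrates this on made-up data (the variable names and the simulated relationship are assumptions, not values from the dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(30, 180, size=20)          # made-up minutes per day
y = 0.08 * x + rng.normal(0, 2, size=20)   # made-up outcome with noise

# Same recipe as the analysis code: fit a line, then compute R^2 by hand.
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Cross-check: squared Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(f"R^2 from residuals:   {r2:.6f}")
print(f"Squared correlation:  {r ** 2:.6f}")
```

With only 20 countries, either number is sensitive to one or two unusual points, so it is best reported alongside the scatter plot rather than on its own.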
Lovable
Replit
Cursor
Google AI Studio
Base44
The User Experience: A detailed look at the components of user experience design.