In programming, functions perform specific operations, taking inputs (arguments) and returning values.
For example, in Python you can call math.sqrt(9) from the standard math module (after import math) or the built-in round(10.232, 1); both return numerical results.
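As a small sketch of the two calls mentioned above: math.sqrt needs the math module to be imported first, while round is available without any import. The variable names are illustrative only.

```python
import math

root = math.sqrt(9)          # square root, returned as a float
rounded = round(10.232, 1)   # round to one decimal place

print(root)     # 3.0
print(rounded)  # 10.2
```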
To store values we assign them to variables (Python’s equivalent of R “objects”). For instance:
age = 21
print(age) # Outputs: 21
double_age = age * 2
This mirrors R’s assignment (age <- 21) but uses = in Python. Python supports various data types (numbers, strings, booleans) similar to R. Strings must be in quotes ("text"), and booleans are True / False. The closest analogue of R’s NA is None; pandas stores missing numeric values as NaN, and most pandas aggregations (such as mean) skip them by default.
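To make the handling of missing values concrete, here is a minimal sketch with made-up weights. It assumes pandas is installed; the Series name is hypothetical.

```python
import pandas as pd

# Made-up weights; None marks a missing measurement and becomes NaN
weights = pd.Series([15.0, None, 17.0])

print(weights.isna().tolist())       # [False, True, False]
print(weights.mean())                # 16.0, the missing value is skipped
print(weights.mean(skipna=False))    # nan, the missing value propagates
```

Passing skipna=False is closer to R's default behaviour, where mean() returns NA unless na.rm = TRUE is set.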
Common Python data structures: lists, dictionaries, and pandas.DataFrame for tabular data (similar to R’s data.frame / tibble). A DataFrame lets you work with columns and rows.
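The three structures can be sketched side by side. All values below are invented for illustration and are not taken from the finch dataset.

```python
import pandas as pd

wings = [67.0, 66.5, 70.0]                      # list: ordered values
bird = {"species": "G. fortis", "wing": 67.0}   # dict: key -> value lookup

# DataFrame built from a dict of equal-length columns
df = pd.DataFrame({
    "species": ["G. fortis", "G. fortis", "G. scandens"],
    "wing": wings,
})

print(bird["wing"])        # 67.0
print(df.shape)            # (3, 2): three rows, two columns
print(list(df.columns))    # ['species', 'wing']
```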
Download the data from the following link and save it in a new folder named data in your working directory: DATA
Example — read a CSV of Darwin’s finch measurements:
import pandas as pd
finches = pd.read_csv("data/finches.csv")
print(finches.head())
A DataFrame will have columns like species, weight, etc. Inspect the first few rows with finches.head(). Pandas automatically infers data types (numeric, string, etc.), similar to R’s tibble output.
Select columns using bracket notation:
subset = finches[['group', 'wing']]
print(subset.head())
Filter rows using boolean indexing. Example — keep only species G. fortis:
fortis = finches[finches['species'] == "G. fortis"]
print(fortis.shape) # e.g., (89, 12)
Filter by numeric condition:
heavy = finches[finches['weight'] > 18]
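Boolean indexing works by building a True/False Series and using it as a row mask. The sketch below, on made-up rows, also shows how two masks might be combined with & (and); the names demo, is_fortis and is_heavy are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the finch data
demo = pd.DataFrame({
    "species": ["G. fortis", "G. scandens", "G. fortis"],
    "weight": [16.0, 20.5, 19.0],
})

is_fortis = demo["species"] == "G. fortis"   # boolean Series, one value per row
is_heavy = demo["weight"] > 18

# & combines the masks element-wise; use | for "or" and ~ for "not"
heavy_fortis = demo[is_fortis & is_heavy]
print(heavy_fortis)
```

When the conditions are written inline, each one needs its own parentheses, for example demo[(demo["weight"] > 18) & (demo["species"] == "G. fortis")].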
Chaining (filter then select) with .loc:
result = finches.loc[finches.weight > 18, ['species','weight']]
This corresponds to R’s pipe-based workflows (e.g. finches %>% filter(weight > 18) %>% select(species, weight)).
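For a style closer to a pipe, one option is method chaining with query followed by column selection. This is only a sketch on made-up rows; finches_demo and result_demo are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for the finch data
finches_demo = pd.DataFrame({
    "species": ["G. fortis", "G. scandens", "G. fortis"],
    "weight": [16.0, 20.5, 19.0],
    "wing": [66.0, 72.0, 69.0],
})

# filter, then select, written as one chain (one step per line)
result_demo = (
    finches_demo
    .query("weight > 18")
    .loc[:, ["species", "weight"]]
)
print(result_demo)
```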
Exercises:
Load finches.csv with pandas. Print the first 5 rows.
Create a new column weight_kg = weight / 1000. Verify it by displaying the columns and first few rows.
Add or transform columns:
finches['weight_kg'] = finches['weight'] / 1000
Reorder columns (example: move weight_kg so it sits right after weight):
cols = list(finches.columns)
cols.insert(cols.index('weight')+1, cols.pop(cols.index('weight_kg')))
finches = finches[cols]
This mimics R’s mutate() and relocate() patterns.
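An alternative sketch of the same mutate-then-relocate pattern uses assign() to add the column and insert() to place it. The data is made up and demo is a hypothetical name.

```python
import pandas as pd

demo = pd.DataFrame({
    "species": ["G. fortis", "G. scandens"],
    "weight": [16.0, 20.0],   # grams, made-up values
    "wing": [66.0, 72.0],
})

# assign() returns a new DataFrame with the extra column appended at the end
demo = demo.assign(weight_kg=demo["weight"] / 1000)

# pop the new column and re-insert it immediately after 'weight'
position = demo.columns.get_loc("weight") + 1
demo.insert(position, "weight_kg", demo.pop("weight_kg"))

print(list(demo.columns))  # ['species', 'weight', 'weight_kg', 'wing']
```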
Count observations per species:
counts = finches.groupby('species').size().reset_index(name='count')
print(counts)
Summary statistics (mean/median/min/max) per species:
summary = finches.groupby('species')['weight'].agg(['mean','median','min','max']).reset_index()
summary.columns = ['species','avg_weight','median_weight','min_weight','max_weight']
print(summary)
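Named aggregation is another way to get readable column names without renaming them afterwards. A minimal sketch on made-up weights follows; demo and summary_demo are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    "species": ["G. fortis", "G. fortis", "G. scandens", "G. scandens"],
    "weight": [15.0, 17.0, 20.0, 22.0],   # made-up values
})

# each keyword is an output column: name=(input column, aggregation)
summary_demo = (
    demo.groupby("species")
    .agg(
        avg_weight=("weight", "mean"),
        median_weight=("weight", "median"),
        min_weight=("weight", "min"),
        max_weight=("weight", "max"),
    )
    .reset_index()
)
print(summary_demo)
```

This is close in spirit to R's summarise(avg_weight = mean(weight), ...).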
Count by species and group:
counts = finches.groupby(['species','group']).size().reset_index(name='n')
print(counts.head(6))
Pivot to wide format:
finches_wide = counts.pivot(index='species', columns='group', values='n').reset_index()
print(finches_wide)
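One thing to watch when pivoting counts: combinations that never occur have no row in the long table, so they come out as NaN in the wide one. The sketch below, on made-up rows, shows one way to turn those gaps into zeros.

```python
import pandas as pd

# Made-up rows; "G. scandens" has no "late_pointed" observations
demo = pd.DataFrame({
    "species": ["G. fortis", "G. fortis", "G. scandens"],
    "group": ["early_blunt", "late_pointed", "early_blunt"],
})

counts_demo = demo.groupby(["species", "group"]).size().reset_index(name="n")
wide = counts_demo.pivot(index="species", columns="group", values="n")

# the missing combination is NaN; replace it with 0 and restore integer counts
wide_filled = wide.fillna(0).astype(int)
print(wide_filled)
```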
Save transformed results:
finches_wide.to_csv("data/finches_wide.csv", index=False)
Using the finches DataFrame, group by species and compute the average and standard deviation of wing.
Use matplotlib + seaborn for static plots; plotly for interactive variants when desired.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data (assumes data/finches.csv has been downloaded)
finches = pd.read_csv("data/finches.csv")
Explore relationships between two numeric variables (example: beak depth vs beak length):
sns.scatterplot(data=finches, x='bdepth', y='blength')
plt.title("Finch beak length vs depth")
plt.show()
Color by categorical variable (e.g. species):
sns.scatterplot(data=finches, x='bdepth', y='blength', hue='species', alpha=0.7)
Use alpha to reduce overplotting.
Line plots are good for ordered or time-series data:
# Example assumes a 'year' column exists
sns.lineplot(data=finches, x='year', y='weight', hue='species')
plt.title("Finch weight over years")
plt.show()
Visualise distributions by category:
sns.boxplot(data=finches, x='species', y='weight')
sns.stripplot(data=finches, x='species', y='weight', color='blue', jitter=0.1, alpha=0.6)
plt.show()
Frequency distribution of a single variable:
sns.histplot(data=finches, x='bdepth', bins=10)
plt.show()
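To see how the number of bins changes the picture without drawing anything, the counts per bin can be inspected with np.histogram. This is a sketch on made-up depths; it is only meant to illustrate equal-width binning, not to describe what seaborn does internally.

```python
import numpy as np

# Made-up beak depths with one outlying value
depths = np.array([8.0, 8.2, 8.4, 9.0, 9.1, 9.2, 9.3, 11.0])

counts_2, edges_2 = np.histogram(depths, bins=2)
counts_4, edges_4 = np.histogram(depths, bins=4)

print(counts_2)  # two wide bins hide the gap in the data
print(counts_4)  # four narrower bins reveal an empty bin
```

With few bins the outlier is merged into a broad bar; with more bins the empty range between the main cluster and the outlier becomes visible.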
Exercises:
Make a boxplot of weight by group (e.g., "early_blunt", "late_pointed", etc.) for G. scandens only. Overlay the individual data points in a different color.
Plot a histogram of bdepth for finches recorded after 1983. Try different numbers of bins and comment on how the histogram shape changes.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load gapminder-like data
gap = pd.read_csv("data/gapminder_clean.csv")
Example: life expectancy vs income per person:
sns.scatterplot(data=gap, x='income_per_person', y='life_expectancy', hue='world_region')
plt.title("Life expectancy vs Income per person")
plt.show()
Scale point size by population (min-max normalised to the range 0 to 1):
sizes = (gap['population'] - gap['population'].min()) / (gap['population'].max() - gap['population'].min())
sns.scatterplot(data=gap, x='income_per_person', y='life_expectancy', hue='world_region', size=sizes, sizes=(20,200), alpha=0.7)
plt.show()
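The size calculation above is a min-max normalisation: the smallest value maps to 0 and the largest to 1. A small worked sketch with made-up populations:

```python
import pandas as pd

population = pd.Series([1_000_000, 5_000_000, 21_000_000])  # made-up values

# subtract the minimum, then divide by the range (max - min)
scaled = (population - population.min()) / (population.max() - population.min())
print(scaled.tolist())  # [0.0, 0.2, 1.0]
```

If every value in the column were identical, the range would be zero and the result would be NaN, so this approach assumes the column actually varies.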
Correlation heatmap of the numeric columns (excluding year):
numeric_cols = gap.select_dtypes(include='number').drop(columns=['year'])
corr_matrix = numeric_cols.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation matrix (Pearson)")
plt.show()
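Before reading a heatmap it can help to check what corr() returns on a tiny table. The sketch below uses made-up columns (income, life, children are hypothetical names) and compares the result against numpy's corrcoef.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0],
    "life": [50.0, 60.0, 70.0, 80.0],    # perfectly linear in income
    "children": [6.0, 5.0, 3.0, 2.0],    # decreases as income rises
})

corr = demo.corr()  # Pearson correlation by default
print(corr.round(2))

# same numbers from numpy, treating each column as one variable
reference = np.corrcoef(demo.to_numpy(), rowvar=False)
```

The matrix is symmetric with ones on the diagonal; values near +1 or -1 show up as the strongest colours in the heatmap.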
Find top and bottom 5 countries by income and compare:
top5 = gap.nlargest(5, 'income_per_person')
bot5 = gap.nsmallest(5, 'income_per_person')
combined = pd.concat([top5.assign(Group='Top 5'), bot5.assign(Group='Bottom 5')])
sns.boxplot(data=combined, x='Group', y='income_per_person')
plt.title("Income per person: Top 5 vs Bottom 5 countries")
plt.show()
Exercises:
Plot life_expectancy vs income_per_person with hue='main_religion'. Do you see any patterns? (Hint: try a log scale on the x-axis: plt.xscale('log').)
Select the top and bottom 5 countries by income_per_person. Create a bar chart or box plot comparing their life expectancy distributions.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
gap = pd.read_csv("data/gapminder_clean.csv")
Order regions by descending average income:
mean_income = gap.groupby('world_region')['income_per_person'].mean().sort_values(ascending=False)
ordered_regions = mean_income.index.tolist()
gap['world_region'] = pd.Categorical(gap['world_region'], categories=ordered_regions, ordered=True)
sns.barplot(data=gap, x='world_region', y='income_per_person', estimator=np.mean)
plt.xticks(rotation=45)
plt.title("Average income per region (ordered)")
plt.show()
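The ordering step can be checked without plotting. Note that sort_values sorts ascending by default, so a highest-to-lowest order needs ascending=False. The sketch uses made-up regions "a", "b", "c" and hypothetical values.

```python
import pandas as pd

demo = pd.DataFrame({
    "world_region": ["a", "a", "b", "b", "c"],
    "income_per_person": [10.0, 20.0, 50.0, 70.0, 30.0],  # made-up values
})

# mean per region, highest first
mean_income = (
    demo.groupby("world_region")["income_per_person"]
    .mean()
    .sort_values(ascending=False)
)
ordered = mean_income.index.tolist()
print(ordered)  # ['b', 'c', 'a']

# an ordered Categorical carries this order into plots that respect categories
demo["world_region"] = pd.Categorical(
    demo["world_region"], categories=ordered, ordered=True
)
print(list(demo["world_region"].cat.categories))
```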
Lollipop chart of income per person for countries in the Middle East and North Africa:
MEA = gap[gap.world_region == "middle_east_north_africa"].sort_values(by='income_per_person')
plt.figure(figsize=(8,6))
plt.hlines(y=MEA['country'], xmin=0, xmax=MEA['income_per_person'], color='gray', alpha=0.7)
plt.plot(MEA['income_per_person'], MEA['country'], "o")
plt.title("Income per person: Middle East & N. Africa")
plt.xlabel("Income per person")
plt.show()
Use grey to de-emphasise background data and highlight focal elements. One strategy is to lower the opacity of background elements (alpha).
Exercises:
Using gapminder_clean.csv, reorder the x-axis of a bar plot of income_per_person by region from highest to lowest. Then produce a bar plot of children_per_woman for countries in Europe, ordered descending by value.
Plot children_per_woman with countries ordered by that value.
Concepts adapted from: Visual Data Communication Cambio Training