Concept
What Is Unsupervised Learning?
One-sentence definition
Unsupervised learning finds structure — clusters, patterns, or compressed representations — in unlabelled data, letting the algorithm organise it without being told what to look for.
What real data or case will you use to illustrate it?
The Benefits Denial audit (UCI Adult Census Income, 48,842 records) is the anchor. The explainer will run k-means clustering on the applicant feature space with the income label withheld and show that the resulting clusters correlate strongly with sex, race, and national origin, the protected attributes the audit targets. This demonstrates that unsupervised methods can rediscover demographic groupings they were never given, which connects back to the proxy-variable explainer (proxy-variables.md) and the occupational-segregation proxies identified in fair.py (relationship, marital.status, occupation). With 48,842 records, the cluster/demographic correlation is statistically robust and easy to visualise.
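A minimal sketch of the kind of analysis described above, under loud assumptions: the data is synthetic (it is not the UCI Adult file), the two numeric features and their group-dependent centres are hypothetical stand-ins for occupation-like proxies, and k-means is written out in plain Python only to keep the example dependency-free. In practice a library implementation would likely be used. The protected attribute is held out of the clustering and only consulted afterwards.

```python
import math
import random
from collections import Counter

random.seed(0)

# Synthetic stand-in for the audit data (NOT the UCI Adult records).
# Hypothetical assumption: two numeric features whose distributions differ
# by sex, mimicking occupational segregation. Sex is never given to k-means.
def make_row(sex):
    centre = (45.0, 2.0) if sex == "Male" else (32.0, 7.0)
    return (random.gauss(centre[0], 3.0), random.gauss(centre[1], 1.0))

sexes = ["Male"] * 300 + ["Female"] * 300
rows = [make_row(s) for s in sexes]

def standardise(points):
    """Rescale each feature to zero mean and unit variance."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

labels = kmeans(standardise(rows), k=2)

# Post-hoc audit: cross-tabulate cluster membership against the held-out attribute.
crosstab = Counter(zip(labels, sexes))
male_share = {
    j: crosstab[(j, "Male")] / (crosstab[(j, "Male")] + crosstab[(j, "Female")])
    for j in range(2)
}
for j in range(2):
    print(f"cluster {j}: {male_share[j]:.0%} male")
```

On this toy data the two clusters split almost entirely along the held-out attribute, because the features were constructed to act as proxies for it. Whether the real audit data behaves the same way is what the explainer would need to establish.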
As an external case we will reference the well-documented use of ZIP-code clustering in insurance pricing, citing the ProPublica "Machine Bias" series and the Missouri Department of Insurance findings, both publicly available.
What are the limitations or trade-offs of this concept?
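Without ground-truth labels there is no external benchmark to score against: internal measures such as the silhouette score capture how compact and well separated the clusters are, not whether they mean anything, so you cannot tell from the algorithm alone whether a discovered structure is real or an artefact.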
The practitioner must choose the number of clusters and a distance metric; both choices embed assumptions about what similarity means, and those assumptions can encode demographic bias before a single data point is processed.
Because there are no labels, disparate impact is much harder to audit than in supervised settings — a clustering that looks feature-neutral can still sort people along demographic lines, as the illustration above shows.
Dimensionality reduction (PCA, UMAP) used in preprocessing can discard fairness-relevant structure: PCA, for example, drops low-variance directions even when they separate demographic groups, compressing away the signal needed to detect discrimination downstream.
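The dimensionality-reduction limitation can be made concrete with a small sketch. Everything here is an assumption for illustration: the data is synthetic, the names feature_a and feature_b and the groups "A" and "B" are hypothetical, and the features are deliberately left unscaled so that one dominates the variance. PCA is computed by hand using the closed-form principal axis of a 2x2 covariance matrix, which only works in two dimensions.

```python
import math
import random

random.seed(1)

# Synthetic, hypothetical data (not from the audit):
# feature_a has large variance unrelated to group membership;
# feature_b has small variance but its mean differs by group.
groups = ["A"] * 1000 + ["B"] * 1000
points = [
    (random.gauss(0.0, 10.0), random.gauss(1.0 if g == "A" else -1.0, 0.3))
    for g in groups
]

def mean(values):
    return sum(values) / len(values)

# Centre the data and form the entries of the 2x2 covariance matrix.
mx, my = mean([p[0] for p in points]), mean([p[1] for p in points])
centred = [(x - mx, y - my) for x, y in points]
sxx = mean([x * x for x, _ in centred])
syy = mean([y * y for _, y in centred])
sxy = mean([x * y for x, y in centred])

# In two dimensions the first principal axis has a closed-form angle.
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))    # direction of maximum variance
pc2 = (-math.sin(theta), math.cos(theta))   # orthogonal, low-variance direction

def group_gap(axis):
    """Gap between group means along an axis, in units of that axis's std."""
    scores = [x * axis[0] + y * axis[1] for x, y in centred]
    std = math.sqrt(mean([s * s for s in scores]))
    a = mean([s for s, g in zip(scores, groups) if g == "A"])
    b = mean([s for s, g in zip(scores, groups) if g == "B"])
    return abs(a - b) / std

kept_gap = group_gap(pc1)       # what survives a one-component PCA
discarded_gap = group_gap(pc2)  # what the reduction throws away
print(f"standardised group gap on PC1 (kept):      {kept_gap:.2f}")
print(f"standardised group gap on PC2 (discarded): {discarded_gap:.2f}")
```

In this construction the retained component shows almost no difference between groups while the discarded one separates them clearly, so an audit run only on the reduced data would miss the disparity. Standardising features before PCA changes which directions are kept, which is itself one of the practitioner choices listed above.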
Before you start