What Is Unsupervised Learning? #35

### Concept

What Is Unsupervised Learning?

### One-sentence definition

Unsupervised learning finds structure — clusters, patterns, or compressed representations — in unlabelled data, letting the algorithm organise it without being told what to look for.

### What real data or case will you use to illustrate it?

The Benefits Denial audit (UCI Adult Census Income, 48,842 records) is the anchor. The explainer will run k-means clustering on the applicant feature space with the income label withheld, then show that the resulting clusters correlate strongly with sex, race, and national origin, the protected attributes the audit targets. This demonstrates that unsupervised methods can rediscover demographic groupings they were never given, which connects back to the proxy variable explainer (proxy-variables.md) and to the occupational-segregation proxies identified in fair.py (relationship, marital.status, occupation). At roughly 48k records, the dataset is large enough for the cluster/demographic correlation to be statistically robust and easy to visualise.
As an external case we will reference the well-documented use of ZIP-code clustering in insurance pricing, citing the ProPublica "Machine Bias" series and the Missouri Department of Insurance findings, both publicly available.
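
A minimal sketch of the planned illustration, under loud assumptions: the data below is synthetic (two made-up numeric features standing in for encoded proxies, and a made-up `groups` attribute), not the Adult dataset, and the small `kmeans` helper is a hand-written Lloyd's algorithm rather than whatever the explainer ends up using (a library implementation such as scikit-learn's `KMeans` would be the more likely choice). The point is only the shape of the analysis: cluster without the attribute, then cross-tabulate clusters against it.

```python
import random
from collections import Counter


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm with a deterministic farthest-first start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new.append(mean_point(members) if members else centroids[j])
        if new == centroids:  # assignments have stopped changing
            break
        centroids = new
    return labels


# Synthetic stand-in data (assumption): the hidden attribute shifts both
# features, the way a proxy variable would. It is never passed to kmeans().
rng = random.Random(0)
points, groups = [], []
for i in range(200):
    g = "A" if i % 2 == 0 else "B"
    centre = 0.0 if g == "A" else 3.0
    points.append((rng.gauss(centre, 0.7), rng.gauss(centre, 0.7)))
    groups.append(g)

labels = kmeans(points, k=2)

# Cross-tabulate cluster membership against the withheld attribute.
table = Counter(zip(labels, groups))
for c in sorted(set(labels)):
    print(f"cluster {c}: A={table[(c, 'A')]}  B={table[(c, 'B')]}")

# Purity: share of records that sit in a cluster dominated by their own group.
purity = sum(max(table[(c, "A")], table[(c, "B")]) for c in set(labels)) / len(points)
print(f"purity = {purity:.2f}")
```

On the real data the same cross-tabulation would be run once per protected attribute (sex, race, national origin). A purity near the share of the largest group would suggest the clusters are not tracking that attribute, while a value near 1 would suggest they are.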

### What are the limitations or trade-offs of this concept?

With no ground truth, there is no label-based evaluation metric. Internal measures such as the silhouette score describe how compact and well separated the clusters are, but they cannot tell you whether a discovered structure is meaningful or artefactual.
The practitioner must choose the number of clusters and a distance metric; both choices embed assumptions about what similarity means, and those assumptions can encode demographic bias before a single data point is processed.
Because there are no labels, disparate impact is much harder to audit than in supervised settings — a clustering that looks feature-neutral can still sort people along demographic lines, as the illustration above shows.
Dimensionality reduction (PCA, UMAP) used in preprocessing can suppress variance that is fairness-relevant, compressing away the signal needed to detect discrimination downstream.
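
The dimensionality-reduction point can be sketched on synthetic data. Everything here is an assumption made for illustration: two invented features, an invented binary `groups` attribute, and a hand-rolled two-dimensional PCA (closed-form eigen-decomposition of the 2x2 covariance matrix) in place of a library call. The features are deliberately left unscaled; standardising them first would change the outcome, which is itself one of the preprocessing choices at issue.

```python
import math
import random

# Synthetic data (assumption): x1 is noisy and unrelated to the group,
# x2 varies little overall but differs systematically between groups.
rng = random.Random(1)
rows, groups = [], []
for i in range(400):
    g = i % 2
    x1 = rng.gauss(0.0, 5.0)
    x2 = rng.gauss(0.5 if g else -0.5, 0.2)
    rows.append((x1, x2))
    groups.append(g)


def mean(values):
    return sum(values) / len(values)


# Centre the data and build the sample covariance matrix [[a, b], [b, c]].
mx, my = mean([r[0] for r in rows]), mean([r[1] for r in rows])
centred = [(x - mx, y - my) for x, y in rows]
n = len(centred)
a = sum(x * x for x, _ in centred) / (n - 1)
c = sum(y * y for _, y in centred) / (n - 1)
b = sum(x * y for x, y in centred) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (a + c) / 2
root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + root, half_trace - root

# Leading eigenvector (first principal axis) and its perpendicular.
if b != 0:
    v = (lam1 - c, b)
else:
    v = (1.0, 0.0) if a >= c else (0.0, 1.0)
norm = math.hypot(v[0], v[1])
pc1 = (v[0] / norm, v[1] / norm)
pc2 = (-pc1[1], pc1[0])


def project(axis):
    """Scores of every centred row along one axis."""
    return [x * axis[0] + y * axis[1] for x, y in centred]


def group_gap(scores):
    """Absolute difference in group means, in units of the scores' overall spread."""
    g0 = [s for s, g in zip(scores, groups) if g == 0]
    g1 = [s for s, g in zip(scores, groups) if g == 1]
    spread = math.sqrt(mean([s * s for s in scores]))  # scores are already centred
    return abs(mean(g1) - mean(g0)) / spread


explained = lam1 / (lam1 + lam2)
gap_kept = group_gap(project(pc1))     # component a 1-D reduction would keep
gap_dropped = group_gap(project(pc2))  # component it would discard

print(f"variance explained by PC1: {explained:.3f}")
print(f"group gap on kept component:    {gap_kept:.2f}")
print(f"group gap on dropped component: {gap_dropped:.2f}")
```

In this setup the first component explains nearly all of the variance, yet the group difference lives almost entirely in the component that a one-dimensional reduction throws away. An audit run only on the reduced data would have little signal left to detect.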

### Before you start

- [x] I've checked the explainers/ folder — this concept isn't already covered
- [x] I have real data or a documented external case to illustrate the concept (not a toy example)
