
Add public UK calibrated transfer dataset #321

Merged: MaxGhenis merged 15 commits into main from codex/policybench-uk-transfer-dataset-pr on Apr 26, 2026


@MaxGhenis MaxGhenis commented Mar 29, 2026

Summary

  • add a public calibrated enhanced_cps_2025 transfer dataset built from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS
  • map those shared households into a UKSingleYearDataset, align additional UK-facing inputs, and recalibrate household weights against the UK target registry
  • keep backward-compatible policybench_transfer aliases and add validation tests

Details

This PR introduces a public UK calibrated transfer dataset intended as the first step in a broader cross-country public-microdata strategy.

The merged source manifest contains 28,532 households. The source households come from PolicyBench-compatible PolicyEngine US Enhanced CPS records, then are mapped into UK inputs with synthetic geography plus additional UK-facing alignment for:

  • council tax bands
  • vehicle ownership
  • pensions
  • disability / PIP status
  • consumption / fuel spending
  • capital gains

The resulting dataset is recalibrated to the UK national / region / country target registry used by the loss pipeline.
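
As a rough illustration of what recalibrating household weights against a target registry involves, the sketch below runs plain gradient descent on log-weights to reduce the mean squared relative error against a small set of targets. It is not the code in policyengine_uk_data/utils/reweight.py, which is not shown in this conversation; the function names, the toy contribution matrix, and the step size are all hypothetical.

```python
import numpy as np


def mean_squared_relative_error(weights, contributions, targets):
    """Mean over targets of ((weighted estimate - target) / target) ** 2."""
    estimates = np.asarray(weights, dtype=float) @ np.asarray(contributions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(((estimates - targets) / targets) ** 2))


def recalibrate_weights(weights, contributions, targets, steps=500, lr=0.05):
    """Hypothetical reweighting loop (an assumption, not the PR's implementation).

    contributions[i, j] is household i's contribution to target j.
    Optimising log-weights keeps every household weight strictly positive.
    """
    log_w = np.log(np.asarray(weights, dtype=float))
    X = np.asarray(contributions, dtype=float)
    t = np.asarray(targets, dtype=float)
    for _ in range(steps):
        w = np.exp(log_w)
        rel = (w @ X - t) / t
        # Gradient of mean(rel ** 2) with respect to log_w (chain rule through w).
        grad = w * (X @ (2.0 * rel / t)) / t.size
        log_w -= lr * grad
    return np.exp(log_w)


# Toy example: four households, two targets (household count, total income).
initial_weights = np.full(4, 1000.0)
contributions = np.array(
    [[1.0, 20_000.0], [1.0, 35_000.0], [1.0, 50_000.0], [1.0, 90_000.0]]
)
targets = np.array([5_000.0, 250_000_000.0])

loss_before = mean_squared_relative_error(initial_weights, contributions, targets)
new_weights = recalibrate_weights(initial_weights, contributions, targets)
loss_after = mean_squared_relative_error(new_weights, contributions, targets)
```

Comparing `loss_before` with `loss_after` mirrors the before-versus-after calibration check described under Validation, at toy scale.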

Files added or updated include:

  • policyengine_uk_data/datasets/enhanced_cps.py
  • policyengine_uk_data/datasets/policybench_transfer.py
  • policyengine_uk_data/utils/reweight.py
  • policyengine_uk_data/storage/enhanced_cps_source_2025.csv
  • policyengine_uk_data/storage/enhanced_cps_2025.h5
  • policyengine_uk_data/storage/policybench_transfer_2025.h5
  • policyengine_uk_data/tests/test_policybench_transfer.py

Loss Comparison

The defensible comparison is that calibration materially improves the raw transfer dataset. It is not evidence that the public transfer dataset is better than enhanced FRS.

A fresh apples-to-apples check using the 2025 target matrix gives:

| dataset | target year | mean abs rel error | median abs rel error | share within 10% |
| --- | --- | --- | --- | --- |
| Enhanced CPS transfer | 2025 | 0.330 | 0.192 | 0.285 |
| Enhanced FRS | 2025 | 0.252 | 0.056 | 0.654 |

Those figures exclude the one zero-target/nonfinite relative-error row. Including it makes the enhanced FRS raw mean infinite because slc/plan_5_borrowers_above_threshold has target 0 and a nonzero estimate; that is a denominator artifact, not a substantive data-quality result.
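
The exclusion rule above can be sketched as follows. This is a minimal illustration, assuming the summary statistics are computed over per-target absolute relative errors; the function name and the sample numbers are hypothetical and are not taken from the PR's loss pipeline.

```python
import numpy as np


def relative_error_summary(estimates, targets, tolerance=0.10):
    """Summarise absolute relative errors, dropping nonfinite rows.

    A target of 0 with a nonzero estimate gives an infinite relative error,
    and 0 / 0 gives NaN; both are excluded from the headline statistics.
    """
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        abs_rel_error = np.abs(estimates - targets) / np.abs(targets)
    finite = np.isfinite(abs_rel_error)
    kept = abs_rel_error[finite]
    return {
        "mean_abs_rel_error": float(kept.mean()),
        "median_abs_rel_error": float(np.median(kept)),
        "share_within_tolerance": float((kept <= tolerance).mean()),
        "rows_excluded": int((~finite).sum()),
        # The unfiltered mean, shown only to illustrate the denominator artifact.
        "raw_mean_abs_rel_error": float(abs_rel_error.mean()),
    }


# Made-up numbers: the last row has target 0 and a nonzero estimate.
summary = relative_error_summary(
    estimates=[105.0, 80.0, 250.0, 3.0],
    targets=[100.0, 100.0, 200.0, 0.0],
)
```

In this toy input the single zero-target row makes the raw mean infinite, while the filtered mean, median, and share are computed over the remaining three rows.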

Interpretation:

  • calibrated enhanced_cps materially improves on the raw transfer dataset
  • enhanced_frs still has tighter central fit and many more targets within 10%
  • aggregate target MARE is only one validation measure and does not establish joint-distribution quality

Validation

  • python3 -m pytest policyengine_uk_data/tests/test_policybench_transfer.py
  • generated the calibrated .h5 artifacts and verified dataset.validate()
  • ran UK microsimulation and compared target loss before vs after calibration

Caveat

This is not a replacement for the FRS or enhanced FRS. It is a public calibrated transfer dataset built from open US microdata, so some UK-only supports are still missing relative to the full FRS-based pipeline.

@MaxGhenis MaxGhenis changed the title from "Add public PolicyBench UK transfer dataset" to "Add calibrated UK enhanced CPS dataset" Mar 29, 2026

@baogorek baogorek left a comment

Just a few comments. I'm still not 100% sure what's going on here, regarding using CPS households from the US for UK estimation, but it looks interesting.

Also, do you really want to commit policyengine_uk_data/storage/enhanced_cps_2025.h5?

Comment thread policyengine_uk_data/datasets/enhanced_cps.py Outdated
@MaxGhenis MaxGhenis changed the title from "Add calibrated UK enhanced CPS dataset" to "Add public UK calibrated transfer dataset" Apr 26, 2026

@MaxGhenis (Contributor, Author) commented

On the committed enhanced_cps_2025.h5: yes, this PR currently treats it as a checked-in public artifact, with the source manifest also versioned. I kept the H5 committed, rebuilt it after the valid-leaf and FX-assumption fixes, and documented that generated metrics should be reported from named release artifacts rather than copied into the README. If we want this to be builder-only instead, the follow-up would be to remove the H5/source artifact from the PR and publish it through the normal release-data path instead.

@MaxGhenis MaxGhenis merged commit 9514dfb into main Apr 26, 2026
3 checks passed

2 participants