Add public UK calibrated transfer dataset #321
baogorek left a comment
Just a few comments. I'm still not 100% sure what's going on here, regarding using CPS households from the US for UK estimation, but it looks interesting.
Also, do you really want to commit policyengine_uk_data/storage/enhanced_cps_2025.h5?
…transfer-dataset-pr # Conflicts: # README.md # uv.lock
On the committed |
Summary
- Add an `enhanced_cps_2025` transfer dataset built from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS
- Map the households into `UKSingleYearDataset`, align additional UK-facing inputs, and recalibrate household weights against the UK target registry
- Add `policybench_transfer` aliases and validation tests

Details
This PR introduces a public UK calibrated transfer dataset intended as the first step in a broader cross-country public-microdata strategy.
The merged source manifest contains 28,532 households. The source households come from PolicyBench-compatible PolicyEngine US Enhanced CPS records, then are mapped into UK inputs with synthetic geography plus additional UK-facing alignment for:

The resulting dataset is recalibrated to the UK national / region / country target registry used by the loss pipeline.
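The recalibration step can be pictured with a small, dependency-free sketch. It is not the code in `policyengine_uk_data/utils/reweight.py`: the function names `relative_errors` and `calibrate_weights`, the mean-squared-relative-error loss, the plain gradient descent on log-weights, and the toy numbers are all assumptions made for illustration.

```python
import math


def relative_errors(contributions, targets, weights):
    """Signed relative error of each weighted estimate against its target."""
    estimates = [
        sum(w * row[j] for w, row in zip(weights, contributions))
        for j in range(len(targets))
    ]
    return [(e - t) / t for e, t in zip(estimates, targets)]


def calibrate_weights(contributions, targets, initial_weights, steps=5000, lr=0.05):
    """Fit positive household weights by gradient descent on log-weights.

    contributions[i][j] is household i's contribution to target j.
    The loss is the mean squared relative error across targets.
    """
    n_targets = len(targets)
    log_w = [math.log(w) for w in initial_weights]
    for _ in range(steps):
        weights = [math.exp(v) for v in log_w]
        rel_err = relative_errors(contributions, targets, weights)
        for i, row in enumerate(contributions):
            # d(loss)/d(w_i) = (2 / n_targets) * sum_j row[j] * rel_err[j] / targets[j]
            grad_w = 2.0 / n_targets * sum(
                c * r / t for c, r, t in zip(row, rel_err, targets)
            )
            # Chain rule: d(w_i)/d(log w_i) = w_i.
            log_w[i] -= lr * grad_w * weights[i]
    return [math.exp(v) for v in log_w]


# Toy data (hypothetical): column 0 counts households, column 1 flags a benefit.
contributions = [[1, 0], [1, 1], [1, 0], [1, 1]]
targets = [100.0, 30.0]
raw_weights = [10.0, 10.0, 10.0, 10.0]
calibrated = calibrate_weights(contributions, targets, raw_weights)
```

Optimizing log-weights keeps every household weight positive without an explicit constraint, which is one common reason to parameterize a reweighting problem this way.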
Files added or updated include:
- `policyengine_uk_data/datasets/enhanced_cps.py`
- `policyengine_uk_data/datasets/policybench_transfer.py`
- `policyengine_uk_data/utils/reweight.py`
- `policyengine_uk_data/storage/enhanced_cps_source_2025.csv`
- `policyengine_uk_data/storage/enhanced_cps_2025.h5`
- `policyengine_uk_data/storage/policybench_transfer_2025.h5`
- `policyengine_uk_data/tests/test_policybench_transfer.py`

Loss Comparison
The defensible comparison is that calibration materially improves the raw transfer dataset. It is not evidence that the public transfer dataset is better than enhanced FRS.
A fresh apples-to-apples check using the 2025 target matrix gives:
Those figures exclude the one zero-target/nonfinite relative-error row. Including it makes the enhanced FRS raw mean infinite because `slc/plan_5_borrowers_above_threshold` has target `0` and a nonzero estimate; that is a denominator artifact, not a substantive data-quality result.

Interpretation:
- `enhanced_cps` materially improves on the raw transfer dataset
- `enhanced_frs` still has tighter central fit and many more targets within 10%

Validation
- `python3 -m pytest policyengine_uk_data/tests/test_policybench_transfer.py`
- `.h5` artifacts and verified `dataset.validate()`

Caveat
This is not a replacement for the FRS or enhanced FRS. It is a public calibrated transfer dataset built from open US microdata, so some UK-only supports are still missing relative to the full FRS-based pipeline.
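
To make the zero-target caveat from the loss comparison concrete, here is a minimal sketch of how a single zero-target row drives a mean relative error to infinity, and how excluding nonfinite rows restores a usable summary. The helper names and all numbers are hypothetical; only the target name `slc/plan_5_borrowers_above_threshold` comes from this PR, and the estimate paired with it here is made up.

```python
import math


def relative_error(estimate, target):
    """Absolute relative error; infinite when the target is 0 but the estimate is not."""
    if target == 0:
        return 0.0 if estimate == 0 else math.inf
    return abs(estimate - target) / abs(target)


def summarize(rows, tolerance=0.10):
    """Summarize fit with and without nonfinite relative-error rows."""
    errors = {name: relative_error(est, tgt) for name, (est, tgt) in rows.items()}
    finite = [e for e in errors.values() if math.isfinite(e)]
    return {
        "mean_all": sum(errors.values()) / len(errors),
        "mean_finite": sum(finite) / len(finite),
        "share_within_tolerance": sum(e <= tolerance for e in finite) / len(finite),
        "excluded": [n for n, e in errors.items() if not math.isfinite(e)],
    }


# Illustrative (estimate, target) pairs, not figures from the PR.
rows = {
    "toy/target_a": (105.0, 100.0),
    "toy/target_b": (80.0, 100.0),
    "slc/plan_5_borrowers_above_threshold": (12.0, 0.0),
}
summary = summarize(rows)
```

Reporting the excluded row names alongside the finite-only mean keeps the exclusion visible rather than silently dropping a target.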