Train models for CPS-only variables on PUF clone half #561

@MaxGhenis

Problem

When building the extended CPS, the PUF clone half receives CPS-only variables (like retirement contributions) in one of two ways:

  1. Direct duplication from the CPS donor record (else branch in puf_clone_dataset)
  2. PUF override via OVERRIDDEN_IMPUTED_VARIABLES (e.g. pre_tax_contributions)

Neither approach preserves the relationship between these variables and income. A PUF clone with $0 wages can end up with $50k in 401(k) contributions, because there's no model linking contributions to the income variables that are common between CPS and PUF.

This creates implausible records and makes calibration harder: reweighting can't calibrate away a structural data quality issue.
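
A minimal sketch of a check that would surface such records, assuming each record is a dict keyed by variable name. The function name and the `self_employment_income` key are hypothetical, and the sample values are made up for illustration; this is not code from the repository.

```python
def implausible_contribution_records(records):
    """Return records whose pre-tax contributions exceed their earned income."""
    flagged = []
    for rec in records:
        # Earned income = wages plus (hypothetical key) self-employment income.
        earned = rec["employment_income"] + rec.get("self_employment_income", 0.0)
        if rec["pre_tax_contributions"] > earned:
            flagged.append(rec)
    return flagged


# Illustrative records: the first mirrors the $0 wages / $50k contributions case.
sample = [
    {"employment_income": 0.0, "pre_tax_contributions": 50_000.0},
    {"employment_income": 120_000.0, "pre_tax_contributions": 10_000.0},
]
flagged = implausible_contribution_records(sample)
```

Counting flagged records before and after any change to the imputation would give a quick signal of whether the income relationship is being respected.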

Proposed solution

For variables that exist in the CPS but not the PUF, train predictive models on the CPS half that predict each variable from features common to both datasets:

Common predictors (available in both datasets):

  • Wages/salary income (employment_income)
  • Self-employment income
  • Interest/dividend income
  • Age
  • Filing status
  • Number of dependents
  • Social Security income (sub-components)
  • Pension/retirement income

Variables to model (examples):

  • pre_tax_contributions (retirement contributions; see #553, "Add calibration targets for retirement contributions")
  • traditional_401k_contributions
  • traditional_ira_contributions
  • roth_401k_contributions
  • self_employed_pension_contributions
  • Other CPS-only variables currently in OVERRIDDEN_IMPUTED_VARIABLES that should respect income relationships

Approach

  1. On the CPS half (which has both income variables and CPS-only variables), train lightweight models (e.g. quantile regression, gradient boosting) predicting each CPS-only variable from the common features
  2. Apply these models to the PUF clone half, using the PUF-derived income values as inputs
  3. As a result, a PUF clone with high wages should get plausible retirement contributions, and one with $0 wages should get $0 contributions
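
The steps above can be sketched as follows. To stay dependency-free, this sketch swaps in a nearest-neighbor hot-deck draw on a single predictor (wages) in place of the quantile regression or gradient boosting models named above; a real implementation would use all the common predictors. The function names and the synthetic data are assumptions made for illustration only.

```python
import bisect
import random


def fit_neighbor_model(cps_wages, cps_values):
    """Store CPS donors sorted by wage so neighbors can be found by bisection."""
    pairs = sorted(zip(cps_wages, cps_values))
    return [w for w, _ in pairs], [v for _, v in pairs]


def impute(model, puf_wages, k=5, seed=0):
    """Draw each value from a window of k CPS donors adjacent in wage order."""
    wages, values = model
    rng = random.Random(seed)
    k = min(k, len(wages))
    imputed = []
    for wage in puf_wages:
        pos = bisect.bisect_left(wages, wage)
        # Center the window on the insertion point, clamped to the donor range.
        start = min(max(0, pos - k // 2), len(wages) - k)
        imputed.append(rng.choice(values[start:start + k]))
    return imputed


# Step 1: "train" on the CPS half (synthetic: contributions are 5% of wages).
cps_wages = [0.0] * 10 + [20_000.0 * i for i in range(1, 11)]
cps_contrib = [0.0] * 10 + [0.05 * w for w in cps_wages[10:]]
model = fit_neighbor_model(cps_wages, cps_contrib)

# Step 2: apply to the PUF clone half using PUF-derived wages as inputs.
puf_wages = [0.0, 150_000.0, 500_000.0]
imputed = impute(model, puf_wages)
```

Drawing from nearby donors rather than predicting a conditional mean keeps the spread of the imputed variable, which is the same motivation for using quantile-based models here. In this synthetic data every zero-wage donor has zero contributions, so a zero-wage PUF clone is imputed zero; with real CPS data that outcome depends on what the donors look like.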

Related
