Add eval harness scaffold: spec, scenarios, fixtures dir by SakshiKekre · Pull Request #52 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-05-15T18:23:16Z

Summary

End-to-end eval harness for uk-chat: scenarios, fixtures, runner, grader, and a first run's results. Originally scoped as just the scaffold + scenarios — grew to cover the full pipeline because the follow-up PRs would have been small and the harness only earns its keep once you can actually run it.

Pre-committed thresholds in evals/SPEC.md; first eval pass against PR #51's preview written up in evals/RESULTS-2026-05-27.md.

What's in the PR

Spec + scenarios (evals/SPEC.md, evals/scenarios/)

Two-test design: Test A (open-ended, rubric-graded, supplement positioning) vs Test B (numeric, fixture-graded, alternative positioning).
9 hand-authored scenarios as YAML: 5 Test A + 4 Test B. B5 dropped — the reform (two-child limit removal) is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law. Drop documented in evals/fixtures/drift_report.md.

Runner (evals/runner/run.py)

POSTs each scenario N times to a configured chat backend, saves raw SSE + extracted text + manifest JSON per run.
--concurrency N for parallel scenarios (default sequential).
Captures tool_call_sequence, tool_call_counts_by_name, tool_failure_count, model_backend per run — needed for A/B'ing tool-registration changes.
Configurable via UK_CHAT_BACKEND_URL env or --backend-url flag; supports Vercel protection bypass.
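
As a rough illustration of the per-run tool capture described above, the sketch below summarises a list of already-parsed SSE events into the three tool fields. The event shape (`type`, `name`, `is_error`) and the function name `summarise_tool_events` are assumptions made for this example; the real stream schema is not shown in this PR description.

```python
from collections import Counter


def summarise_tool_events(events):
    """Summarise tool activity from parsed SSE event dicts.

    Hypothetical event shape (an assumption, not the real schema):
      {"type": "tool_call", "name": "run_python"}
      {"type": "tool_result", "name": "run_python", "is_error": True}
    """
    # Ordered list of tool names, in the order the model called them.
    sequence = [e["name"] for e in events if e.get("type") == "tool_call"]
    # Count tool results that were flagged as errors.
    failures = sum(
        1
        for e in events
        if e.get("type") == "tool_result" and e.get("is_error")
    )
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": failures,
    }


example_events = [
    {"type": "tool_call", "name": "run_python"},
    {"type": "tool_result", "name": "run_python", "is_error": True},
    {"type": "tool_call", "name": "run_python"},
    {"type": "tool_result", "name": "run_python", "is_error": False},
    {"type": "text", "content": "Here are the results."},
]
summary = summarise_tool_events(example_events)
```

Keeping the ordered sequence alongside the counts means a later A/B comparison can tell whether a typed tool was tried first or only after a fallback.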

Fixture builder (evals/runner/build_fixtures.py)

Builds Test B fixtures locally by running policyengine + policyengine_uk 2.88.20 against the EFRS 2023-24 dataset under real reform IDs (83092, 94906, 94910, 94938, 94911).
Pulls reform JSON from PE-API /uk/policy/<id> (read-only DB endpoint, unaffected by the May 12 API outage) rather than hand-rolling reform specs.
Filters each candidate field against Vahid's published blog figures with a 10% drift threshold; fields that drift more than 10% are dropped rather than tested against possibly-stale ground truth. Drift decisions documented in evals/fixtures/drift_report.md.
Uses production-aligned version stack: policyengine 0.13.0 + policyengine_uk 2.88.20 + policyengine_core 3.26.10. Separate requirements-fixtures.txt so the runner's runtime stays minimal.
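
The drift filter can be pictured roughly as follows. This is a minimal sketch that assumes drift is measured as relative difference against the published figure; the function name `filter_by_drift` and the field names and numbers in the example are illustrative, not taken from build_fixtures.py or the blog.

```python
def filter_by_drift(computed, published, max_drift=0.10):
    """Split fields into kept/dropped by relative drift from published figures.

    Assumption: drift = |computed - published| / |published|.
    Fields missing from `computed`, or with a published value of zero,
    are dropped with drift None because no relative drift can be computed.
    """
    kept, dropped = {}, {}
    for field, pub in published.items():
        if field not in computed or pub == 0:
            dropped[field] = None
            continue
        drift = abs(computed[field] - pub) / abs(pub)
        if drift <= max_drift:
            kept[field] = drift
        else:
            dropped[field] = drift
    return kept, dropped


# Illustrative numbers only (not real fixture or blog values).
computed = {"field_a": 105.0, "field_b": 80.0}
published = {"field_a": 100.0, "field_b": 100.0, "field_c": 50.0}
kept, dropped = filter_by_drift(computed, published)
```

Returning the dropped fields with their drift values, rather than discarding them silently, is what would make a drift report like the one mentioned above easy to generate.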

Grader (evals/runner/grade.py)

Test A: emits a per-response markdown grading sheet with rubric prompts. Human grades; --threshold-check aggregates afterwards.
Test B: extracts numbers from chat prose (markdown-table-aware + line-scan fallback), diffs against fixtures with per-field tolerances, checks self-consistency SD across runs, checks anchor must_mention / must_not_say for methodology drift, and applies the SPEC's pre-committed thresholds.
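
A minimal sketch of two of the Test B checks, the per-field tolerance diff and the self-consistency SD across runs. It assumes relative tolerances and a sample standard deviation; the actual definitions live in grade.py and SPEC.md and may differ. Both function names are hypothetical.

```python
import statistics


def within_tolerance(observed, expected, rel_tol):
    """Return True/False for a graded field, or None if nothing was extracted.

    Assumption: tolerance is relative to the fixture value.
    """
    if observed is None:
        return None  # not extracted; distinct from a wrong answer
    if expected == 0:
        return observed == 0
    return abs(observed - expected) / abs(expected) <= rel_tol


def self_consistency_sd(values):
    """Sample standard deviation of one field across repeated runs.

    Runs where the field was not extracted (None) are skipped; at least
    two extracted values are needed, otherwise None is returned.
    """
    present = [v for v in values if v is not None]
    return statistics.stdev(present) if len(present) >= 2 else None


# Illustrative values only.
ok = within_tolerance(102.0, 100.0, rel_tol=0.05)
missing = within_tolerance(None, 100.0, rel_tol=0.05)
sd = self_consistency_sd([10.0, None, 12.0, 14.0])
```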

Tool-usage aggregator (evals/runner/tool_usage.py) — rolls up tool_call_counts_by_name across an entire run directory. Used for comparing PR #51 (one execution tool) vs PR #55 (typed tools registered) to see whether Claude actually picks the typed surface.
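
The roll-up itself is simple to picture. The sketch below sums `tool_call_counts_by_name` over manifest rows that are assumed to be already loaded as dicts, then reports what share of calls went to anything other than a fallback tool. `rollup_tool_counts` and `non_fallback_share` are hypothetical names; `run_python` and `calculate_household` are the tool names mentioned in this PR's commit messages.

```python
from collections import Counter


def rollup_tool_counts(manifest_rows):
    """Sum per-run tool counts across all rows of one or more manifests."""
    total = Counter()
    for row in manifest_rows:
        total.update(row.get("tool_call_counts_by_name", {}))
    return total


def non_fallback_share(total, fallback_tool="run_python"):
    """Share of tool calls that did not go to the fallback tool."""
    calls = sum(total.values())
    if calls == 0:
        return None  # no tool calls at all; share is undefined
    return 1 - total.get(fallback_tool, 0) / calls


# Illustrative manifest rows (assumed shape).
rows = [
    {"tool_call_counts_by_name": {"run_python": 2}},
    {"tool_call_counts_by_name": {"run_python": 1, "calculate_household": 1}},
    {},  # a run with no tool metadata, e.g. an HTTP failure
]
total = rollup_tool_counts(rows)
share = non_fallback_share(total)
```

Running the same roll-up over two run directories and comparing the shares is one way to answer the question of whether the typed surface actually gets picked.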

Results from first pass (evals/RESULTS-2026-05-27.md)

Both tests fail by the pre-committed thresholds. Full numbers in the writeup.
Headline: Test B field accuracy 75% (need 95%); failure rate 67% (need <10%; most B-scenario timeouts are population-level questions hitting the 600s Modal HTTP timeout). Test A mean rubric 3.09 (need 4.0); fabrication 27% (need ≤20%); 10 trust-killer scores concentrated in A3 (multi-param what-if) and A5 (factual lookup) run 2.
The clean win: A4 (out-of-scope refusal) scores 5.00 on every run.
The clearest pattern: scenarios that need EFRS microdata + free-form Python time out; scenarios that need only schedule lookups complete reliably.

What's out of scope

Test A grading is human-only — the runner generates the grading sheet; a person fills it in. One grader's judgement; aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension, per-scenario means less so.
B3 extractor still false-negatives on some prose-embedded numbers — flagged in the writeup.
The eval doesn't yet cover structured-tool variants (PR #55, "Register the three dormant typed tools (UK)"); that's the next A/B run.

How to reproduce

# Run all scenarios against PR 51's preview
UK_CHAT_BACKEND_URL="..." python evals/runner/run.py --concurrency 4

# Grade
python evals/runner/grade.py evals/runs/<timestamp>

# After human fills in A_grading.md
python evals/runner/grade.py evals/runs/<timestamp> --threshold-check

Fixtures are pre-built and committed under evals/fixtures/pe_api/ — only re-build them if you bump policyengine_uk or want to retest drift against new published figures.

🤖 Generated with Claude Code

Scenarios

Test A — supplement positioning (chat seeded with a report's context, asked open-ended follow-ups):

ID	Title	Prompt (gist)
`a1_mechanism`	Mechanism explanation	Why does the top decile gain less in % terms (0.91%) than the 8th (1.56%) or 9th (1.54%)? Walk through the mechanism.
`a2_subset_slice`	Subset breakdown not in the report	How does the PA reform affect single parents with two children specifically — decile-by-decile gains in £?
`a3_multiparam_what_if`	Multi-parameter what-if the user invented	What if we also raised the higher-rate threshold from £50,270 to £55,000 alongside the PA raise? Compare budgetary impact and progressivity.
`a4_out_of_scope`	Out-of-scope question	How would this reform affect UK inflation over the next 12 months? (Chat should refuse cleanly — PolicyEngine doesn't model macro effects.)
`a5_factual_lookup`	Historical parameter lookup	How has the UK personal allowance changed over the last 15 years? Just the figures, no analysis.

Test B — alternative positioning (chat replicates what app-v2 reports compute):

ID	Title	Prompt (gist)	Fixture source
`b1_society_wide_pa`	Society-wide PA reform — baseline replication	Run an economy-wide comparison for UK 2025: raise income tax PA from £12,570 to £15,000 on EFRS 2023-24. Report budgetary impact, decile impacts, poverty changes.	PE-API reform 83092 (Vahid blog)
`b2_ni_it_stacked`	Stacked NI + income tax reform	UK 2026-27 economy comparison for a layered reform that adds an NI surcharge layer and raises income tax — subset of the Reeves Nov-2025 pre-Budget package.	PE-API reforms 94906, 94910, 94938
`b3_household_calc`	Household calculation — no microdata	UK 2025 figures for a single adult, age 35, £45,000 employment income, no dependents, England. Income tax, NI, household net income, MTR.	Local `policyengine_uk.Simulation`
`b4_mtr_schedule`	MTR schedule — sanity check	For a single adult in 2025/26, compute combined income tax + employee NI MTR at £10k, £20k, £30k, £50k, £75k, £100k, £125k, £150k.	Local rule-driven schedule

Dropped: b5_two_child_limit (the reform removing the two-child limit is a no-op vs policyengine_uk 2.88.20's baseline because Autumn Budget 2025 already removed it in current law). See evals/fixtures/drift_report.md.

vercel · 2026-05-15T18:23:22Z


Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 27, 2026 3:53pm

github-actions · 2026-05-15T18:23:45Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

Moves the eval design doc into the repo as evals/SPEC.md and lays out the directory structure the harness will use.

Ten hand-authored scenarios are included as YAML: five Test A (chat as supplement) and five Test B (chat as alternative). Each scenario covers a distinct question shape and stress-tests a specific failure mode.

No runner yet; that's the next PR. This PR is just the data and schema. See evals/README.md for layout and evals/SPEC.md for design, thresholds, and roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes:

1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with ones drawn from Vahid Ahmadi's published UK analyses:
   - B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget) with reference figures from uk-income-tax-ni-reforms-2025.md
   - B5 → remove the two-child benefit limit (Autumn Budget 2025) with reference figures from uk-two-child-limit.md

   This shifts Test B from "does chat match a one-off API call I made" to "does chat reproduce PolicyEngine's published analyses", a much stronger framing.

2. Add `anchor` blocks to every scenario. Anchors carry:
   - must_mention: phrases a good answer must include
   - must_not_say: claims that would be wrong
   - ideal_explanation / ideal_finding: prose sketch the grader uses

   In v1, anchors are human-grader aids. In v2 they become inputs to an automated LLM-judge. Per-scenario anchor sourcing documented in SPEC.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses `key_by` (row identifier) + `compare` (field to diff). Adds a per-row extractor that locates the key in chat prose and pulls the nearby percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the grader diffs combined_mtr per gross-income row.
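
To make the `key_by` / `compare` shape concrete, here is a small sketch of a list-of-dicts diff. It assumes an absolute tolerance and that extraction has already produced rows; `diff_rows` is a hypothetical helper, `gross_income` is an assumed key name, and the values are illustrative rather than fixture data.

```python
def diff_rows(fixture_rows, extracted_rows, key_by, compare, abs_tol):
    """Diff one field per row, matching rows on a shared key.

    Returns {key: True | False | None}; None means the row (or the field)
    was not found in the extracted output.
    """
    extracted = {row[key_by]: row.get(compare) for row in extracted_rows}
    results = {}
    for row in fixture_rows:
        key = row[key_by]
        got = extracted.get(key)
        if got is None:
            results[key] = None
        else:
            results[key] = abs(got - row[compare]) <= abs_tol
    return results


# Illustrative values only (not the committed B4 fixture).
fixture = [
    {"gross_income": 30000, "combined_mtr": 0.28},
    {"gross_income": 75000, "combined_mtr": 0.42},
    {"gross_income": 100000, "combined_mtr": 0.62},
]
extracted = [
    {"gross_income": 30000, "combined_mtr": 0.28},
    {"gross_income": 75000, "combined_mtr": 0.40},
]
row_results = diff_rows(
    fixture, extracted, key_by="gross_income", compare="combined_mtr", abs_tol=0.005
)
```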

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- summarise_events() now extracts tool_call_sequence, tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a finished run's manifest. Accepts one or more run dirs for A/B comparison.

The point: when we register a new typed tool (calculate_household etc.), we need to see whether Claude actually picked it vs falling back to run_python. Reading 60 SSE logs by hand doesn't scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vahid-ahmadi

Review

This is a genuinely well-built harness and the right thing to land — additive (only evals/ + .gitignore, no production code touched), version-pinned and drift-filtered fixtures with documented drops, a clean human-graded-A / auto-graded-B split, secret redaction in the saved manifest (protection-bypass=REDACTED), and the intellectual honesty of committing a first run that fails its own pre-committed thresholds. Approving as scaffolding. My comments are about how much weight the Test B verdict can bear, since that's the part that runs automatically.

1. The headline failure rate conflates extractor misses with real failures. In grade_b_scenario, a run counts as a failure if http_error or no field produced a non-None within_tolerance — i.e. the regex extractor found nothing. So a model that answered correctly but phrased numbers in a way the scraper missed is scored identically to a 600s Modal timeout. The PR already flags that B3's extractor false-negatives on prose-embedded numbers — which means some of the 67% headline is harness brittleness, not model behaviour. The RESULTS writeup attributes the failures mostly to timeouts; worth separating "HTTP/timeout failure" from "answer-not-extractable" as distinct counters so the two can't be confused, and so a later extractor fix doesn't look like a model improvement.
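
One possible shape for the separate counters suggested in point 1, as a sketch only: it assumes each run carries an `http_error` flag and a list of per-field `within_tolerance` results where None means "nothing extracted". The category labels and the `classify_run` helper are made up for illustration and are not part of grade.py.

```python
from collections import Counter


def classify_run(http_error, field_results):
    """Put a run into exactly one of three buckets."""
    if http_error:
        return "http_or_timeout_failure"
    # No field produced a non-None result: the extractor found nothing.
    if all(result is None for result in field_results):
        return "answer_not_extractable"
    return "graded"


# Illustrative runs: (http_error, per-field within_tolerance results).
runs = [
    (True, []),
    (False, [None, None]),
    (False, [True, None]),
    (False, [False, False]),
]
failure_counts = Counter(classify_run(err, fields) for err, fields in runs)
```

With the buckets reported separately, a later extractor fix would move runs from "answer_not_extractable" to "graded" without touching the timeout count.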

2. overall_field_accuracy is a mean-of-scenario-means, not a pooled field rate. b_threshold_check does mean(field_accuracies) with one number per scenario, so a 1-field scenario weighs the same as a 20-field one. SPEC.md's "≥95% of fields within tolerance" reads like a pooled field-level rate. These can diverge meaningfully. Either pool all diff outcomes across scenarios, or make the SPEC wording match the per-scenario-mean definition — right now the metric and its stated definition don't obviously agree.
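
A small worked example of how the two definitions in point 2 can diverge. The scenario names and outcomes are invented purely to show the arithmetic, and both helper names are hypothetical.

```python
def mean_of_scenario_means(outcomes):
    """Average the per-scenario accuracy rates (each scenario weighs the same)."""
    rates = [sum(results) / len(results) for results in outcomes.values()]
    return sum(rates) / len(rates)


def pooled_field_rate(outcomes):
    """Share of all fields within tolerance, pooled across scenarios."""
    flat = [ok for results in outcomes.values() for ok in results]
    return sum(flat) / len(flat)


# Invented outcomes: one 1-field scenario and one 20-field scenario.
outcomes = {
    "one_field_scenario": [True],
    "twenty_field_scenario": [True] * 15 + [False] * 5,
}
per_scenario = mean_of_scenario_means(outcomes)  # (1.0 + 0.75) / 2
pooled = pooled_field_rate(outcomes)             # 16 of 21 fields
```

Here the per-scenario mean is 0.875 while the pooled rate is 16/21, roughly 0.762, so the same set of field outcomes sits at very different distances from a 95% threshold depending on the definition.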

3. Prose number-scraping is doing load-bearing pass/fail work and is the weakest link. parse_number_near (sign-flip-on-nearby-"cut/reduce" heuristic, 120/200-char windows) and _extract_row_value ("combined rates are conventionally the last % in a row") are reasonable best-effort, but a wrong sign or a grabbed reform-parameter (£15,000) turns into within_tolerance=False and silently counts against accuracy — indistinguishable from a genuinely wrong model answer. For an eval whose whole purpose is trustworthy numeric verdicts, grading the model's free-form prose with regex is fragile. The manifest already captures the done event and tool metadata — consider grading Test B against structured tool outputs (or an eval-only structured response), with prose-scraping as the fallback. That also dissolves #1, since extraction would no longer be a failure mode of the model's score.

Minor: a_threshold_check only evaluates trust-killers/fabrication over fully_graded rows, so a partially-graded response with a 1 on honesty is dropped rather than failing the gate. incomplete_count is surfaced, so it's visible, but a partial trust-killer can hide — worth checking trust-killers on any graded dimension regardless of completeness.
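
A sketch of the suggested change for the minor point, assuming a trust-killer is a score of 1 on any rubric dimension and that ungraded dimensions are stored as None. Apart from "honesty", the dimension names are placeholders, and `has_trust_killer` is a hypothetical helper.

```python
def has_trust_killer(scores, trust_killer_score=1):
    """True if any graded dimension has the trust-killer score.

    Ungraded dimensions (None) are ignored, so a partially graded
    response is still checked on whatever has been filled in.
    """
    return any(
        score == trust_killer_score
        for score in scores.values()
        if score is not None
    )


# Partially graded response: only "honesty" has been filled in so far.
partial = {"honesty": 1, "dimension_b": None, "dimension_c": None}
complete_ok = {"honesty": 4, "dimension_b": 5, "dimension_c": 3}
partial_flagged = has_trust_killer(partial)
complete_flagged = has_trust_killer(complete_ok)
```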

None of these block landing the harness; they're about not over-reading the auto-generated Test B numbers until extraction is replaced or hardened. Did not execute the runner (needs a live backend + fixtures env).

vercel Bot deployed to Preview May 15, 2026 18:23 View deployment

SakshiKekre force-pushed the feat/eval-harness branch from 86af2b0 to 765afb5 Compare May 15, 2026 18:51

vercel Bot deployed to Preview May 15, 2026 18:53 View deployment

vercel Bot deployed to Preview May 19, 2026 19:57 View deployment

Add eval runner

5fac830

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 20, 2026 12:53 View deployment

Add eval grader

86dec72

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 20, 2026 14:30 View deployment

vercel Bot deployed to Preview May 21, 2026 22:39 View deployment

SakshiKekre mentioned this pull request May 27, 2026

WIP: US Python backend (latency spike — do not merge) #54

Closed

5 tasks

Generate Test B fixtures via local policyengine, filter by 10% drift

dd768db

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 13:01 View deployment

Grader: parse markdown tables for list-of-dicts extraction

3ea1f80

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 13:27 View deployment

Runner: add --concurrency for parallel scenario runs

2161c99

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:40 View deployment

vercel Bot deployed to Preview May 27, 2026 15:43 View deployment

SakshiKekre mentioned this pull request May 27, 2026

Register the three dormant typed tools (UK) #55

Draft

Add 2026-05-27 eval results writeup

5c715fb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:50 View deployment

Add Test A grading aggregates to writeup

3c11d7b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 27, 2026 15:53 View deployment

SakshiKekre mentioned this pull request May 27, 2026

Add UK chat integration (drawer on reports + standalone page) PolicyEngine/policyengine-app-v2#1036

Open

5 tasks

vahid-ahmadi mentioned this pull request May 28, 2026

Land typed reform tools to short-circuit reform-API guessing (track PR #55) #81

Open

This was referenced May 28, 2026

Add reform-API eval cases to catch silent regressions like the 1pp basic-rate failure #82

Open

Add 5 reform-API regression eval cases (closes #82) #90

Open

SakshiKekre mentioned this pull request Jun 1, 2026

Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95

Draft

4 tasks

vahid-ahmadi reviewed Jun 4, 2026

SakshiKekre wants to merge 11 commits into main from feat/eval-harness


2 participants
