Add eval harness scaffold: spec, scenarios, fixtures dir#52
SakshiKekre wants to merge 11 commits into
Moves the eval design doc into the repo as evals/SPEC.md and lays out the directory structure the harness will use.
Ten hand-authored scenarios are included as YAML: five Test A (chat as supplement) and five Test B (chat as alternative). Each scenario covers a distinct question shape and stress-tests a specific failure mode.
No runner yet; that's the next PR. This PR is just the data and schema.
See evals/README.md for layout and evals/SPEC.md for design, thresholds, and roadmap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
86af2b0 to
765afb5
Two changes:
1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with
ones drawn from Vahid Ahmadi's published UK analyses:
- B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget)
with reference figures from uk-income-tax-ni-reforms-2025.md
- B5 → remove the two-child benefit limit (Autumn Budget 2025)
with reference figures from uk-two-child-limit.md
This shifts Test B from "does chat match a one-off API call I made"
to "does chat reproduce PolicyEngine's published analyses" — a much
stronger framing.
2. Add `anchor` blocks to every scenario. Anchors carry:
- must_mention: phrases a good answer must include
- must_not_say: claims that would be wrong
- ideal_explanation / ideal_finding: prose sketch the grader uses
In v1, anchors are human-grader aids. In v2 they become inputs to
an automated LLM-judge.
Per-scenario anchor sourcing documented in SPEC.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
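A minimal sketch of how an `anchor` block could be checked mechanically, assuming a plain case-insensitive substring match. The function name `check_anchor`, the dictionary layout, and the example phrases are hypothetical illustrations, not taken from the scenarios or from grade.py.

```python
def check_anchor(answer: str, anchor: dict) -> dict:
    """Case-insensitive substring check of an answer against an anchor block.

    Hypothetical helper: the real grader may match differently.
    """
    text = answer.lower()
    # Phrases a good answer must include but does not.
    missing = [p for p in anchor.get("must_mention", []) if p.lower() not in text]
    # Claims that would be wrong and do appear in the answer.
    forbidden = [p for p in anchor.get("must_not_say", []) if p.lower() in text]
    return {
        "missing": missing,
        "forbidden": forbidden,
        "ok": not missing and not forbidden,
    }


# Illustrative anchor and answer; the phrases are made up for this example.
anchor = {
    "must_mention": ["two-child limit", "Universal Credit"],
    "must_not_say": ["no effect on poverty"],
}
answer = "Removing the two-child limit raises Universal Credit awards for larger families."
result = check_anchor(answer, anchor)
```

Substring matching is deliberately crude; it fits the v1 role of anchors as grader aids rather than as a verdict on its own.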
- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses `key_by` (row identifier) + `compare` (field to diff). Adds a per-row extractor that locates the key in chat prose and pulls the nearby percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the grader diffs combined_mtr per gross-income row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
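The list-of-dicts comparison described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the code in grade.py: the function name `diff_rows`, the absolute-tolerance rule, and the MTR figures are all hypothetical. Only `key_by`, `compare`, `combined_mtr`, and the gross-income row idea come from the commit message.

```python
def diff_rows(expected_rows, actual_rows, key_by, compare, tolerance):
    """Match rows on `key_by` and diff the `compare` field.

    within_tolerance is None when the row or value was not found, so
    "not extracted" stays distinct from "extracted but wrong".
    """
    actual_by_key = {row[key_by]: row for row in actual_rows}
    results = []
    for exp in expected_rows:
        got = actual_by_key.get(exp[key_by], {}).get(compare)
        if got is None:
            within = None
        else:
            # Assumed rule: absolute difference against a fixed tolerance.
            within = abs(got - exp[compare]) <= tolerance
        results.append({
            key_by: exp[key_by],
            "expected": exp[compare],
            "actual": got,
            "within_tolerance": within,
        })
    return results


# Made-up numbers, purely to show the shape of the comparison.
fixture = [
    {"gross_income": 30000, "combined_mtr": 0.28},
    {"gross_income": 60000, "combined_mtr": 0.42},
    {"gross_income": 110000, "combined_mtr": 0.62},
]
extracted = [
    {"gross_income": 30000, "combined_mtr": 0.28},
    {"gross_income": 60000, "combined_mtr": 0.45},
]
rows = diff_rows(fixture, extracted, key_by="gross_income",
                 compare="combined_mtr", tolerance=0.01)
verdicts = [r["within_tolerance"] for r in rows]
```

Here the first row matches, the second is outside tolerance, and the third was never extracted, so it carries `None` rather than `False`.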
- summarise_events() now extracts tool_call_sequence, tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a finished run's manifest. Accepts one or more run dirs for A/B comparison.
The point: when we register a new typed tool (calculate_household etc.), we need to see whether Claude actually picked it vs falling back to run_python. Reading 60 SSE logs by hand doesn't scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
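A rough sketch of the kind of roll-up this commit describes, assuming the SSE stream has already been parsed into a list of dicts. The event shape (`type`, `name`, `ok`) and the function name `sketch_tool_summary` are assumptions for illustration; the actual event schema is not shown in this PR text.

```python
from collections import Counter


def sketch_tool_summary(events):
    """Summarise tool usage from already-parsed events (assumed shape)."""
    # Order of tool invocations, as the model issued them.
    sequence = [e["name"] for e in events if e.get("type") == "tool_call"]
    # A tool result counts as a failure when its assumed `ok` flag is False.
    failures = sum(
        1 for e in events
        if e.get("type") == "tool_result" and not e.get("ok", True)
    )
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": failures,
    }


# Hypothetical event list for one run.
events = [
    {"type": "tool_call", "name": "run_python"},
    {"type": "tool_result", "name": "run_python", "ok": False},
    {"type": "tool_call", "name": "calculate_household"},
    {"type": "tool_result", "name": "calculate_household", "ok": True},
    {"type": "tool_call", "name": "run_python"},
    {"type": "done"},
]
summary = sketch_tool_summary(events)
```

Keeping the ordered sequence alongside the counts makes it possible to see whether a typed tool was tried first or only after a `run_python` attempt.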
vahid-ahmadi left a comment
Review
This is a genuinely well-built harness and the right thing to land — additive (only evals/ + .gitignore, no production code touched), version-pinned and drift-filtered fixtures with documented drops, a clean human-graded-A / auto-graded-B split, secret redaction in the saved manifest (protection-bypass=REDACTED), and the intellectual honesty of committing a first run that fails its own pre-committed thresholds. Approving as scaffolding. My comments are about how much weight the Test B verdict can bear, since that's the part that runs automatically.
1. The headline failure rate conflates extractor misses with real failures. In grade_b_scenario, a run counts as a failure if http_error or no field produced a non-None within_tolerance — i.e. the regex extractor found nothing. So a model that answered correctly but phrased numbers in a way the scraper missed is scored identically to a 600s Modal timeout. The PR already flags that B3's extractor false-negatives on prose-embedded numbers — which means some of the 67% headline is harness brittleness, not model behaviour. The RESULTS writeup attributes the failures mostly to timeouts; worth separating "HTTP/timeout failure" from "answer-not-extractable" as distinct counters so the two can't be confused, and so a later extractor fix doesn't look like a model improvement.
2. overall_field_accuracy is a mean-of-scenario-means, not a pooled field rate. b_threshold_check does mean(field_accuracies) with one number per scenario, so a 1-field scenario weighs the same as a 20-field one. SPEC.md's "≥95% of fields within tolerance" reads like a pooled field-level rate. These can diverge meaningfully. Either pool all diff outcomes across scenarios, or make the SPEC wording match the per-scenario-mean definition — right now the metric and its stated definition don't obviously agree.
3. Prose number-scraping is doing load-bearing pass/fail work and is the weakest link. parse_number_near (sign-flip-on-nearby-"cut/reduce" heuristic, 120/200-char windows) and _extract_row_value ("combined rates are conventionally the last % in a row") are reasonable best-effort, but a wrong sign or a grabbed reform-parameter (£15,000) turns into within_tolerance=False and silently counts against accuracy — indistinguishable from a genuinely wrong model answer. For an eval whose whole purpose is trustworthy numeric verdicts, grading the model's free-form prose with regex is fragile. The manifest already captures the done event and tool metadata — consider grading Test B against structured tool outputs (or an eval-only structured response), with prose-scraping as the fallback. That also dissolves #1, since extraction would no longer be a failure mode of the model's score.
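To make point 2 concrete, here is a small worked example of how a mean-of-scenario-means and a pooled field rate can diverge. The scenario names and outcomes are invented for illustration, and the two function names are hypothetical, not those in grade.py.

```python
from statistics import mean

# Per-scenario lists of field outcomes (True = within tolerance).
outcomes = {
    "scenario_with_1_field": [True],
    "scenario_with_20_fields": [True] * 10 + [False] * 10,
}


def mean_of_scenario_means(outcomes):
    """One accuracy per scenario, then an unweighted mean across scenarios."""
    return mean(sum(v) / len(v) for v in outcomes.values())


def pooled_field_rate(outcomes):
    """All field outcomes pooled, so each field carries equal weight."""
    pooled = [ok for v in outcomes.values() for ok in v]
    return sum(pooled) / len(pooled)


per_scenario = mean_of_scenario_means(outcomes)  # (1.0 + 0.5) / 2 = 0.75
pooled = pooled_field_rate(outcomes)             # 11 / 21, about 0.524
```

With these inputs the same run reports 75% under one definition and about 52% under the other, which is why the SPEC wording and the metric need to name the same quantity.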
Minor: a_threshold_check only evaluates trust-killers/fabrication over fully_graded rows, so a partially-graded response with a 1 on honesty is dropped rather than failing the gate. incomplete_count is surfaced, so it's visible, but a partial trust-killer can hide — worth checking trust-killers on any graded dimension regardless of completeness.
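A sketch of the difference, assuming each graded row holds a `scores` mapping where `None` means "not graded yet" and a score of 1 on `honesty` is a trust-killer. The row layout, dimension names other than honesty, and function names are hypothetical.

```python
TRUST_KILLER_DIMENSIONS = {"honesty"}  # assumed set of gating dimensions
TRUST_KILLER_SCORE = 1


def is_fully_graded(row):
    return all(score is not None for score in row["scores"].values())


def has_trust_killer(row):
    return any(
        row["scores"].get(dim) == TRUST_KILLER_SCORE
        for dim in TRUST_KILLER_DIMENSIONS
    )


def trust_killers_fully_graded_only(rows):
    """Behaviour described in the comment: partial rows are dropped first."""
    return [r["id"] for r in rows if is_fully_graded(r) and has_trust_killer(r)]


def trust_killers_any_graded(rows):
    """Suggested behaviour: check every row that has the dimension graded."""
    return [r["id"] for r in rows if has_trust_killer(r)]


# Hypothetical rows: a2 is only partially graded but already has honesty = 1.
rows = [
    {"id": "a1", "scores": {"honesty": 4, "clarity": 5}},
    {"id": "a2", "scores": {"honesty": 1, "clarity": None}},
]
strict = trust_killers_fully_graded_only(rows)
lenient = trust_killers_any_graded(rows)
```

In this example the partially graded row escapes the first check and is caught by the second.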
None of these block landing the harness; they're about not over-reading the auto-generated Test B numbers until extraction is replaced or hardened. Did not execute the runner (needs a live backend + fixtures env).
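For point 1, the split into distinct counters could look something like the sketch below. The run-record layout (`http_error`, `fields`, `within_tolerance`) loosely follows the description above, but the exact structure, the label strings, and the function name `classify_run` are assumptions.

```python
from collections import Counter


def classify_run(run):
    """Label a Test B run so transport failures and extractor misses differ."""
    if run.get("http_error"):
        return "http_or_timeout_failure"
    verdicts = [f["within_tolerance"] for f in run.get("fields", [])]
    # No field produced a non-None verdict: the extractor found nothing.
    if all(v is None for v in verdicts):
        return "answer_not_extractable"
    return "graded"


# Hypothetical run records.
runs = [
    {"http_error": "timeout", "fields": []},
    {"http_error": None, "fields": [{"within_tolerance": None}]},
    {"http_error": None, "fields": [{"within_tolerance": True},
                                    {"within_tolerance": None}]},
]
counts = Counter(classify_run(r) for r in runs)
```

Reporting the three counts separately means a later extractor fix moves runs out of `answer_not_extractable` without changing the timeout figure.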
Summary
End-to-end eval harness for uk-chat: scenarios, fixtures, runner, grader, and a first run's results. Originally scoped as just the scaffold + scenarios — grew to cover the full pipeline because the follow-up PRs would have been small and the harness only earns its keep once you can actually run it.
Pre-committed thresholds in `evals/SPEC.md`; first eval pass against PR #51's preview written up in `evals/RESULTS-2026-05-27.md`.
What's in the PR
Spec + scenarios (`evals/SPEC.md`, `evals/scenarios/`)
- […] `policyengine_uk 2.88.20`'s baseline because Autumn Budget 2025 already removed it in current law. Drop documented in `evals/fixtures/drift_report.md`.
Runner (`evals/runner/run.py`)
- `--concurrency N` for parallel scenarios (default sequential).
- `tool_call_sequence`, `tool_call_counts_by_name`, `tool_failure_count`, `model_backend` per run; needed for A/B'ing tool-registration changes.
- `UK_CHAT_BACKEND_URL` env or `--backend-url` flag; supports Vercel protection bypass.
Fixture builder (`evals/runner/build_fixtures.py`)
- […] `policyengine` + `policyengine_uk 2.88.20` against the EFRS 2023-24 dataset under real reform IDs (83092, 94906, 94910, 94938, 94911).
- […] `/uk/policy/<id>` (read-only DB endpoint, unaffected by the May 12 API outage) rather than hand-rolling reform specs.
- […] `evals/fixtures/drift_report.md`.
- `policyengine 0.13.0` + `policyengine_uk 2.88.20` + `policyengine_core 3.26.10`. Separate `requirements-fixtures.txt` so the runner's runtime stays minimal.
Grader (`evals/runner/grade.py`)
- […] `--threshold-check` aggregates afterwards.
- […] `must_mention` / `must_not_say` for methodology drift, and applies the SPEC's pre-committed thresholds.
Tool-usage aggregator (`evals/runner/tool_usage.py`): rolls up `tool_call_counts_by_name` across an entire run directory. Used for comparing PR #51 (one execution tool) vs PR #55 (typed tools registered) to see whether Claude actually picks the typed surface.
Results from first pass (`evals/RESULTS-2026-05-27.md`)
What's out of scope
How to reproduce
Fixtures are pre-built and committed under `evals/fixtures/pe_api/`; only re-build them if you bump `policyengine_uk` or want to retest drift against new published figures.
🤖 Generated with Claude Code
Scenarios
Test A, supplement positioning (chat seeded with a report's context, asked open-ended follow-ups):
- `a1_mechanism`
- `a2_subset_slice`
- `a3_multiparam_what_if`
- `a4_out_of_scope`
- `a5_factual_lookup`
Test B, alternative positioning (chat replicates what app-v2 reports compute):
- `b1_society_wide_pa`
- `b2_ni_it_stacked`
- `b3_household_calc` ([…] `policyengine_uk.Simulation`)
- `b4_mtr_schedule`
Dropped: `b5_two_child_limit` (the reform removing the two-child limit is a no-op vs `policyengine_uk 2.88.20`'s baseline because Autumn Budget 2025 already removed it in current law). See `evals/fixtures/drift_report.md`.