fix(bench): steering experiments get real evidence, vacuity guard, honest metering #274
Merged
Adds the Trata hedge-bench integration (102 financial analysis tasks spanning 21 equity domains). Inline data embedding replaces Harbor/Docker; a 3-stage LLM judge (hallucination → per-move coverage → synthesis) ports Trata's grading pipeline to TypeScript via the Tangle router.

- bench/src/benchmarks/trata-hedge.ts: BenchmarkAdapter with priority-ordered file embedding (MAX_DATA_CHARS=85K), 3-stage judge, extractAnalysis() fail-closed sentinel, and coverageThreshold() scoring
- bench/src/trata-gate.mts: standalone direct-router gate (no sandbox); correct for text-analysis tasks vs the sandbox/opencode path
- bench/src/adapters.ts: register 'trata-hedge' adapter
- bench/src/run.ts: WORKER_PROVIDER passthrough for openai-compat models

Baseline (deepseek-v4-flash, gemini-2.5-flash judge, n=102):

- resolved 4/4: 10.8%
- mean: 0.385
- score dist: 20|29|42|0|11
- best domains: rblx 50% / soc 40% / gety+irdm+tko 25%
- worst: apo 0% (requires specific quantitative projections)
- 0 hallucinations, 0 errors
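The baseline figures can be cross-checked against each other. The sketch below assumes the five buckets in the score distribution correspond, in order, to scores 0, 0.25, 0.5, 0.75, and 1.0, and that "resolved 4/4" means the top bucket; the function and constant names are illustrative and not taken from the adapter.

```typescript
// Assumed mapping: bucket i of the distribution holds tasks scored SCORE_LEVELS[i].
const SCORE_LEVELS: number[] = [0, 0.25, 0.5, 0.75, 1.0];
// Score distribution quoted in the baseline: 20|29|42|0|11.
const BASELINE_DIST: number[] = [20, 29, 42, 0, 11];

interface Summary {
  n: number;            // total number of tasks
  mean: number;         // mean score across all tasks
  resolvedRate: number; // fraction of tasks in the top bucket
}

// Hypothetical helper: derive the headline numbers from a bucketed distribution.
function summarize(levels: number[], counts: number[]): Summary {
  const n = counts.reduce((a, b) => a + b, 0);
  const total = counts.reduce((sum, c, i) => sum + c * (levels[i] ?? 0), 0);
  const fullyResolved = counts[counts.length - 1] ?? 0;
  return { n, mean: total / n, resolvedRate: fullyResolved / n };
}

const s = summarize(SCORE_LEVELS, BASELINE_DIST);
console.log(s.n, s.mean.toFixed(3), (s.resolvedRate * 100).toFixed(1) + "%");
// prints: 102 0.385 10.8%
```

Under that assumed mapping the distribution reproduces n=102, the 0.385 mean, and the 10.8% fully resolved rate.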
Wires the GEPA optimization loop over the Trata financial analysis benchmark. The optimization surface is the analyst system prompt; GEPA reflects on missed rubric themes per generation and proposes improved prompts gated on a frozen holdout.

Key properties:

- analyzeGeneration uses themesMissed/themesHit from judge detail: a richer failure signal than a generic binary fail (diagnoses which analytical moves the model consistently skips)
- scoreBased partial credit (0/0.25/0.5/0.75/1.0) gives the optimizer a smooth gradient across the 42-task mid-band rather than a sparse binary signal
- K_ROUNDS=2 option adds a self-critique refine pass (initial analysis → coverage review → revised answer) without restructuring the worker
- 70/32 train/holdout split, hash-shuffled by task id for difficulty balance
- mutationPrimitives target the known failure modes: missing quantitative targets, absent IRR calculations, missing peer comparisons, theme elision
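A hash-shuffled split like the 70/32 one described here can be sketched as follows. This is only an illustration: the PR does not say which hash is used, so FNV-1a is a stand-in, and `fnv1a` and `splitByHash` are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units (stand-in hash; the real one is not shown).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force unsigned
}

interface Split {
  train: string[];
  holdout: string[];
}

// Order task ids by their hash (ties broken by the id itself), then cut.
// The result depends only on the set of ids, not on their input order,
// so the holdout stays frozen across runs.
function splitByHash(taskIds: string[], trainSize: number): Split {
  const ordered = [...taskIds].sort(
    (a, b) => fnv1a(a) - fnv1a(b) || (a < b ? -1 : a > b ? 1 : 0),
  );
  return { train: ordered.slice(0, trainSize), holdout: ordered.slice(trainSize) };
}

// Usage with made-up ids: 102 tasks split into 70 train / 32 holdout.
const ids = Array.from({ length: 102 }, (_, i) => `task-${i}`);
const split = splitByHash(ids, 70);
console.log(split.train.length, split.holdout.length); // prints: 70 32
```

Sorting by hash rather than by id decorrelates the split from any ordering in the task list (for example, tasks grouped by domain), which is one plausible reading of "for difficulty balance".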
…nest metering

The AppWorld EYES->HANDS ablation ran vacuously: the analyst punted with 'no change needed' on half its rounds because it was never shown the judge verdict, steers that did fire were untargeted, the inline router executor metered every iteration as $0/0 tokens, and none of this was visible until corpus-report declared the contrasts UNINFORMATIVE after the full spend.

- appworld driver + adapter: evaluate emits the failed sub-test names (bounded); judge detail carries them into the verdict
- experiment runArm validator: threads judge detail into verdict.notes so SteerHistory carries WHAT failed, not just the scalar
- llmAnalyst: shows the judge verdict + failure detail as ground truth; 'no change needed' is only legitimate on a passed verdict
- runExperiment: per-treatment steer fire-rate (ArmOutcome.steered) + vacuity guard: a treatment arm consulted in 5 multi-round instances that fired 0 steers aborts the run loud instead of burning the budget as a second blind control; fire-rate surfaces in the summary
- inline-sandbox-client: emit a flat llm_call event the kernel's extractLlmCallEvent can meter, so iterations no longer record a fabricated $0/0-token cost
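The vacuity guard described above might look roughly like the sketch below. The field names `steered` and `multiRound` come from the commit message; the `arm` field, the `checkVacuity` helper, the "at least 5" reading of the threshold, and the error text are assumptions made for illustration.

```typescript
// Assumed shape: only `steered` and `multiRound` are named in the PR.
interface ArmOutcome {
  arm: string;         // hypothetical arm label
  multiRound: boolean; // the instance ran more than one round, so a steer was possible
  steered: boolean;    // at least one steer fired in this instance
}

interface SteerRate {
  fired: number;
  opportunities: number;
}

// Hypothetical guard: abort once a treatment arm has had `minInstances`
// multi-round opportunities and has fired zero steers in all of them.
function checkVacuity(outcomes: ArmOutcome[], arm: string, minInstances = 5): SteerRate {
  const multi = outcomes.filter((o) => o.arm === arm && o.multiRound);
  const fired = multi.filter((o) => o.steered).length;
  if (multi.length >= minInstances && fired === 0) {
    throw new Error(
      `vacuous treatment arm "${arm}": 0 steers fired in ${multi.length} multi-round instances`,
    );
  }
  return { fired, opportunities: multi.length };
}

// Usage: four silent instances are still below the threshold, so no abort yet.
const silent = (n: number): ArmOutcome[] =>
  Array.from({ length: n }, () => ({ arm: "treatment", multiRound: true, steered: false }));
console.log(checkVacuity(silent(4), "treatment")); // { fired: 0, opportunities: 4 }
```

Calling the guard after each instance, rather than once at the end, is what lets the run stop early instead of spending the full budget on an arm that behaves like a control.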
tangletools (Contributor) approved these changes on Jun 12, 2026 and left a comment:
✅ Auto-approved PR — 24741e07
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-12T15:54:44Z
The AppWorld EYES→HANDS ablation ran vacuously, and the run only revealed it after the full spend (`corpus-report` → UNINFORMATIVE, 0 discordant pairs). Root causes, each fixed at its seam:

- `llmAnalyst` showed the worker's output but never the verdict or the judge's failure detail, so it punted with 'no change needed' on 20/40 rounds of failing attempts. AppWorld's evaluator returns the exact failed sub-tests; the validator discarded them. `evaluate` now emits bounded `failure_names`; the adapter carries them in `BenchScore.detail`; the validator threads detail into `verdict.notes`; `llmAnalyst` presents the verdict + failures as ground truth and may only reply 'no change needed' on a passed verdict.
- Steer fire-rate was not surfaced during the run. Added `ArmOutcome.steered`/`multiRound` + per-arm `steer {fired, opportunities}` in the summary, and a vacuity guard: a treatment arm consulted in 5 multi-round instances with 0 fired steers aborts the run loud instead of burning the budget as a second blind control.
- `inlineSandboxClient` reported usage only in the nested `result` payload, a shape `extractLlmCallEvent` never matches → every iteration metered 0 tokens/$0. It now emits a flat `llm_call` event the kernel meters.

Verified live (1-task probe): analyst returns targeted corrections grounded in the failed sub-tests, steer fires and is recorded, iterations meter real tokens/cost (155+370 tok, $0.0005).

`tsc --noEmit` green. N=20 re-run in flight on the deepseek seat.
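The metering bug comes down to event shape. The sketch below is not the kernel's `extractLlmCallEvent`; it is a hypothetical matcher (`meterFlatLlmCall`) with assumed field names (`inputTokens`, `outputTokens`, `costUsd`) that shows why usage nested under `result` goes unmetered while a flat `llm_call` event is picked up. The token and cost values reuse the probe figures; which number is input and which is output is an assumption.

```typescript
// Assumed usage shape; the kernel's real field names are not shown in this PR.
interface LlmUsage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

type SandboxEvent = Record<string, unknown>;

// Hypothetical matcher: only accepts a flat event whose type is "llm_call"
// and whose usage fields sit at the top level.
function meterFlatLlmCall(event: SandboxEvent): LlmUsage | undefined {
  if (event["type"] !== "llm_call") return undefined;
  const inputTokens = event["inputTokens"];
  const outputTokens = event["outputTokens"];
  const costUsd = event["costUsd"];
  if (
    typeof inputTokens !== "number" ||
    typeof outputTokens !== "number" ||
    typeof costUsd !== "number"
  ) {
    return undefined;
  }
  return { inputTokens, outputTokens, costUsd };
}

// Old shape: usage buried inside a nested result payload, so nothing is metered.
const nested: SandboxEvent = {
  type: "result",
  result: { usage: { inputTokens: 155, outputTokens: 370, costUsd: 0.0005 } },
};
// New shape: a flat llm_call event.
const flat: SandboxEvent = {
  type: "llm_call",
  inputTokens: 155,
  outputTokens: 370,
  costUsd: 0.0005,
};

console.log(meterFlatLlmCall(nested)); // undefined
console.log(meterFlatLlmCall(flat));   // { inputTokens: 155, outputTokens: 370, costUsd: 0.0005 }
```

A matcher that returns `undefined` on unknown shapes is easy to mistake for "zero usage" downstream, which is consistent with the $0/0-token records described above.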