fix(bench): steering experiments get real evidence, vacuity guard, honest metering by drewstone · Pull Request #274 · tangle-network/agent-runtime

drewstone · 2026-06-12T15:54:37Z

The AppWorld EYES→HANDS ablation ran vacuously and the run only revealed it after full spend (corpus-report → UNINFORMATIVE, 0 discordant pairs). Root causes, each fixed at its seam:

Analyst blind to the judge — llmAnalyst showed the worker's output but never the verdict or the judge's failure detail, so it punted with 'no change needed' on 20/40 rounds of failing attempts. AppWorld's evaluator returns the exact failed sub-tests; the validator discarded them.
- driver evaluate now emits bounded failure_names; the adapter carries them in BenchScore.detail; the validator threads detail into verdict.notes; llmAnalyst presents the verdict + failures as ground truth and may only reply 'no change needed' on a passed verdict.
Vacuity invisible until report time — steer fire-rate wasn't recorded anywhere.
- ArmOutcome.steered/multiRound + per-arm steer {fired, opportunities} in the summary, and a vacuity guard: a treatment arm consulted in 5 multi-round instances with 0 fired steers aborts the run loud instead of burning the budget as a second blind control.
Fabricated $0 metering — inlineSandboxClient reported usage only in the nested result payload, a shape extractLlmCallEvent never matches → every iteration metered 0 tokens/$0.
- the inline client now emits a flat llm_call event the kernel meters.

Verified live (1-task probe): analyst returns targeted corrections grounded in the failed sub-tests, steer fires and is recorded, iterations meter real tokens/cost (155+370 tok, $0.0005). tsc --noEmit green. N=20 re-run in flight on the deepseek seat.

Adds the Trata hedge-bench integration (102 financial analysis tasks spanning 21 equity domains). Inline data embedding replaces Harbor/Docker; a 3-stage LLM judge (hallucination → per-move coverage → synthesis) ports Trata's grading pipeline to TypeScript via the Tangle router. - bench/src/benchmarks/trata-hedge.ts: BenchmarkAdapter with priority-ordered file embedding (MAX_DATA_CHARS=85K), 3-stage judge, extractAnalysis() fail-closed sentinel, and coverageThreshold() scoring - bench/src/trata-gate.mts: standalone direct-router gate (no sandbox); correct for text-analysis tasks vs the sandbox/opencode path - bench/src/adapters.ts: register 'trata-hedge' adapter - bench/src/run.ts: WORKER_PROVIDER passthrough for openai-compat models Baseline (deepseek-v4-flash, gemini-2.5-flash judge, n=102): resolved 4/4: 10.8% mean: 0.385 score dist: 20|29|42|0|11 best domains: rblx 50% / soc 40% / gety+irdm+tko 25% worst: apo 0% (requires specific quantitative projections) 0 hallucinations, 0 errors

Wires the GEPA optimization loop over the Trata financial analysis benchmark. The optimization surface is the analyst system prompt; GEPA reflects on missed rubric themes per generation and proposes improved prompts gated on a frozen holdout. Key properties: - analyzeGeneration uses themesMissed/themesHit from judge detail — richer failure signal than a generic binary fail (diagnoses which analytical moves the model consistently skips) - scoreBased partial credit (0/0.25/0.5/0.75/1.0) gives the optimizer a smooth gradient across the 42-task mid-band rather than a sparse binary signal - K_ROUNDS=2 option adds a self-critique refine pass (initial analysis → coverage review → revised answer) without restructuring the worker - 70/32 train/holdout split, hash-shuffled by task id for difficulty balance - mutationPrimitives target the known failure modes: missing quantitative targets, absent IRR calculations, missing peer comparisons, theme elision

…nest metering The AppWorld EYES->HANDS ablation ran vacuously: the analyst punted with 'no change needed' on half its rounds because it was never shown the judge verdict, steers that did fire were untargeted, the inline router executor metered every iteration as $0/0 tokens, and none of this was visible until corpus-report declared the contrasts UNINFORMATIVE after the full spend. - appworld driver + adapter: evaluate emits the failed sub-test names (bounded); judge detail carries them into the verdict - experiment runArm validator: threads judge detail into verdict.notes so SteerHistory carries WHAT failed, not just the scalar - llmAnalyst: shows the judge verdict + failure detail as ground truth; 'no change needed' is only legitimate on a passed verdict - runExperiment: per-treatment steer fire-rate (ArmOutcome.steered) + vacuity guard - a treatment arm consulted in 5 multi-round instances that fired 0 steers aborts the run loud instead of burning the budget as a second blind control; fire-rate surfaces in the summary - inline-sandbox-client: emit a flat llm_call event the kernel's extractLlmCallEvent can meter - iterations no longer record a fabricated $0/0-token cost

tangletools

✅ Auto-approved PR — `24741e07`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-12T15:54:44Z}

drewstone added 3 commits June 12, 2026 08:40

tangletools approved these changes Jun 12, 2026

View reviewed changes

drewstone merged commit 6ba308c into main Jun 12, 2026
1 check passed

drewstone mentioned this pull request Jun 12, 2026

feat(bench): appworld-react adapter — interactive ReAct worker for steering experiments #275

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bench): steering experiments get real evidence, vacuity guard, honest metering#274

fix(bench): steering experiments get real evidence, vacuity guard, honest metering#274
drewstone merged 3 commits into
mainfrom
fix/eyes-hands-evidence

drewstone commented Jun 12, 2026

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 12, 2026

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 24741e07

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved PR — `24741e07`