
fix(bench): steering experiments get real evidence, vacuity guard, honest metering#274

Merged
drewstone merged 3 commits into main from fix/eyes-hands-evidence
Jun 12, 2026


@drewstone


The AppWorld EYES→HANDS ablation ran vacuously and the run only revealed it after full spend (corpus-report → UNINFORMATIVE, 0 discordant pairs). Root causes, each fixed at its seam:

  1. Analyst blind to the judge: llmAnalyst showed the worker's output but never the verdict or the judge's failure detail, so it punted with 'no change needed' on 20/40 rounds of failing attempts. AppWorld's evaluator returns the exact failed sub-tests; the validator discarded them.
    • driver evaluate now emits bounded failure_names; the adapter carries them in BenchScore.detail; the validator threads detail into verdict.notes; llmAnalyst presents the verdict + failures as ground truth and may only reply 'no change needed' on a passed verdict.
  2. Vacuity invisible until report time: steer fire-rate wasn't recorded anywhere.
    • ArmOutcome.steered/multiRound + per-arm steer {fired, opportunities} in the summary, and a vacuity guard: a treatment arm consulted in 5 multi-round instances with 0 fired steers aborts the run loud instead of burning the budget as a second blind control.
  3. Fabricated $0 metering: inlineSandboxClient reported usage only in the nested result payload, a shape extractLlmCallEvent never matches, so every iteration metered 0 tokens/$0.
    • the inline client now emits a flat llm_call event the kernel meters.
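
For readers who want the vacuity guard from item 2 in concrete form, here is a minimal sketch. Only `ArmOutcome.steered`, `ArmOutcome.multiRound`, the `{fired, opportunities}` summary and the threshold of 5 come from the description above; the function names, the error message and the choice to count steers only within multi-round instances are assumptions, not the merged implementation.

```typescript
// Hypothetical shapes: the PR names ArmOutcome.steered / multiRound and a
// per-arm steer {fired, opportunities} summary; everything else is assumed.
interface ArmOutcome {
  steered: boolean;    // a steer fired in this instance
  multiRound: boolean; // the instance ran more than one round (analyst was consulted)
}

interface SteerStats {
  fired: number;
  opportunities: number;
}

// Threshold quoted in the PR description.
const VACUITY_MIN_OPPORTUNITIES = 5;

// Count steer opportunities (multi-round instances) and how many of them fired.
function steerStats(outcomes: ArmOutcome[]): SteerStats {
  const multi = outcomes.filter((o) => o.multiRound);
  return {
    fired: multi.filter((o) => o.steered).length,
    opportunities: multi.length,
  };
}

// Abort as soon as a treatment arm has had enough chances to steer and never did.
function assertNotVacuous(arm: string, outcomes: ArmOutcome[]): SteerStats {
  const stats = steerStats(outcomes);
  if (stats.opportunities >= VACUITY_MIN_OPPORTUNITIES && stats.fired === 0) {
    throw new Error(
      `vacuous treatment arm "${arm}": 0 steers fired in ${stats.opportunities} multi-round instances`,
    );
  }
  return stats;
}
```

Calling a check like this after each instance, rather than once at report time, is what lets the run stop before the budget is spent.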

Verified live (1-task probe): analyst returns targeted corrections grounded in the failed sub-tests, steer fires and is recorded, iterations meter real tokens/cost (155+370 tok, $0.0005). tsc --noEmit green. N=20 re-run in flight on the deepseek seat.
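
The analyst rule behind that probe (show the verdict and failed sub-tests as ground truth, and accept 'no change needed' only on a passed verdict) can be sketched as below. The `Verdict` shape, the prompt wording and both function names are hypothetical; the PR only states that judge detail is threaded into `verdict.notes` and presented by `llmAnalyst`.

```typescript
// Hypothetical shapes; the PR only names verdict.notes and llmAnalyst.
interface Verdict {
  passed: boolean;
  notes: string[]; // e.g. names of failed sub-tests threaded from the judge
}

type AnalystDecision =
  | { kind: "no-change" }
  | { kind: "steer"; correction: string }
  | { kind: "rejected"; reason: string };

const NO_CHANGE = "no change needed";

// Put the judge's verdict and failures in front of the analyst as ground truth.
function buildAnalystPrompt(workerOutput: string, verdict: Verdict): string {
  const failures =
    verdict.notes.length > 0 ? verdict.notes.map((n) => `- ${n}`).join("\n") : "(none reported)";
  return [
    `Judge verdict (ground truth): ${verdict.passed ? "PASSED" : "FAILED"}`,
    `Failed checks:\n${failures}`,
    `Worker output:\n${workerOutput}`,
    verdict.passed
      ? `Reply "${NO_CHANGE}" if nothing should change.`
      : "The attempt failed; reply with a targeted correction.",
  ].join("\n\n");
}

// A punt is only legitimate when the judge passed the attempt.
function interpretReply(reply: string, verdict: Verdict): AnalystDecision {
  const text = reply.trim();
  const punted = text.toLowerCase().startsWith(NO_CHANGE);
  if (punted && !verdict.passed) {
    return { kind: "rejected", reason: "analyst punted on a failed verdict" };
  }
  if (punted) return { kind: "no-change" };
  return { kind: "steer", correction: text };
}
```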

Adds the Trata hedge-bench integration (102 financial analysis tasks
spanning 21 equity domains). Inline data embedding replaces Harbor/Docker;
a 3-stage LLM judge (hallucination → per-move coverage → synthesis) ports
Trata's grading pipeline to TypeScript via the Tangle router.

- bench/src/benchmarks/trata-hedge.ts: BenchmarkAdapter with priority-ordered
  file embedding (MAX_DATA_CHARS=85K), 3-stage judge, extractAnalysis()
  fail-closed sentinel, and coverageThreshold() scoring
- bench/src/trata-gate.mts: standalone direct-router gate (no sandbox);
  correct for text-analysis tasks vs the sandbox/opencode path
- bench/src/adapters.ts: register 'trata-hedge' adapter
- bench/src/run.ts: WORKER_PROVIDER passthrough for openai-compat models
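
As an illustration of the priority-ordered file embedding mentioned for trata-hedge.ts, the sketch below fills a character budget in priority order and truncates the file that crosses the limit. Only `MAX_DATA_CHARS` and the 85K figure come from the commit message; the `DataFile` shape, the header format, reading 85K as 85,000 characters and treating a lower priority number as more important are all assumptions.

```typescript
// Hypothetical file record; the adapter's real type is not shown in the PR.
interface DataFile {
  path: string;
  priority: number; // assumption: lower number = embed first
  content: string;
}

// "85K" from the commit message, read here as 85,000 characters (assumption).
const MAX_DATA_CHARS = 85_000;

// Embed files in priority order until the character budget is exhausted.
// The file that crosses the budget is truncated; later files are dropped.
function embedFiles(files: DataFile[], budget: number = MAX_DATA_CHARS): string {
  const ordered = [...files].sort((a, b) => a.priority - b.priority);
  const parts: string[] = [];
  let used = 0;
  for (const file of ordered) {
    const header = `--- ${file.path} ---\n`;
    // Reserve room for the header and the trailing newline of this section.
    const remaining = budget - used - header.length - 1;
    if (remaining <= 0) break;
    const body = file.content.slice(0, remaining);
    parts.push(`${header}${body}\n`);
    used += header.length + body.length + 1;
  }
  return parts.join("");
}
```

Truncating rather than skipping the file at the boundary keeps as much of the higher-priority data in the prompt as the budget allows.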

Baseline (deepseek-v4-flash, gemini-2.5-flash judge, n=102):
  resolved 4/4: 10.8%  mean: 0.385  score dist: 20|29|42|0|11
  best domains: rblx 50% / soc 40% / gety+irdm+tko 25%
  worst: apo 0% (requires specific quantitative projections)
  0 hallucinations, 0 errors
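
The baseline figures are internally consistent if the five `score dist` buckets are read as task counts at scores 0, 0.25, 0.5, 0.75 and 1.0, in that order. That ordering is an assumption, although it matches the 4/4 resolved rate and the reported mean. A quick check:

```typescript
// Bucket counts from the baseline line; bucket order (0/4 .. 4/4) is assumed.
const scoreDist = [20, 29, 42, 0, 11];
const credit = [0, 0.25, 0.5, 0.75, 1.0];

const totalTasks = scoreDist.reduce((sum, count) => sum + count, 0);
const weightedSum = scoreDist.reduce((sum, count, i) => sum + count * credit[i], 0);

const meanScore = weightedSum / totalTasks;                      // 39.25 / 102
const resolvedRate = scoreDist[scoreDist.length - 1] / totalTasks; // 11 / 102

console.log(totalTasks);                       // 102
console.log(meanScore.toFixed(3));             // "0.385"
console.log((resolvedRate * 100).toFixed(1));  // "10.8"
```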
Wires the GEPA optimization loop over the Trata financial analysis benchmark.
The optimization surface is the analyst system prompt; GEPA reflects on missed
rubric themes per generation and proposes improved prompts gated on a frozen holdout.

Key properties:
- analyzeGeneration uses themesMissed/themesHit from judge detail — richer
  failure signal than a generic binary fail (diagnoses which analytical moves
  the model consistently skips)
- scoreBased partial credit (0/0.25/0.5/0.75/1.0) gives the optimizer a smooth
  gradient across the 42-task mid-band rather than a sparse binary signal
- K_ROUNDS=2 option adds a self-critique refine pass (initial analysis →
  coverage review → revised answer) without restructuring the worker
- 70/32 train/holdout split, hash-shuffled by task id for difficulty balance
- mutationPrimitives target the known failure modes: missing quantitative
  targets, absent IRR calculations, missing peer comparisons, theme elision
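
One way to get a deterministic 70/32 split that is "hash-shuffled by task id" is to order the ids by a hash and cut the list. The sketch below uses FNV-1a purely as a stand-in, since the commit message does not say which hash is used; `fnv1a` and `splitByHash` are hypothetical names.

```typescript
// FNV-1a (32-bit) as a stand-in hash; the hash used in the PR is not stated.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Order ids by hash (ties broken by id) and cut into train / holdout.
// The result depends only on the set of ids, not on their input order.
function splitByHash(
  taskIds: string[],
  trainSize: number,
): { train: string[]; holdout: string[] } {
  const ordered = [...taskIds].sort((a, b) => {
    const diff = fnv1a(a) - fnv1a(b);
    if (diff !== 0) return diff;
    return a < b ? -1 : a > b ? 1 : 0;
  });
  return { train: ordered.slice(0, trainSize), holdout: ordered.slice(trainSize) };
}
```

Because the order depends only on the ids, the holdout stays frozen across runs, and a hash order is uncorrelated with any ordering by domain or difficulty in the original task list.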
…nest metering

The AppWorld EYES->HANDS ablation ran vacuously: the analyst punted with
'no change needed' on half its rounds because it was never shown the judge
verdict, steers that did fire were untargeted, the inline router executor
metered every iteration as $0/0 tokens, and none of this was visible until
corpus-report declared the contrasts UNINFORMATIVE after the full spend.

- appworld driver + adapter: evaluate emits the failed sub-test names
  (bounded); judge detail carries them into the verdict
- experiment runArm validator: threads judge detail into verdict.notes so
  SteerHistory carries WHAT failed, not just the scalar
- llmAnalyst: shows the judge verdict + failure detail as ground truth;
  'no change needed' is only legitimate on a passed verdict
- runExperiment: per-treatment steer fire-rate (ArmOutcome.steered) +
  vacuity guard - a treatment arm consulted in 5 multi-round instances
  that fired 0 steers aborts the run loud instead of burning the budget
  as a second blind control; fire-rate surfaces in the summary
- inline-sandbox-client: emit a flat llm_call event the kernel's
  extractLlmCallEvent can meter - iterations no longer record a
  fabricated $0/0-token cost
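
The metering bug in the last bullet comes down to event shape. The sketch below shows why usage buried in a nested result payload meters as zero while a flat `llm_call` event does not. The field names, the `extractFlatLlmCall` matcher (a stand-in for the kernel's `extractLlmCallEvent`, whose real signature is not shown in this PR) and the token and cost numbers are all illustrative assumptions.

```typescript
// Hypothetical flat event shape; the kernel's real field names are not shown in the PR.
interface LlmCallEvent {
  type: "llm_call";
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Stand-in matcher: it only recognises usage reported at the top level of the event.
function extractFlatLlmCall(event: unknown): LlmCallEvent | null {
  if (typeof event !== "object" || event === null) return null;
  const e = event as Record<string, unknown>;
  if (e.type !== "llm_call") return null;
  const { model, inputTokens, outputTokens, costUsd } = e;
  if (
    typeof model !== "string" ||
    typeof inputTokens !== "number" ||
    typeof outputTokens !== "number" ||
    typeof costUsd !== "number"
  ) {
    return null;
  }
  return { type: "llm_call", model, inputTokens, outputTokens, costUsd };
}

// Sum tokens and cost over every event the matcher recognises.
function meter(events: unknown[]): { tokens: number; costUsd: number } {
  let tokens = 0;
  let costUsd = 0;
  for (const event of events) {
    const call = extractFlatLlmCall(event);
    if (call === null) continue;
    tokens += call.inputTokens + call.outputTokens;
    costUsd += call.costUsd;
  }
  return { tokens, costUsd };
}

// Illustrative values only. Nested usage is invisible to the matcher.
const nestedEvent = {
  type: "result",
  result: { usage: { inputTokens: 100, outputTokens: 200, costUsd: 0.001 } },
};
// The same usage as a flat event is metered.
const flatEvent = {
  type: "llm_call",
  model: "example-model",
  inputTokens: 100,
  outputTokens: 200,
  costUsd: 0.001,
};
```

With these inputs, `meter([nestedEvent])` reports zero tokens and zero cost, while `meter([flatEvent])` reports the real usage.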

@tangletools tangletools left a comment


✅ Auto-approved PR — 24741e07

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-12T15:54:44Z

@drewstone drewstone merged commit 6ba308c into main Jun 12, 2026
1 check passed

2 participants