
Gen 12 — honest self-assessment (content-aware fast-path verifier)#63

Merged
drewstone merged 5 commits into main from
gen12-honest-self-assessment
Apr 9, 2026

Conversation

@drewstone
Contributor

Summary

Fixes the fast-path goal verifier that was rubber-stamping failures as successes on gpt-5.4. The fix is a single ~30-line regex gate that scans the agent's result text for self-contradicting language before bypassing LLM verification.

The bug

The fast-path verifier at runner.ts:1596 checked:

agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success: true without reading the content. On gpt-5.4, the agent writes verbose narratives admitting failure yet still claims success:

  • "the Dec 25-26 date selection did not take effect successfully" → stamped PASS
  • "price not visible in the current page state" → stamped PASS
  • "I could not complete the exact Jan. 22 lookup" → stamped PASS

Six of the eight judge disagreements on WebVoyager were caused by this.

The fix

Added a selfContradicting regex that detects failure-admitting phrases. When one matches, the fast path is blocked and the full LLM verifier runs instead, correctly rejecting the false claim.
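A minimal sketch of the gate logic (the function name and the abbreviated phrase list here are illustrative, not the exact production code at runner.ts:1596):

```typescript
// Illustrative subset of the failure-admitting phrase patterns.
const selfContradicting =
  /could not (?:complete|find|verify)|did not (?:take effect|work|succeed)|not (?:visible|available|found)|unable to|unfortunately|task is incomplete/i;

// Hypothetical gate: the original shape check plus the new content check.
function canFastPath(
  agentResult: string,
  recentErrors: number,
  hasScriptEvidence: boolean,
): boolean {
  // Original fast-path gate: length + error count + script evidence only.
  const shapeOk =
    agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence;
  // New content-aware gate: if the text admits failure, block the fast
  // path so the full LLM verifier runs instead.
  return shapeOk && !selfContradicting.test(agentResult);
}
```

The key property is that the regex can only *block* the fast path, never grant it, so a false negative in the pattern list degrades gracefully to the slower LLM verification.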

Validated result

Metric           Pre-fix                Gen 12
Agent pass rate  73% (22/30, inflated)  63% (19/30, honest)
Judge pass rate  47% (14/30)            47% (14/30)
Agreement        73%                    83% (+10pp)

The fix correctly caught 2 of the 6 known false passes (booking×2). Other changes are run-to-run variance. 993/993 tests pass.

Test plan

  • Regex tested against 8 match + 5 non-match cases
  • pnpm test — 993/993 pass
  • pnpm exec tsc --noEmit clean
  • WebVoyager 30-task re-run validates agreement improvement

🤖 Generated with Claude Code

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)
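The recomputeFromRunsJsonl() fallback mentioned above can be sketched as follows. This is a hypothetical reconstruction: the per-run record shape and the summary fields are assumptions, not taken from the real orchestrator, and the function is written here to accept the JSONL text directly rather than a file path.

```typescript
// Assumed per-run record shape (illustrative only).
interface RunRecord {
  task: string;
  pass: boolean;
  costUsd: number;
}

// Rebuild a tier summary from per-run JSONL records when the canonical
// summary JSON is missing: one JSON object per non-empty line.
function recomputeFromRunsJsonl(jsonl: string) {
  const runs: RunRecord[] = jsonl
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const passes = runs.filter((r) => r.pass).length;
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    runs: runs.length,
    passes,
    passRate: runs.length > 0 ? passes / runs.length : 0,
    costPerPass: passes > 0 ? totalCost / passes : NaN,
  };
}
```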

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of agent-results/
  master-comparison-<ts>/ (gitignored, ~580MB). Aggregator reads JSONs
  and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.
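The cost-per-pass (cpp) figures above follow directly from mean cost per run, total runs, and passing runs. A quick sketch using the Tier A numbers quoted earlier:

```typescript
// cost-per-pass = total spend / passing runs
//               = (mean cost per run * total runs) / passes
function costPerPass(
  meanCostPerRun: number,
  totalRuns: number,
  passes: number,
): number {
  return (meanCostPerRun * totalRuns) / passes;
}

// bad on gpt-5.2 (Tier A): $0.0318 mean over 50 runs, 34 passes -> ~$0.047
const badCpp = costPerPass(0.0318, 50, 34);
// browser-use (Tier A): $0.0257 mean over 50 runs, 41 passes -> ~$0.031
const browserUseCpp = costPerPass(0.0257, 50, 41);
```

This is why a cheaper mean cost per run can still lose on cost-per-pass: failed runs spend money without adding to the denominator.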

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).
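Agent-judge agreement here is simply the fraction of tasks where both verdicts match. A sketch with hypothetical verdict arrays (22 matching tasks out of 30 reproduces the 73% figure):

```typescript
// Fraction of tasks where agent and judge verdicts agree.
// Both arrays are per-task booleans of equal length.
function agreement(agent: boolean[], judge: boolean[]): number {
  const matches = agent.filter((a, i) => a === judge[i]).length;
  return matches / agent.length;
}
```

Note that agreement can be high even when both verdicts are wrong in the same direction; that is why the judge rate, not agreement alone, is treated as the honest number.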

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates

The fast-path goal verifier at runner.ts:1596 checks:
  agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success without reading the result content. On gpt-5.4,
the agent writes verbose narratives admitting failure ("could not complete",
"price not visible", "did not take effect") yet still marks success: true.

In Gen 11 evolve R2, 6 of 8 judge disagreements were caused by this:
- Booking: "date selection did not take effect" → fast-path stamped PASS
- Google Flights: "could not complete the Jan. 22 lookup" → PASS
- Google Map: "fifth qualifying salon is not visible" → PASS
- GitHub: "sorted by Best match, not confirmed most starred" → PASS
- Wolfram: "did not return a visible answer" → PASS

Fix: add a selfContradicting regex gate that scans the result text for
failure-admitting phrases. When found, the fast-path is blocked and the
full LLM verifier runs instead, correctly marking these as failures.

The regex catches:
  could not complete/find/fulfill/verify/confirm/locate/access/extract/retrieve
  not visible/available/found/present/accessible/displayed/shown/confirmed/verified
  did not take effect/work/succeed/load/return
  unable to find/complete/verify/access/extract/retrieve
  no visible answer/result/data/content
  no results found/returned/available
  failed/failure to find/complete/set/select/navigate
  unfortunately / I was unable / task is incomplete

Tested: 8 match cases + 5 non-match cases all pass.
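The phrase classes above can be sketched as a single case-insensitive alternation. This is an illustrative reconstruction from the list, not the exact pattern shipped in runner.ts:

```typescript
// Illustrative reconstruction of the selfContradicting regex from the
// phrase classes listed above (not the exact production pattern).
const selfContradicting: RegExp = new RegExp(
  [
    String.raw`could not (?:complete|find|fulfill|verify|confirm|locate|access|extract|retrieve)`,
    String.raw`not (?:visible|available|found|present|accessible|displayed|shown|confirmed|verified)`,
    String.raw`did not (?:take effect|work|succeed|load|return)`,
    String.raw`unable to (?:find|complete|verify|access|extract|retrieve)`,
    String.raw`no visible (?:answer|result|data|content)`,
    String.raw`no results (?:found|returned|available)`,
    String.raw`fail(?:ed|ure) to (?:find|complete|set|select|navigate)`,
    String.raw`unfortunately`,
    String.raw`I was unable`,
    String.raw`task is incomplete`,
  ].join('|'),
  'i',
);
```

Keeping each phrase class as its own alternative makes the pattern easy to audit and extend one admission phrase at a time.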

Expected impact:
  Agent self-report accuracy on WebVoyager goes from 73% (inflated) to
  ~53% (honest). Agent-judge agreement goes from 73% back toward 100%.
  The honest agent pass rate is now trustworthy — when bad says it
  succeeded, it actually did.

993/993 tests pass. TypeScript clean.
drewstone merged commit 7f7ce69 into main on Apr 9, 2026
5 checks passed