Gen 11 ships the truth-table benchmark infrastructure:
- scripts/run-master-comparison.mjs (290 LOC orchestrator)
Walks 4 tiers in priority order, captures per-tier summary JSONs,
aggregates into a single REPORT.md with executive summary, per-tier
tables, cross-framework + cross-model truth tables, honest weak spots,
and reproduction instructions.
Tiers:
A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
B — WebVoyager 30-task curated subset (bad only, LLM judge)
C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
D — Tier 1 deterministic gate (regression check)
Features:
- Resumable via --skip-tier
- Single-tier override via --tier
- Hard cost cap ($25 cumulative, configurable)
- Tier failures don't stop other tiers
- Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
- Per-tier launch + status logged to tier-log.jsonl
- bench/external/webvoyager/curated-30.json
30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.
- bench/external/webvoyager/run.mjs
Added --cases-file flag so the master orchestrator can pass curated
subsets without overwriting the canonical converted cases.json.
- package.json: bench:master script
- .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
cost envelope, success criteria.
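The --cases-file plumbing above is small; a minimal sketch of how such a flag override might be parsed (the flag name is from the commit, but the helper and default path are illustrative assumptions, not the repo's actual code):

```javascript
// Hypothetical sketch: resolve --cases-file from argv, falling back to the
// canonical converted cases.json so the default run is unchanged.
function resolveCasesFile(argv, defaultPath = 'cases.json') {
  const i = argv.indexOf('--cases-file');
  return i !== -1 && argv[i + 1] ? argv[i + 1] : defaultPath;
}
```

This keeps the curated subset a pure read-time override, so the converted cases.json on disk is never rewritten.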
This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.
Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.
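The orchestrator features above (skip, single-tier override, cumulative cost cap, tier isolation) reduce to one loop; a hypothetical sketch, with the tier/runner shapes and field names invented for illustration rather than taken from the actual script:

```javascript
// Sketch of a tier-walking loop: --skip-tier, --tier override, a hard
// cumulative cost cap, and failures that don't stop the remaining tiers.
const COST_CAP_USD = 25;

function runTiers(tiers, { skip = [], only = null } = {}) {
  const results = {};
  let spentUsd = 0;
  for (const tier of tiers) {
    if (only && tier.name !== only) continue;        // --tier single-tier override
    if (skip.includes(tier.name)) {                  // --skip-tier resume support
      results[tier.name] = { status: 'skipped' };
      continue;
    }
    if (spentUsd >= COST_CAP_USD) {                  // hard cumulative cost cap
      results[tier.name] = { status: 'cost-capped' };
      continue;
    }
    try {
      const summary = tier.run();                    // tier writes its own summary JSON
      spentUsd += summary.costUsd ?? 0;
      results[tier.name] = { status: 'ok', summary };
    } catch (err) {
      // a tier failure must not stop the other tiers
      results[tier.name] = { status: 'failed', error: String(err) };
    }
  }
  return results;
}
```

The cap is checked before launch, not mid-tier, which matches the "cumulative" wording: an expensive tier can overshoot, but no further tier starts once the budget is spent.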
Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.
What ran (4 tiers, ~3 hrs wall, ~$15):
Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
10 real-web tasks, gpt-5.2:
bad 34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
bad is 4.5x faster but loses 7 tasks on pass rate
bad wins stackoverflow (+2); browser-use wins npm (-3),
wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
arxiv/reddit/python-docs
Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
bad Gen 10 only, GPT-4o vision judge:
12/30 = 40% judge pass rate
100% judge-agent agreement (bad does NOT lie)
Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
Half 1/2: ArXiv, BBC News, ESPN, GitHub
Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
Flights, Google Map, Huggingface (long multi-step tasks
hit the 15-turn / 120s caps)
Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
mean wall 9.4s (vs 14.6s on gpt-5.2)
gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
*** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***
Tier D — Tier 1 deterministic gate (regression check):
FAILED both runs on local-form-multistep fast-explore at
100k+ tokens. Same dist/cli.js Gen 10 build that passed
at 47k tokens earlier today. Pure load-sensitivity flake.
NEW finding: concurrent-load sensitivity
bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
concurrent load). All losses on extraction tasks Gen 10 had previously
fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
prevented death spirals but bad's recovery loops fire more under load.
Investigate in Gen 12.
What ships:
- scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
* Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
* Resumable via --skip-tier, single-tier override via --tier
* --aggregate-only re-builds REPORT.md from existing data
* Hard cost cap ($25 cumulative)
* recomputeFromRunsJsonl() merges partial data when canonical summary missing
* Derives realWebTasks from bench/competitive/tasks/real-web/*.json
(was hardcoded — now picks up new tasks automatically)
- bench/external/webvoyager/curated-30.json
30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
fast to run. Site list derived dynamically in the report.
- bench/external/webvoyager/run.mjs
Added --cases-file flag so master orchestrator can pass curated subsets
without overwriting the canonical converted cases.json
- bench/external/webvoyager/evaluate.mjs
3 bug fixes:
1. Missing openai npm dep (judge couldn't import)
2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
but verdict is the agent's freeform completion text, not a status —
fixed to use testResult.agentSuccess)
3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)
- package.json: bench:master script + openai dep
- docs/GEN11-MASTER-COMPARISON.md
The truth table (167 lines, all data from this session, no stale refs)
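Bug fix #2 in evaluate.mjs comes down to checking a boolean flag instead of string-matching freeform text; a minimal sketch using the field names quoted in the commit (the helper name itself is invented):

```javascript
// Sketch of the verdict-field fix: `verdict` holds the agent's freeform
// completion text, not a status string, so comparing it to 'PASS' was
// always false. The boolean agentSuccess flag is the real signal.
function agentPassed(testResult) {
  // was: return testResult.verdict === 'PASS';
  return testResult.agentSuccess === true;
}
```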
What's NOT a regression:
- wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
instead of {"year":1815}. LLM-compliance issue with goal prompt.
- Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
Load-sensitive flake, not code regression.
- WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
tight for these tasks. Configuration choice.
Reproducibility:
pnpm install && pnpm build && pnpm bench:master
Each tier writes raw data to a per-tier subdir of agent-results/
master-comparison-<ts>/ (gitignored, ~580MB). Aggregator reads JSONs
and produces docs/GEN11-MASTER-COMPARISON.md (committed).
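recomputeFromRunsJsonl() is named above but not shown; a hedged sketch of what such a fallback might look like, assuming a per-run record shape of { task, pass, costUsd } (an assumption for illustration, not the repo's actual schema):

```javascript
// Sketch: rebuild a tier summary from raw per-run JSONL records when the
// canonical summary JSON is missing (e.g. the tier crashed mid-run).
function recomputeFromRunsJsonl(jsonlText) {
  const runs = jsonlText
    .split('\n')
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line));
  const passes = runs.filter((r) => r.pass).length;
  const totalCost = runs.reduce((sum, r) => sum + (r.costUsd ?? 0), 0);
  return {
    runs: runs.length,
    passes,
    passRate: runs.length ? passes / runs.length : 0,
    costUsd: totalCost,
    costPerPass: passes ? totalCost / passes : null,
  };
}
```

Because JSONL is append-only, every completed run survives a crash, so the aggregator can always report a partial result instead of an empty section.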
Gen 12 candidates:
1. Make bad robust to concurrent system load
2. Default to gpt-5.4 for real-web tasks (+25pp)
3. Wikipedia oracle compliance prompt fix
4. Configurable per-task max-turns for WebVoyager long-form
5. Stagehand adapter (currently a stub)
Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:
  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep): 43/50 = 86% pass, $0.042 cpp, 8.8s mean wall ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall
bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched) AND is
7.4x faster mean wall, 9.3x faster p95 wall. Cost-per-pass is +35% vs
browser-use ($0.042 vs $0.031). Drew explicitly approved the trade — speed
advantage justifies the cost increase.
Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element: 2/5 -> 5/5 (+3)
  npm-package-downloads: 2/5 -> 5/5 (+3)
  python-docs-method-signature: 3/5 -> 5/5 (+2)
  wikipedia-fact-lookup: 3/5 -> 4/5 (+1)
  mdn-array-flatmap: 2/5 -> 3/5 (+1)
  arxiv-paper-abstract: 5/5 -> 4/5 (-1, variance)
  stackoverflow / hn / github / reddit: parity
These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.
The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86% —
the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.
Per evolve protocol Phase 9 persistence:
- .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
- .evolve/progress.md: full round 1 writeup with per-task table
- .evolve/experiments.jsonl: gen11-002 logged
Next round candidates (Gen 11 evolve R2):
1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
2. mdn / stackoverflow stabilization
3. Re-run WebVoyager curated 30 with gpt-5.4
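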
Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
Agent-judge agreement: 73% (was 100% on gpt-5.2)
Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2
Exp B — Wikipedia oracle compliance fix:
4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
The 1 fail is a real extraction error (returned 1843 death year, not
1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.
Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
Extended caps barely help: +3 wins (apple, bbc, google-flights)
offset by -2 regressions (booking — more turns = more chances to fail).
Verdict: the real gain is the MODEL UPGRADE, not the cap extension.
Key finding: gpt-5.4 agent-judge disagreement
On gpt-5.2: 100% agreement (agent never lied about success).
On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
says FAIL). gpt-5.4 is more capable but less well-calibrated.
The honest WebVoyager number is judge rate (47%), not agent rate (73%).
Files:
- wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
- curated-30-extended.json: 25-turn / 240s variant for Exp C
- .evolve/ state updates
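The Exp B compliance fix hinges on the oracle accepting only the wrapped {"year":N} shape; a sketch of such a check, inferred from the failure description (raw '1815' rejected, {"year":1815} accepted) rather than taken from the repo's actual oracle:

```javascript
// Sketch of a strict JSON-shape oracle: a bare number like "1815" parses
// as a number, not an object, so only the wrapped form passes.
function oracleAccepts(output, expectedYear) {
  try {
    const parsed = JSON.parse(output);
    return parsed !== null && typeof parsed === 'object' && parsed.year === expectedYear;
  } catch {
    return false; // non-JSON freeform text fails outright
  }
}
```

Under this contract the remaining 1/5 failure is genuinely an extraction error (wrong year in the right shape), which is why the prompt fix can't push past 4/5.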
Summary
Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.
This is NOT an agent runtime change. The agent stays at Gen 10 (the version
that just merged in #60). Gen 11 is benchmark infrastructure + the honest
measurement of where Gen 10 stands.
Top finding
bad Gen 10 + gpt-5.4 = the strict-upgrade configuration:
gpt-5.4 fixes the extraction tasks gpt-5.2 struggles on (mdn/npm/w3c/python-docs all 3/3) at lower cost-per-pass AND faster wall-time. If you have the gpt-5.4 budget, switch.
What ran (4 tiers, ~3 hrs wall, ~$15 cost)
Tier A — bad Gen 10 vs browser-use 0.12.6 (5-rep matched same-day)
  bad 34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
  browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
Per-task delta: bad wins stackoverflow (+2); browser-use wins npm (-3),
wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/arxiv/reddit/python-docs.
Tier B — WebVoyager 30-task curated sample (2 per site x 15 sites)
  12/30 = 40% judge pass rate, 100% judge-agent agreement
Per-site: 2/2 on Apple, Coursera, Google Search, Wolfram Alpha; 1/2 on ArXiv,
BBC News, ESPN, GitHub; 0/2 on Allrecipes, Amazon, Booking, Cambridge,
Google Flights, Google Map, Huggingface (long multi-step tasks hit the
15-turn / 120s caps).
Tier C — multi-model: gpt-5.4 = strict upgrade (see Top Finding above)
Tier D — Tier 1 deterministic gate
Failed both runs on local-form-multistep fast-explore at 100k+ tokens. Same
dist/cli.js Gen 10 build that passed earlier today at 47k tokens. Pure
load-sensitivity flake, not a code regression.
NEW finding: concurrent-load sensitivity
bad's pass rate dropped from 74% (Gen 10 5-rep isolation) to 68% (Gen 11 4-tier concurrent load), with all losses on extraction tasks Gen 10 had previously fixed (npm 5/5→2/5, w3c 5/5→2/5). browser-use's pass rate barely moved (84% → 82%).
The cost cap (100k) prevented death spirals — no run hit the cap — but bad's recovery loops fire more often under load. Investigate in Gen 12.
What ships in this PR
- scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
  - Resumable via --skip-tier, single-tier override via --tier
  - --aggregate-only re-builds REPORT.md from existing data
  - Hard cost cap ($25 cumulative)
  - recomputeFromRunsJsonl() merges partial data when canonical summary missing
  - Derives realWebTasks from bench/competitive/tasks/real-web/*.json (was hardcoded)
- bench/external/webvoyager/curated-30.json — 30 hand-picked diverse tasks
- bench/external/webvoyager/run.mjs — --cases-file flag
- bench/external/webvoyager/evaluate.mjs — 3 bug fixes:
  1. Missing openai npm dep (judge couldn't import)
  2. Wrong verdict field check (was testResult.verdict === 'PASS' but verdict
     is the agent's freeform completion text — fixed to use
     testResult.agentSuccess)
  3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)
- package.json — bench:master script + openai dep
- docs/GEN11-MASTER-COMPARISON.md — the truth table (167 lines, all data from
  this session, no stale refs)
Reproduction
pnpm install && pnpm build && pnpm bench:master
Each tier writes raw data to a per-tier subdirectory of
agent-results/master-comparison-<ts>/ (gitignored, ~580MB with videos). The
aggregator reads those JSONs and produces REPORT.md. If a tier fails, its
summary will be missing and the section will say so explicitly.
Honest weak spots
ANTHROPIC_API_KEY not in .env, Stagehand adapter is a stub, WebArena requires
Docker + 50GB + 7 ports. Deferred to Gen 12.
Gen 12 candidates
1. Make bad robust to concurrent system load
2. Default to gpt-5.4 for real-web tasks (+25pp)
3. Wikipedia oracle compliance prompt fix (agent emits raw '1815', not
   {"year":1815})
4. Configurable per-task max-turns for WebVoyager long-form
5. Stagehand adapter (currently a stub)
Test plan
- pnpm exec tsc --noEmit clean
- pnpm check:boundaries clean
- pnpm bench:master --tier D standalone — orchestrator works
- pnpm bench:master --aggregate-only — generates REPORT.md from existing data
🤖 Generated with Claude Code