
Gen 12 — honest self-assessment (content-aware fast-path verifier)#63

Merged
drewstone merged 5 commits into main from
gen12-honest-self-assessment
Apr 9, 2026

Conversation

@drewstone
Contributor

Summary

Fixes the fast-path goal verifier that was rubber-stamping failures as successes on gpt-5.4. The fix is a single ~30-line regex gate that scans the agent's result text for self-contradicting language before bypassing LLM verification.

The bug

The fast-path verifier at runner.ts:1596 checked:

agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success: true without reading the content. On gpt-5.4, the agent writes verbose narratives admitting failure yet still claims success:

  • "the Dec 25-26 date selection did not take effect successfully" → stamped PASS
  • "price not visible in the current page state" → stamped PASS
  • "I could not complete the exact Jan. 22 lookup" → stamped PASS

Six of the eight judge disagreements on WebVoyager were caused by this.

The fix

Added a selfContradicting regex that detects failure-admitting phrases. When one matches, the fast path is blocked and the full LLM verifier runs instead, correctly rejecting the false claim.
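A minimal sketch of the gate logic (the function name and the abbreviated phrase list here are illustrative, not the exact production code at runner.ts:1596):

```typescript
// Illustrative subset of the failure-admitting phrase patterns.
const selfContradicting =
  /could not (?:complete|find|verify)|did not (?:take effect|work|succeed)|not (?:visible|available|found)|unable to|unfortunately|task is incomplete/i;

// Hypothetical gate: the original shape check plus the new content check.
function canFastPath(
  agentResult: string,
  recentErrors: number,
  hasScriptEvidence: boolean,
): boolean {
  // Original fast-path gate: length + error count + script evidence only.
  const shapeOk =
    agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence;
  // New content-aware gate: if the text admits failure, block the fast
  // path so the full LLM verifier runs instead.
  return shapeOk && !selfContradicting.test(agentResult);
}
```

The key property is that the regex can only *block* the fast path, never grant it, so a false negative in the pattern list degrades gracefully to the slower LLM verification.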

Validated result

Metric           Pre-fix                Gen 12
Agent pass rate  73% (22/30, inflated)  63% (19/30, honest)
Judge pass rate  47% (14/30)            47% (14/30)
Agreement        73%                    83% (+10pp)

The fix correctly caught 2 of the 6 known false passes (booking×2). Other changes are run-to-run variance. 993/993 tests pass.

Test plan

  • Regex tested against 8 match + 5 non-match cases
  • pnpm test — 993/993 pass
  • pnpm exec tsc --noEmit clean
  • WebVoyager 30-task re-run validates agreement improvement

🤖 Generated with Claude Code

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)
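The recomputeFromRunsJsonl() fallback mentioned above can be sketched as follows. This is a hypothetical reconstruction: the per-run record shape and the summary fields are assumptions, not taken from the real orchestrator, and the function is written here to accept the JSONL text directly rather than a file path.

```typescript
// Assumed per-run record shape (illustrative only).
interface RunRecord {
  task: string;
  pass: boolean;
  costUsd: number;
}

// Rebuild a tier summary from per-run JSONL records when the canonical
// summary JSON is missing: one JSON object per non-empty line.
function recomputeFromRunsJsonl(jsonl: string) {
  const runs: RunRecord[] = jsonl
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const passes = runs.filter((r) => r.pass).length;
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    runs: runs.length,
    passes,
    passRate: runs.length > 0 ? passes / runs.length : 0,
    costPerPass: passes > 0 ? totalCost / passes : NaN,
  };
}
```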

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of agent-results/
  master-comparison-<ts>/ (gitignored, ~580MB). Aggregator reads JSONs
  and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.
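The cost-per-pass (cpp) figures above follow directly from mean cost per run, total runs, and passing runs. A quick sketch using the Tier A numbers quoted earlier:

```typescript
// cost-per-pass = total spend / passing runs
//               = (mean cost per run * total runs) / passes
function costPerPass(
  meanCostPerRun: number,
  totalRuns: number,
  passes: number,
): number {
  return (meanCostPerRun * totalRuns) / passes;
}

// bad on gpt-5.2 (Tier A): $0.0318 mean over 50 runs, 34 passes -> ~$0.047
const badCpp = costPerPass(0.0318, 50, 34);
// browser-use (Tier A): $0.0257 mean over 50 runs, 41 passes -> ~$0.031
const browserUseCpp = costPerPass(0.0257, 50, 41);
```

This is why a cheaper mean cost per run can still lose on cost-per-pass: failed runs spend money without adding to the denominator.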

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).
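Agent-judge agreement here is simply the fraction of tasks where both verdicts match. A sketch with hypothetical verdict arrays (22 matching tasks out of 30 reproduces the 73% figure):

```typescript
// Fraction of tasks where agent and judge verdicts agree.
// Both arrays are per-task booleans of equal length.
function agreement(agent: boolean[], judge: boolean[]): number {
  const matches = agent.filter((a, i) => a === judge[i]).length;
  return matches / agent.length;
}
```

Note that agreement can be high even when both verdicts are wrong in the same direction; that is why the judge rate, not agreement alone, is treated as the honest number.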

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates

The fast-path goal verifier at runner.ts:1596 checks:
  agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success without reading the result content. On gpt-5.4,
the agent writes verbose narratives admitting failure ("could not complete",
"price not visible", "did not take effect") yet still marks success: true.

In Gen 11 evolve R2, 6 of 8 judge disagreements were caused by this:
- Booking: "date selection did not take effect" → fast-path stamped PASS
- Google Flights: "could not complete the Jan. 22 lookup" → PASS
- Google Map: "fifth qualifying salon is not visible" → PASS
- GitHub: "sorted by Best match, not confirmed most starred" → PASS
- Wolfram: "did not return a visible answer" → PASS

Fix: add a selfContradicting regex gate that scans the result text for
failure-admitting phrases. When found, the fast-path is blocked and the
full LLM verifier runs instead, correctly marking these as failures.

The regex catches:
  could not complete/find/fulfill/verify/confirm/locate/access/extract/retrieve
  not visible/available/found/present/accessible/displayed/shown/confirmed/verified
  did not take effect/work/succeed/load/return
  unable to find/complete/verify/access/extract/retrieve
  no visible answer/result/data/content
  no results found/returned/available
  failed/failure to find/complete/set/select/navigate
  unfortunately / I was unable / task is incomplete

Tested: 8 match cases + 5 non-match cases all pass.
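The phrase classes above can be sketched as a single case-insensitive alternation. This is an illustrative reconstruction from the list, not the exact pattern shipped in runner.ts:

```typescript
// Illustrative reconstruction of the selfContradicting regex from the
// phrase classes listed above (not the exact production pattern).
const selfContradicting: RegExp = new RegExp(
  [
    String.raw`could not (?:complete|find|fulfill|verify|confirm|locate|access|extract|retrieve)`,
    String.raw`not (?:visible|available|found|present|accessible|displayed|shown|confirmed|verified)`,
    String.raw`did not (?:take effect|work|succeed|load|return)`,
    String.raw`unable to (?:find|complete|verify|access|extract|retrieve)`,
    String.raw`no visible (?:answer|result|data|content)`,
    String.raw`no results (?:found|returned|available)`,
    String.raw`fail(?:ed|ure) to (?:find|complete|set|select|navigate)`,
    String.raw`unfortunately`,
    String.raw`I was unable`,
    String.raw`task is incomplete`,
  ].join('|'),
  'i',
);
```

Keeping each phrase class as its own alternative makes the pattern easy to audit and extend one admission phrase at a time.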

Expected impact:
  Agent self-report accuracy on WebVoyager goes from 73% (inflated) to
  ~53% (honest). Agent-judge agreement goes from 73% back toward 100%.
  The honest agent pass rate is now trustworthy — when bad says it
  succeeded, it actually did.

993/993 tests pass. TypeScript clean.
drewstone merged commit 7f7ce69 into main on Apr 9, 2026
5 checks passed