
Gen 10 — DOM index extraction + bigger snapshot + cost cap#60

Merged
drewstone merged 3 commits into `main` from `gen10-dom-index-extraction` on Apr 9, 2026

@drewstone
Contributor

Status: DRAFT — awaiting real-web gauntlet validation

This is the capability change Gen 9 was missing. Three coordinated changes ship together:

| Change | What it solves |
| --- | --- |
| A: `extractWithIndex` action (pick-by-content over pick-by-selector) | npm/mdn/python-docs failures where the LLM can't write a precise selector for data it hasn't seen |
| C: bigger snapshot + content line preservation (term/definition/code/pre/paragraph) | MDN/Python docs/W3C spec/arxiv pages where the data lives in `<dl>`/`<dt>`/`<dd>`/`<code>` and the old `budgetSnapshot` filter dropped them as "decorative" |
| Cost cap: 100K-token per-case hard cap with `cost_cap_exceeded` abort | the Gen 9 reddit-style death spirals (132K-173K tokens / case) |

Why this is NOT another Gen 9

Gen 9 failed because it was a mechanism change without a capability change: more turns for the same LLM picking the same wrong selector. Gen 10 is a capability change — the LLM gets new information (a numbered, text-rich element index) that it didn't have before. Browser-use's per-action loop wins on the failing tasks because of exactly this mechanism.

What ships

A — extractWithIndex

  • New action `{action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}`
  • Returns a numbered list, e.g. `[0] <p> data-testid="downloads" selector: [data-testid="downloads"] text: Weekly downloads: 26,543,821`
  • Wide query + content filter beats narrow selector. The LLM picks by index in the next turn.
  • Wired into per-action loop (handler at `runner.ts`), executePlan (capture into `lastExtractOutput`, fall-through with match list), planner system prompt, validateAction parser, supervisor signature.
  • Helper at `src/drivers/extract-with-index.ts` (browser-side query, hidden-element skipping, stable-selector building, 80-match cap).
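
A minimal sketch of the action shape and the numbered-list output, assuming the `{index, tag, text, attributes, selector}` match fields named in the commit message. `IndexedMatch` and `formatMatches` are illustrative names, not the identifiers used in `src/drivers/extract-with-index.ts`.

```typescript
// Hypothetical shapes: field names follow the PR text, everything else is assumed.
interface ExtractWithIndexAction {
  action: 'extractWithIndex';
  query: string;     // wide CSS selector list, e.g. 'p, dd, code'
  contains?: string; // optional text filter applied to each match
}

interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
  attributes: Record<string, string>;
  selector: string;
}

// Render one numbered line per match, mirroring the sample output above.
function formatMatches(matches: IndexedMatch[]): string {
  return matches
    .map((m) => {
      const attrs = Object.keys(m.attributes)
        .map((attrName) => `${attrName}="${m.attributes[attrName]}"`)
        .join(' ');
      const attrPart = attrs ? ` ${attrs}` : '';
      return `[${m.index}] <${m.tag}>${attrPart} selector: ${m.selector} text: ${m.text}`;
    })
    .join('\n');
}

const exampleAction: ExtractWithIndexAction = {
  action: 'extractWithIndex',
  query: 'p, dd, code',
  contains: 'downloads',
};

const exampleMatches: IndexedMatch[] = [
  {
    index: 0,
    tag: 'p',
    text: 'Weekly downloads: 26,543,821',
    attributes: { 'data-testid': 'downloads' },
    selector: '[data-testid="downloads"]',
  },
];

const formattedMatches = formatMatches(exampleMatches);
```

Because each line carries both the index and the visible text, the LLM can answer the next turn with an index rather than a guessed selector.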

C — bigger snapshot

  • `budgetSnapshot` filter now preserves `term`/`definition`/`code`/`pre`/`paragraph` content lines
  • Default cap raised 16k → 24k chars
  • Planner cap raised 12k → 24k (planner is the most important caller for extraction tasks — it writes the runScript on first observation)
  • Same-page snapshot stays at 8k (LLM has already seen the page)
  • Empirical verification: Playwright DOES emit term/definition/code lines with text content. The bug was the filter, not the snapshot pipeline.
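
To make the filter change concrete, here is a simplified sketch. It assumes snapshot lines look like `- role: text` and that the filter is a role allow-list plus a character cap. The real `budgetSnapshot` also has a priority bucket that this sketch leaves out, and the role sets here are assumptions apart from the five content roles named above.

```typescript
// Assumed role sets; only the CONTENT_ROLES entries come from the PR text.
const INTERACTIVE_ROLES = new Set(['button', 'link', 'textbox', 'checkbox', 'combobox']);
const CONTENT_ROLES = new Set(['term', 'definition', 'code', 'pre', 'paragraph']);

// Pull the role name off a line such as `  - code: flatMap(callbackFn)`.
function roleOf(line: string): string | null {
  const match = /^\s*-\s+([A-Za-z]+)/.exec(line);
  return match?.[1] ?? null;
}

// Keep interactive and content lines, in order, until the character cap is reached.
function budgetSnapshotSketch(snapshot: string, maxChars: number = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split('\n')) {
    const role = roleOf(line);
    if (role === null) continue;
    if (!INTERACTIVE_ROLES.has(role) && !CONTENT_ROLES.has(role)) continue;
    if (used + line.length + 1 > maxChars) break; // +1 accounts for the newline
    kept.push(line);
    used += line.length + 1;
  }
  return kept.join('\n');
}

const snapshotFixture = [
  '- img "decorative banner"',
  '- term:',
  '  - code: flatMap(callbackFn)',
  '- definition: Returns a new array.',
  '- button "Copy"',
].join('\n');

const trimmedSnapshot = budgetSnapshotSketch(snapshotFixture);
```

With the content roles in the allow-list, the `code` line carrying the method signature survives while the image line is still dropped.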

Cost cap

  • `RunState.totalTokensUsed` accumulator + `tokenBudget` (default 100K, override via `Scenario.tokenBudget` or `BAD_TOKEN_BUDGET` env)
  • `isTokenBudgetExhausted` checked at top of every loop iteration before next LLM call
  • Returns `success: false, reason: 'cost_cap_exceeded: ...'` so the bench harness reports it cleanly
  • Calibration: Gen 8 real-web ~6K, tier 1 form-multistep 60K, Gen 9 death spirals 132K-173K → 100K is well above normal max, well below death spirals
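
A rough sketch of how the accumulator and gate fit together. The override precedence (scenario, then env, then default), the `>=` threshold, and the names `RunStateSketch`, `resolveTokenBudget`, and `runLoopSketch` are assumptions for illustration. The env is passed in as a plain object so the sketch stays self-contained.

```typescript
const DEFAULT_TOKEN_BUDGET = 100000;

type EnvLike = Record<string, string | undefined>;

// Assumed precedence: Scenario.tokenBudget, then BAD_TOKEN_BUDGET, then the default.
function resolveTokenBudget(scenarioBudget: number | undefined, env: EnvLike): number {
  if (scenarioBudget !== undefined) return scenarioBudget;
  const fromEnv = Number(env['BAD_TOKEN_BUDGET']);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : DEFAULT_TOKEN_BUDGET;
}

class RunStateSketch {
  totalTokensUsed = 0;
  tokenBudget: number;

  constructor(tokenBudget: number) {
    this.tokenBudget = tokenBudget;
  }

  recordUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  // Assumption: the budget counts as exhausted once usage reaches it.
  get isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

interface RunOutcome {
  success: boolean;
  reason?: string;
}

// Each array entry stands in for the tokens one LLM call would consume.
function runLoopSketch(state: RunStateSketch, tokensPerTurn: number[]): RunOutcome {
  for (const tokens of tokensPerTurn) {
    // Gate is checked at the top of the iteration, before the next LLM call.
    if (state.isTokenBudgetExhausted) {
      return {
        success: false,
        reason: `cost_cap_exceeded: ${state.totalTokensUsed}/${state.tokenBudget} tokens`,
      };
    }
    state.recordUsage(tokens);
  }
  return { success: true };
}

const capState = new RunStateSketch(resolveTokenBudget(undefined, {}));
const capOutcome = runLoopSketch(capState, [60000, 50000, 10000]);
```

Checking before the call means a run can overshoot the cap by at most one turn, which is the trade-off for not needing to predict a call's cost in advance.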

Tests

981 → 999 passing (+18 net new):

  • `tests/budget-snapshot.test.ts` — 6 (filter preservation, content lines, priority bucket, paragraph handling)
  • `tests/extract-with-index.test.ts` — 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful fail, stable selector, formatter, parser via Brain.parse)
  • `tests/run-state.test.ts` — 7 in 'Gen 10 cost cap' describe (default, env override, accumulator, exhaustion threshold)
  • `tests/runner-execute-plan.test.ts` — 2 (extractWithIndex deviation with match list, cost cap exhaustion)

Gates

  • ✅ TypeScript clean (`pnpm exec tsc --noEmit`)
  • ✅ Boundaries clean (`pnpm check:boundaries`)
  • ✅ Full test suite (`pnpm test`) — 981/981 → growing to 999/999
  • ✅ Tier1 deterministic gate PASSED
  • ⏳ 3-rep real-web gauntlet — RUNNING
  • ⏳ 5-rep promotion gate — gated on 3-rep wins

Lesson from Gen 9 baked in

```
Per CLAUDE.md rule #6: NO PROMOTION until 5-rep validation shows pass-rate
gain ≥+2 AND no per-task cost regression >2x AND no death-spiral runs.
```
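
The rule above can be read as a three-part predicate. This is only a sketch: it assumes "+2" means two additional passes, that cost regression is compared on per-task mean cost, and that the harness supplies a death-spiral count. All type and field names are hypothetical.

```typescript
// All field names are hypothetical; only the three conditions come from the rule.
interface TaskCost {
  task: string;
  baselineMeanCost: number;  // USD per run on the baseline generation
  candidateMeanCost: number; // USD per run on the candidate generation
}

interface ValidationSummary {
  baselinePasses: number;
  candidatePasses: number;
  taskCosts: TaskCost[];
  deathSpiralRuns: number; // supplied by the harness; its exact definition is not specified here
}

function canPromote(summary: ValidationSummary): boolean {
  // Assumption: the "+2" threshold is measured in passes.
  const gainOk = summary.candidatePasses - summary.baselinePasses >= 2;
  // No task may cost more than twice its baseline mean.
  const costOk = summary.taskCosts.every(
    (t) => t.candidateMeanCost <= 2 * t.baselineMeanCost,
  );
  return gainOk && costOk && summary.deathSpiralRuns === 0;
}
```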

This PR will NOT merge until the gauntlet data is in. If 3-rep doesn't show movement, the build still has value (the cost cap is a real safety net for any future iteration mechanism, the snapshot filter fix is independently useful, and extractWithIndex is a real primitive). Even in the worst case it's a strict superset of Gen 8 behavior — extractWithIndex never fires unless the LLM emits it.

What 3-rep validation will reveal

The 4 tasks I expect to move:

  • mdn-array-flatmap (2/3 → 3/3): the previously dropped content lines are now visible in the snapshot AND extractWithIndex with `contains:'flatMap'` returns the signature directly
  • python-docs-method-signature (2/3 → 3/3): same fix, the content lines are now in the snapshot
  • npm-package-downloads (1/3 → 2-3/3): extractWithIndex with `contains:'downloads'` reads the XHR-loaded text directly
  • w3c-html-spec-find-element (3/3 → 3/3): bigger snapshot helps long-document navigation

The risk: tasks where extractWithIndex returns 80 matches and the LLM picks the wrong one.

Refs

  • Pursuit doc: `.evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md`
  • Gen 9 retrospective: same doc
  • Branch: `gen10-dom-index-extraction`

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action — pick-by-content over pick-by-selector

   New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
   that returns a numbered, text-rich list of every visible element matching
   the query. The agent picks elements by index in the next turn.

   This is the architectural fix Gen 9 was missing: instead of asking the LLM
   to write a precise CSS selector for data it hasn't seen yet (the failure
   mode on npm/mdn/python-docs), the wide query finds candidates and the
   response shows actual textContent so the LLM can pick by content match.

   Wired into:
   - src/types.ts (ExtractWithIndexAction type, added to Action union)
   - src/brain/index.ts (validateAction parser, system prompt, planner prompt,
     data-extraction rule #25 explaining when to prefer extractWithIndex over
     runScript on extraction tasks)
   - src/drivers/extract-with-index.ts (browser-side query helper, returns
     {index, tag, text, attributes, selector} for each visible match, capped
     at 80 matches)
   - src/drivers/playwright.ts (driver.execute dispatch, returns formatted
     output as data so executePlan can capture it like runScript)
   - src/runner/runner.ts (per-action loop handler with feedback injection,
     executePlan capture into lastExtractOutput, plan-ends-with-extract
     fall-through to per-action loop with the match list as REPLAN context)
   - src/supervisor/policy.ts (action signature for stuck-detection)
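
   The selection step of the helper (skip hidden elements, apply the contains
   filter, cap at 80) can be sketched over plain records rather than live DOM
   nodes. The `CandidateElement` shape and the case-insensitive match are
   assumptions for illustration, not taken from the repo.

```typescript
// Plain-record stand-in for a DOM element; the real helper runs in the browser.
interface CandidateElement {
  tag: string;
  text: string;
  visible: boolean;
}

interface SelectedMatch {
  index: number;
  tag: string;
  text: string;
}

const MAX_MATCHES = 80;

function selectMatches(candidates: CandidateElement[], contains?: string): SelectedMatch[] {
  // Assumption: the contains filter is case-insensitive.
  const needle = contains?.toLowerCase();
  const selected: SelectedMatch[] = [];
  for (const el of candidates) {
    if (!el.visible) continue; // hidden elements never get an index
    if (needle !== undefined && el.text.toLowerCase().indexOf(needle) === -1) continue;
    selected.push({ index: selected.length, tag: el.tag, text: el.text });
    if (selected.length >= MAX_MATCHES) break; // hard cap on the list size
  }
  return selected;
}

const candidateFixture: CandidateElement[] = [
  { tag: 'p', text: 'Weekly Downloads: 26,543,821', visible: true },
  { tag: 'p', text: 'downloads (hidden copy)', visible: false },
  { tag: 'code', text: 'npm i example', visible: true },
];

const selectedFixture = selectMatches(candidateFixture, 'downloads');
```

   Indices are assigned after filtering, so the numbers the LLM sees are
   contiguous and refer only to elements it was actually shown.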

C) Bigger snapshot + content-line preservation

   src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
   paragraph content lines (which previously got dropped as "decorative" by
   the interactive-only filter). These are exactly the lines that carry the
   data agents need on MDN/Python docs/W3C spec/arxiv pages.

   Budgets raised:
   - Default budgetSnapshot cap: 16k → 24k chars
   - Decide() new-page snapshot: 16k → 24k
   - Planner snapshot: 12k → 24k (planner is the most important caller for
     extraction tasks because it writes the runScript on the first observation)

   Same-page snapshot stays at 8k (after the LLM has already seen the page).

   Empirical verification: probed Playwright's locator.ariaSnapshot() output
   on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
   — confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
   content. The bug was in the budgetSnapshot filter dropping them, not in
   the snapshot pipeline missing them.

Cost cap (mandatory safety net for any iteration-based mechanism)

   src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
   100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
   isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
   top of every loop iteration (before the next LLM call) and returns
   `cost_cap_exceeded` if exceeded.

   Calibration:
   - Gen 8 real-web mean: ~6k tokens (well under 100k)
   - Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
   - Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

   100k = above any normal case I've measured, well below any death spiral.
   Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
   $0.32/173k tokens) within 5–8 turns of futility instead of running for
   the full case timeout.

Tests: 18 new (981 total, +18 from baseline)
   - tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
     priority bucket, paragraph handling)
   - tests/extract-with-index.test.ts: 13 (browser-side query, contains
     filter, hidden element skipping, invalid selector graceful failure,
     stable selector building, formatter, parser via Brain.parse)
   - tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
   - tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
     with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md
…ll-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
  1. Cost cap (100k tokens, default) bounds any death spiral
  2. Per-action loop has extractWithIndex available — when the deviation
     reason mentions "runScript returned no meaningful output", the LLM can
     respond with extractWithIndex (per data-extraction rule #25) instead
     of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
  - Detects null / undefined / empty / whitespace
  - Detects literal "null" / "undefined" / "" / ''
  - Detects empty JSON shells {} / []
  - Detects {x: null} / partial-extraction patterns (any null = retry)
  - Detects placeholder patterns via hasPlaceholderPattern
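
A sketch of the detection rules listed above. The real `hasPlaceholderPattern` is not shown in this PR, so the stand-in below uses made-up patterns, and the null check only looks at top-level values. Both are assumptions.

```typescript
// Hypothetical stand-in: the real placeholder patterns are not shown in the PR.
function hasPlaceholderPatternSketch(text: string): boolean {
  return /^(n\/a|todo|placeholder|lorem ipsum)/i.test(text.trim());
}

function isMeaningfulRunScriptOutputSketch(output: unknown): boolean {
  if (output === null || output === undefined) return false;
  const text = typeof output === 'string' ? output.trim() : JSON.stringify(output);
  // JSON.stringify can return undefined at runtime (e.g. for functions).
  if (text === undefined || text === '') return false;
  // Literal "null" / "undefined" / empty-quote strings and empty JSON shells.
  if (['null', 'undefined', '""', "''", '{}', '[]'].indexOf(text) !== -1) return false;
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed === 'object' && parsed !== null) {
      const values: unknown[] = Array.isArray(parsed)
        ? parsed
        : Object.keys(parsed).map((key) => (parsed as Record<string, unknown>)[key]);
      if (values.length === 0) return false;
      // Partial extraction: any top-level null means the result is not usable.
      if (values.some((value) => value === null)) return false;
    }
  } catch {
    // Not JSON: treat it as plain text and fall through to the placeholder check.
  }
  return !hasPlaceholderPatternSketch(text);
}
```

For example, `'{"x": null}'` and the literal string `'null'` are both rejected, while `'{"downloads": 26543821}'` is accepted.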

executePlan auto-complete branch hardened:
  - Old: auto-complete fires whenever lastRunScriptOutput is truthy
  - New: auto-complete fires only when isMeaningfulRunScriptOutput is true
  - Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
  - When the last step is runScript and the output isn't meaningful, return
    deviated with a reason that names the failure AND points the per-action
    LLM at extractWithIndex (the Gen 10 recovery tool)
  - This is the path that did NOT work in Gen 9 alone — but in Gen 10 the
    per-action loop has extractWithIndex available AND the cost cap bounds
    runaway recovery loops

Tests cherry-picked: 12 (all pass)
  - 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
  - 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
    {x:null}, declines on literal "null", positive control auto-completes
    on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:

  Gen 8 same-day 5-rep: 29/50 = 58%
  Gen 10 5-rep:         37/50 = 74%
  Delta:                +8 passes (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
  - npm-package-downloads:    0/5 -> 5/5 (+5) extractWithIndex
  - w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
  - github-pr-count:          4/5 -> 5/5 (+1)
  - stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
  - Raw cost: +59% ($0.017 -> $0.027)
  - Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
  - Death spirals: 0 (cost cap held; peak run $0.16)
  - Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)
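
The cost-per-pass framing is total spend divided by passing runs. The sketch
below assumes the quoted $0.017 and $0.027 are mean costs per run over the 50
runs. Because those inputs are already rounded, the results land near the
reported figures rather than exactly on them.

```typescript
// Total spend across all runs, divided by the number of passing runs.
function costPerPass(meanCostPerRun: number, runs: number, passes: number): number {
  return (meanCostPerRun * runs) / passes;
}

// Inputs are the rounded figures quoted above (assumed to be per-run means).
const gen8CostPerPass = costPerPass(0.017, 50, 29);
const gen10CostPerPass = costPerPass(0.027, 50, 37);

// Relative increases, as fractions (0.5 means +50%).
const rawCostIncrease = 0.027 / 0.017 - 1;
const perPassIncrease = gen10CostPerPass / gen8CostPerPass - 1;
```

The per-pass increase is smaller than the raw increase because the extra spend is spread over more passing runs.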

Failure modes that remain (Gen 10.1 candidates):
  - Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
    Same in Gen 8, not a regression. Fixable via prompt, not architecture.
  - Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
  - mdn/arxiv variance within Wilson 95% CI overlap.
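
For reference, a Wilson 95% interval check can be sketched as below. The 2/5
and 3/5 counts are illustrative inputs only, not the measured mdn/arxiv
results. At 5 reps the intervals are wide, so single-pass differences between
generations overlap.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 for ~95%).
function wilsonInterval(passes: number, runs: number, z: number = 1.96): Interval {
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return { lower: center - half, upper: center + half };
}

function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.lower <= b.upper && b.lower <= a.upper;
}

// Illustrative counts only: 2 of 5 passes versus 3 of 5 passes.
const intervalTwoOfFive = wilsonInterval(2, 5);
const intervalThreeOfFive = wilsonInterval(3, 5);
const fiveRepOverlap = intervalsOverlap(intervalTwoOfFive, intervalThreeOfFive);
```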

Files:
  - .changeset/gen10-dom-index-extraction.md (honest writeup)
  - .evolve/progress.md (round 2 result + per-task table)
  - .evolve/current.json (status: round2_complete_promote)
  - .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
@drewstone drewstone marked this pull request as ready for review April 9, 2026 02:14
@drewstone drewstone merged commit a12e466 into main Apr 9, 2026
5 checks passed