Gen 10 — DOM index extraction + bigger snapshot + cost cap by drewstone · Pull Request #60 · tangle-network/browser-agent-driver

drewstone · 2026-04-09T01:25:17Z

Status: DRAFT — awaiting real-web gauntlet validation

This is the capability change Gen 9 was missing. Three coordinated changes ship together:

| | change | what it solves |
|---|---|---|
| A | `extractWithIndex` action: pick-by-content over pick-by-selector | npm/mdn/python-docs failures where the LLM can't write a precise selector for data it hasn't seen |
| C | bigger snapshot + content line preservation (term/definition/code/pre/paragraph) | MDN/Python docs/W3C spec/arxiv pages where the data lives in `<dl>/<dt>/<dd>/<code>` and the old budgetSnapshot filter dropped them as "decorative" |
| cost cap | 100K-token per-case hard cap with `cost_cap_exceeded` abort | the Gen 9 reddit-style death spirals (132K-173K tokens / case) |

Why this is NOT another Gen 9

Gen 9 failed because it was a mechanism change without a capability change: more turns for the same LLM picking the same wrong selector. Gen 10 is a capability change — the LLM gets new information (a numbered, text-rich element index) that it didn't have before. Browser-use's per-action loop wins on the failing tasks because of exactly this mechanism.

What ships

A — extractWithIndex

New action `{action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}`
Returns a numbered list: `[0] <p> data-testid="downloads" selector: [data-testid="downloads"] text: Weekly downloads: 26,543,821`
Wide query + content filter beats narrow selector. The LLM picks by index in the next turn.
Wired into per-action loop (handler at `runner.ts`), executePlan (capture into `lastExtractOutput`, fall-through with match list), planner system prompt, validateAction parser, supervisor signature.
Helper at `src/drivers/extract-with-index.ts` (browser-side query, hidden-element skipping, stable-selector building, 80-match cap).
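
To make the pick-by-content idea concrete, here is a minimal sketch of the filter-and-number step, run over plain objects rather than a live DOM. The `Candidate` shape, the case-insensitive `contains` match, and the selector priority (`data-testid`, then `id`, then tag) are assumptions for illustration; only the output fields and the 80-match cap come from this PR.

```typescript
// Hypothetical input shape: the real helper queries the page in the browser.
interface Candidate {
  tag: string;
  text: string;
  attributes: Record<string, string>;
  visible: boolean;
}

interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
  attributes: Record<string, string>;
  selector: string;
}

const MAX_MATCHES = 80; // cap stated in the PR

// Assumed priority: stable test id, then element id, then the bare tag name.
function buildSelector(c: Candidate): string {
  const testId = c.attributes["data-testid"];
  if (testId !== undefined) return `[data-testid="${testId}"]`;
  const id = c.attributes["id"];
  if (id !== undefined) return `#${id}`;
  return c.tag;
}

// Skip hidden elements, apply the optional content filter, number what is left.
function indexMatches(candidates: Candidate[], contains?: string): IndexedMatch[] {
  const needle = contains?.toLowerCase();
  const matches: IndexedMatch[] = [];
  for (const c of candidates) {
    if (matches.length >= MAX_MATCHES) break;
    if (!c.visible) continue;
    if (needle !== undefined && !c.text.toLowerCase().includes(needle)) continue;
    matches.push({
      index: matches.length,
      tag: c.tag,
      text: c.text.trim(),
      attributes: c.attributes,
      selector: buildSelector(c),
    });
  }
  return matches;
}

// One line per match, in the same layout as the example output above.
function formatMatches(matches: IndexedMatch[]): string {
  return matches
    .map((m) => {
      const attrs = Object.entries(m.attributes)
        .map(([k, v]) => `${k}="${v}"`)
        .join(" ");
      return `[${m.index}] <${m.tag}> ${attrs} selector: ${m.selector} text: ${m.text}`
        .replace(/\s+/g, " ");
    })
    .join("\n");
}
```

The point of the wide query is that the model reads real text before committing to a selector, so a vague query such as `'p, dd, code'` is acceptable as long as the `contains` filter keeps the list short.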

C — bigger snapshot

`budgetSnapshot` filter now preserves `term`/`definition`/`code`/`pre`/`paragraph` content lines
Default cap raised 16k → 24k chars
Planner cap raised 12k → 24k (planner is the most important caller for extraction tasks — it writes the runScript on first observation)
Same-page snapshot stays at 8k (LLM has already seen the page)
Empirical verification: Playwright DOES emit term/definition/code lines with text content. The bug was the filter, not the snapshot pipeline.
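
The filter change can be sketched roughly as follows. This is not the repository's `budgetSnapshot`; it is a simplified stand-in that assumes one role per snapshot line, an illustrative set of interactive roles, and a plain character budget. Only the preserved content roles and the 24k default come from this PR.

```typescript
// Content roles the PR says are now preserved.
const CONTENT_ROLES = new Set(["term", "definition", "code", "pre", "paragraph"]);
// Assumed set of roles the old filter already kept (illustrative only).
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox", "heading"]);

const DEFAULT_CAP = 24_000; // default cap after this PR, in characters

// Read the role name from a line such as `- code: flatMap(callbackFn)`.
function roleOf(line: string): string | null {
  const m = /^\s*-\s*([a-z]+)/.exec(line);
  return m?.[1] ?? null;
}

function keepLine(line: string): boolean {
  const role = roleOf(line);
  if (role === null) return false;
  if (INTERACTIVE_ROLES.has(role)) return true;
  // Content lines are kept only when text follows the role name.
  if (CONTENT_ROLES.has(role)) return /:\s*\S/.test(line);
  return false;
}

// Keep qualifying lines in document order until the character budget runs out.
function budgetSnapshotSketch(snapshot: string, maxChars: number = DEFAULT_CAP): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split("\n")) {
    if (!keepLine(line)) continue;
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }
  return kept.join("\n");
}
```

Under these assumptions a `- code: flatMap(callbackFn)` line survives filtering, while a bare `- term:` wrapper with no text of its own is dropped.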

Cost cap

`RunState.totalTokensUsed` accumulator + `tokenBudget` (default 100K, override via `Scenario.tokenBudget` or `BAD_TOKEN_BUDGET` env)
`isTokenBudgetExhausted` checked at top of every loop iteration before next LLM call
Returns `success: false, reason: 'cost_cap_exceeded: ...'` so the bench harness reports it cleanly
Calibration: Gen 8 real-web ~6K, tier 1 form-multistep 60K, Gen 9 death spirals 132K-173K → 100K is well above normal max, well below death spirals
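
A minimal sketch of the gate, assuming a standalone meter class rather than the real `RunState`. The precedence between `Scenario.tokenBudget` and `BAD_TOKEN_BUDGET` and the `>=` threshold are assumptions; the PR only states that both overrides exist and that the check runs before the next LLM call.

```typescript
const DEFAULT_TOKEN_BUDGET = 100_000;

interface BudgetOptions {
  scenarioBudget?: number;
  env?: Record<string, string | undefined>; // pass process.env in a Node runtime
}

// Assumed precedence: scenario override, then env var, then the default.
function resolveTokenBudget(opts: BudgetOptions = {}): number {
  if (opts.scenarioBudget !== undefined && opts.scenarioBudget > 0) {
    return opts.scenarioBudget;
  }
  const raw = opts.env?.["BAD_TOKEN_BUDGET"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TOKEN_BUDGET;
}

class TokenMeter {
  totalTokensUsed = 0;
  tokenBudget: number;

  constructor(tokenBudget: number) {
    this.tokenBudget = tokenBudget;
  }

  record(tokens: number): void {
    this.totalTokensUsed += Math.max(0, tokens);
  }

  get isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

interface RunResult {
  success: boolean;
  reason?: string;
  turns: number;
}

// Simulated loop: each entry stands in for the token cost of one LLM call.
function runLoop(meter: TokenMeter, tokensPerTurn: number[]): RunResult {
  let turns = 0;
  for (const cost of tokensPerTurn) {
    // Gate is checked before the call, so an exhausted budget stops the next turn.
    if (meter.isTokenBudgetExhausted) {
      return {
        success: false,
        reason: `cost_cap_exceeded: ${meter.totalTokensUsed}/${meter.tokenBudget} tokens`,
        turns,
      };
    }
    meter.record(cost);
    turns++;
  }
  return { success: true, turns };
}
```

Because the check happens before each call rather than after, a single expensive turn can overshoot the budget. The cap bounds how many further turns are spent, not the exact total.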

Tests

981 → 999 passing (+18 net new):

`tests/budget-snapshot.test.ts` — 6 (filter preservation, content lines, priority bucket, paragraph handling)
`tests/extract-with-index.test.ts` — 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful fail, stable selector, formatter, parser via Brain.parse)
`tests/run-state.test.ts` — 7 in 'Gen 10 cost cap' describe (default, env override, accumulator, exhaustion threshold)
`tests/runner-execute-plan.test.ts` — 2 (extractWithIndex deviation with match list, cost cap exhaustion)

Gates

✅ TypeScript clean (`pnpm exec tsc --noEmit`)
✅ Boundaries clean (`pnpm check:boundaries`)
✅ Full test suite (`pnpm test`) — 981/981 → growing to 999/999
✅ Tier1 deterministic gate PASSED
⏳ 3-rep real-web gauntlet — RUNNING
⏳ 5-rep promotion gate — gated on 3-rep wins

Lesson from Gen 9 baked in

```
Per CLAUDE.md rule #6: NO PROMOTION until 5-rep validation shows pass-rate
gain ≥+2 AND no per-task cost regression >2x AND no death-spiral runs.
```
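
The rule above can be read as a three-part check. The sketch below is one possible encoding, not code from this repository. It assumes "pass-rate gain ≥+2" means at least two additional passing runs summed across tasks, and that cost is compared per task on mean cost. All type and function names are hypothetical.

```typescript
interface TaskStats {
  passes: number;
  meanCostUsd: number;
}

interface GateInput {
  baseline: Record<string, TaskStats>;
  candidate: Record<string, TaskStats>;
  deathSpiralRuns: number; // number of candidate runs classified as death spirals
}

interface GateVerdict {
  promote: boolean;
  reasons: string[];
}

function checkPromotionGate(input: GateInput): GateVerdict {
  const reasons: string[] = [];
  let gain = 0;
  for (const [task, base] of Object.entries(input.baseline)) {
    const cand = input.candidate[task];
    if (cand === undefined) {
      reasons.push(`missing candidate data for ${task}`);
      continue;
    }
    gain += cand.passes - base.passes;
    // Per-task cost regression: candidate more than 2x the baseline mean.
    if (base.meanCostUsd > 0 && cand.meanCostUsd > 2 * base.meanCostUsd) {
      reasons.push(`cost regression >2x on ${task}`);
    }
  }
  if (gain < 2) reasons.push(`pass gain ${gain} is below +2`);
  if (input.deathSpiralRuns > 0) {
    reasons.push(`${input.deathSpiralRuns} death-spiral run(s)`);
  }
  return { promote: reasons.length === 0, reasons };
}
```

Returning the list of failed conditions rather than a bare boolean keeps the verdict auditable when a promotion is blocked.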

This PR will NOT merge until the gauntlet data is in. If 3-rep doesn't show movement, the build still has value (the cost cap is a real safety net for any future iteration mechanism, the snapshot filter fix is independently useful, and extractWithIndex is a real primitive). Even in the worst case it's a strict superset of Gen 8 behavior — extractWithIndex never fires unless the LLM emits it.

What 3-rep validation will reveal

The 4 tasks I expect to move:

mdn-array-flatmap (2/3 → 3/3): the previously dropped content lines are now visible in the snapshot AND extractWithIndex with `contains:'flatMap'` returns the signature directly
python-docs-method-signature (2/3 → 3/3): same fix, the content lines are now in the snapshot
npm-package-downloads (1/3 → 2-3/3): extractWithIndex with `contains:'downloads'` reads the XHR-loaded text directly
w3c-html-spec-find-element (3/3 → 3/3): bigger snapshot helps long-document navigation

The risk: tasks where extractWithIndex returns 80 matches and the LLM picks the wrong one.

Refs

Pursuit doc: `.evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md`
Gen 9 retrospective: same doc
Branch: `gen10-dom-index-extraction`

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action: pick-by-content over pick-by-selector

New action `{action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}` that returns a numbered, text-rich list of every visible element matching the query. The agent picks elements by index in the next turn.

This is the architectural fix Gen 9 was missing: instead of asking the LLM to write a precise CSS selector for data it hasn't seen yet (the failure mode on npm/mdn/python-docs), the wide query finds candidates and the response shows actual textContent so the LLM can pick by content match.

Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt, data-extraction rule #25 explaining when to prefer extractWithIndex over runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns {index, tag, text, attributes, selector} for each visible match, capped at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection, executePlan capture into lastExtractOutput, plan-ends-with-extract fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)

C) Bigger snapshot + content-line preservation

src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/paragraph content lines (which previously got dropped as "decorative" by the interactive-only filter). These are exactly the lines that carry the data agents need on MDN/Python docs/W3C spec/arxiv pages.

Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for extraction tasks because it writes the runScript on the first observation)

Same-page snapshot stays at 8k (after the LLM has already seen the page).

Empirical verification: probed Playwright's locator.ariaSnapshot() output on a fixture with `<dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>` and confirmed Playwright DOES emit `term`/`definition`/`code` lines with text content. The bug was in the budgetSnapshot filter dropping them, not in the snapshot pipeline missing them.

Cost cap (mandatory safety net for any iteration-based mechanism)

src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default 100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the top of every loop iteration (before the next LLM call) and returns `cost_cap_exceeded` if exceeded.

Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

100k = above any normal case I've measured, well below any death spiral. Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4: $0.32/173k tokens) within 5–8 turns of futility instead of running for the full case timeout.

Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines, priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful failure, stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS, Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

…ll-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the pass rate AND introduced cost regressions on previously-passing tasks (reddit death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available: when the deviation reason mentions "runScript returned no meaningful output", the LLM can respond with extractWithIndex (per data-extraction rule #25) instead of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern

executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return deviated with a reason that names the failure AND points the per-action LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone, but in Gen 10 the per-action loop has extractWithIndex available AND the cost cap bounds runaway recovery loops

Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on {x:null}, declines on literal "null", positive control auto-completes on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9 runScript-empty fall-through are mutually exclusive (different last-step types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
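
The detection rules listed for `isMeaningfulRunScriptOutput()` could look roughly like the sketch below. It is written from the bullet list alone, not from the repository source. The placeholder regex is a hypothetical stand-in for `hasPlaceholderPattern`, and treating any empty or null nested value as a failure is an assumption that follows the "any null = retry" note.

```typescript
// Hypothetical stand-in: the real hasPlaceholderPattern is not shown in this PR.
function hasPlaceholderPatternSketch(text: string): boolean {
  return /^(todo|tbd|n\/a|<[^>]*>)$/i.test(text.trim());
}

// String forms of "nothing": null, undefined, and empty quoted strings.
const EMPTY_LITERALS = new Set(["null", "undefined", '""', "''"]);

function isMeaningfulOutputSketch(output: unknown): boolean {
  if (output === null || output === undefined) return false;

  if (typeof output === "string") {
    const text = output.trim();
    if (text === "" || EMPTY_LITERALS.has(text)) return false;
    if (hasPlaceholderPatternSketch(text)) return false;
    try {
      // If the string is JSON, inspect the structure it encodes.
      const parsed: unknown = JSON.parse(text);
      return typeof parsed === "object" ? isMeaningfulOutputSketch(parsed) : true;
    } catch {
      return true; // plain, non-JSON text counts as real output
    }
  }

  if (Array.isArray(output)) {
    // Empty shells fail; so does any null or empty member.
    return output.length > 0 && output.every((v) => isMeaningfulOutputSketch(v));
  }

  if (typeof output === "object") {
    const values = Object.values(output as Record<string, unknown>);
    return values.length > 0 && values.every((v) => isMeaningfulOutputSketch(v));
  }

  return true; // numbers and booleans
}
```

With a check like this in front of auto-complete, the literal string `"null"` no longer counts as a truthy result, which is the bug the hardened branch is described as catching.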

5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
- Gen 8 same-day 5-rep: 29/50 = 58%
- Gen 10 5-rep: 37/50 = 74%
- Delta: +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}. Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.

Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
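
For readers who want to check the "Wilson 95% CI overlap" reasoning on 5-rep results, here is a small sketch of the standard Wilson score interval. The function names are illustrative and not taken from the repository. `z = 1.96` is the usual value for a 95% interval.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (passes out of runs).
function wilsonInterval(passes: number, runs: number, z: number = 1.96): Interval {
  if (runs <= 0) throw new RangeError("runs must be positive");
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  // Clamp to [0, 1] to absorb floating-point drift at the extremes.
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Two intervals overlap when neither lies entirely above the other.
function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.lower <= b.upper && b.lower <= a.upper;
}
```

Applied to the per-task numbers above, the 0/5 and 5/5 intervals for npm-package-downloads do not overlap, while the 2/5 and 3/5 intervals for stackoverflow-answer-count do. At five reps, only large swings separate cleanly from run-to-run variance.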

drewstone added 3 commits April 8, 2026 18:23

drewstone marked this pull request as ready for review April 9, 2026 02:14

drewstone merged commit a12e466 into main Apr 9, 2026
5 checks passed

github-actions bot mentioned this pull request Apr 9, 2026

Release: version packages #61

Open

drewstone mentioned this pull request Apr 9, 2026

Gen 11 — master comparison truth table (cross-framework + WebVoyager + multi-model) #62

Merged

7 tasks
