Gen 10 — DOM index extraction + bigger snapshot + cost cap#60
Merged
Three coordinated changes that ship together as Gen 10:
A) extractWithIndex action — pick-by-content over pick-by-selector
New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
that returns a numbered, text-rich list of every visible element matching
the query. The agent picks elements by index in the next turn.
This is the architectural fix Gen 9 was missing: instead of asking the LLM
to write a precise CSS selector for data it hasn't seen yet (the failure
mode on npm/mdn/python-docs), the wide query finds candidates and the
response shows actual textContent so the LLM can pick by content match.
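The action shape described above can be sketched as a type plus a small validator. This is a minimal illustration, not the code in src/types.ts or src/brain/index.ts: the `parseExtractWithIndex` name and the choice to treat `contains` as optional are assumptions.

```typescript
// Hypothetical shape; the real ExtractWithIndexAction lives in src/types.ts.
interface ExtractWithIndexAction {
  action: 'extractWithIndex';
  query: string; // wide CSS selector list, e.g. 'p, dd, code'
  contains?: string; // optional text filter (assumed optional)
}

// Assumed validator: returns a typed action, or null for malformed input.
function parseExtractWithIndex(raw: unknown): ExtractWithIndexAction | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  if (obj['action'] !== 'extractWithIndex') return null;

  const query = obj['query'];
  if (typeof query !== 'string' || query.trim() === '') return null;

  const contains = obj['contains'];
  if (contains !== undefined && typeof contains !== 'string') return null;

  const parsed: ExtractWithIndexAction = { action: 'extractWithIndex', query };
  if (typeof contains === 'string') parsed.contains = contains;
  return parsed;
}
```

Rejecting an empty `query` up front keeps a malformed LLM response from turning into a browser-side selector error.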
Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt,
data-extraction rule #25 explaining when to prefer extractWithIndex over
runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns
{index, tag, text, attributes, selector} for each visible match, capped
at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted
output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection,
executePlan capture into lastExtractOutput, plan-ends-with-extract
fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)
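To make the match list concrete, here is a rough sketch of the indexing and formatting steps, operating on plain objects rather than live DOM nodes. The `{index, tag, text, attributes, selector}` record and the cap of 80 come from the description above; the `Candidate` type, the function names, the case-insensitive `contains` match, and the exact output layout are assumptions.

```typescript
interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
  attributes: Record<string, string>;
  selector: string;
}

// Hypothetical pre-extracted element data (the real helper runs in the browser).
interface Candidate {
  tag: string;
  text: string;
  attributes: Record<string, string>;
  selector: string;
  visible: boolean;
}

const MAX_MATCHES = 80; // cap stated in the PR description

function buildIndex(candidates: Candidate[], contains?: string): IndexedMatch[] {
  const needle = contains?.toLowerCase(); // assumption: case-insensitive filter
  const out: IndexedMatch[] = [];
  for (const c of candidates) {
    if (!c.visible) continue; // hidden elements are skipped
    if (needle !== undefined && !c.text.toLowerCase().includes(needle)) continue;
    out.push({ index: out.length, tag: c.tag, text: c.text, attributes: c.attributes, selector: c.selector });
    if (out.length >= MAX_MATCHES) break;
  }
  return out;
}

// Renders the numbered, text-rich list the LLM picks from on the next turn.
function formatMatches(matches: IndexedMatch[]): string {
  return matches
    .map((m) => {
      const attrs = Object.entries(m.attributes)
        .map(([k, v]) => ` ${k}="${v}"`)
        .join('');
      return `[${m.index}] <${m.tag}>${attrs}\n    selector: ${m.selector}\n    text: ${m.text}`;
    })
    .join('\n');
}
```

Indices are assigned after filtering, so the numbers the LLM sees are always contiguous from 0.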
C) Bigger snapshot + content-line preservation
src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
paragraph content lines (which previously got dropped as "decorative" by
the interactive-only filter). These are exactly the lines that carry the
data agents need on MDN/Python docs/W3C spec/arxiv pages.
Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for
extraction tasks because it writes the runScript on the first observation)
Same-page snapshot stays at 8k (after the LLM has already seen the page).
Empirical verification: probed Playwright's locator.ariaSnapshot() output
on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
— confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
content. The bug was in the budgetSnapshot filter dropping them, not in
the snapshot pipeline missing them.
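The filter fix can be illustrated with a simplified sketch: keep interactive lines as before, and additionally keep the content roles named above, until the character budget is spent. This is not the real budgetSnapshot (which also has priority buckets); the role sets, the `- role` line format, and the function names are assumptions.

```typescript
// Assumed role sets; the content roles are the ones the PR says were being dropped.
const INTERACTIVE_ROLES = new Set(['button', 'link', 'textbox', 'checkbox', 'combobox']);
const CONTENT_ROLES = new Set(['term', 'definition', 'code', 'pre', 'paragraph']);

// Extracts the role from a snapshot line such as '  - code: flatMap(callbackFn)'.
function roleOf(line: string): string | null {
  const m = /^\s*-\s+([a-z]+)/.exec(line);
  return m?.[1] ?? null;
}

function budgetSnapshotSketch(snapshot: string, maxChars = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split('\n')) {
    const role = roleOf(line);
    if (role === null) continue;
    // The old behavior corresponds to checking INTERACTIVE_ROLES only.
    if (!INTERACTIVE_ROLES.has(role) && !CONTENT_ROLES.has(role)) continue;
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break; // stop at the first line that would overflow
    kept.push(line);
    used += cost;
  }
  return kept.join('\n');
}
```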
Cost cap (mandatory safety net for any iteration-based mechanism)
src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
top of every loop iteration (before the next LLM call) and returns
`cost_cap_exceeded` if exceeded.
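A minimal sketch of the accumulator and gate follows. The field names `totalTokensUsed`, `tokenBudget`, and `isTokenBudgetExhausted` come from the description; the class and function names, the precedence of the scenario value over the env value, and the use of `>=` for the gate are assumptions.

```typescript
const DEFAULT_TOKEN_BUDGET = 100000;

// Assumption: a scenario-level budget wins over the BAD_TOKEN_BUDGET env value.
function resolveTokenBudget(scenarioBudget?: number, envValue?: string): number {
  if (scenarioBudget !== undefined && scenarioBudget > 0) return scenarioBudget;
  const fromEnv = envValue === undefined ? NaN : Number(envValue);
  if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;
  return DEFAULT_TOKEN_BUDGET;
}

class RunStateSketch {
  totalTokensUsed = 0;
  constructor(readonly tokenBudget: number = DEFAULT_TOKEN_BUDGET) {}

  recordTokens(n: number): void {
    this.totalTokensUsed += n;
  }

  // Assumption: reaching the budget exactly counts as exhausted.
  get isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

type LoopResult = 'completed' | 'cost_cap_exceeded';

// Each array entry stands in for the token cost of one LLM call.
function runLoopSketch(state: RunStateSketch, tokensPerTurn: number[]): LoopResult {
  for (const tokens of tokensPerTurn) {
    // Gate sits at the top of the iteration, before the next LLM call.
    if (state.isTokenBudgetExhausted) return 'cost_cap_exceeded';
    state.recordTokens(tokens);
  }
  return 'completed';
}
```

Because the gate runs before each call rather than after, a run can overshoot the budget by at most one turn's worth of tokens.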
Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)
100k = above any normal case I've measured, well below any death spiral.
Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
$0.32/173k tokens) within 5–8 turns of futility instead of running for
the full case timeout.
Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains
filter, hidden element skipping, invalid selector graceful failure,
stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
with match list, cost cap exhaustion)
Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.
Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md
…ll-through
Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the pass rate AND introduced cost regressions on previously-passing tasks (reddit death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).
In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available: when the deviation reason mentions "runScript returned no meaningful output", the LLM can respond with extractWithIndex (per data-extraction rule #25) instead of retrying the same wrong selector
What this brings into Gen 10:
isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern
executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through
executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return deviated with a reason that names the failure AND points the per-action LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone, but in Gen 10 the per-action loop has extractWithIndex available AND the cost cap bounds runaway recovery loops
Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on {x:null}, declines on literal "null", positive control auto-completes on real values)
Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9 runScript-empty fall-through are mutually exclusive (different last-step types). Both kept, ordered Gen 10 first then Gen 9.
Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
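The isMeaningfulRunScriptOutput() checks listed in this commit message can be sketched roughly as below. This is an illustrative reimplementation, not the cherry-picked helper: the function names carry a `Sketch` suffix, and the placeholder regex is a hypothetical stand-in because hasPlaceholderPattern is not shown here.

```typescript
// Hypothetical stand-in for hasPlaceholderPattern; the real patterns are not shown.
function hasPlaceholderPatternSketch(s: string): boolean {
  return /\b(todo|placeholder|lorem ipsum)\b/i.test(s);
}

const EMPTY_LITERALS = ['', 'null', 'undefined', '""', "''", '{}', '[]'];

function isMeaningfulRunScriptOutputSketch(output: unknown): boolean {
  if (output === null || output === undefined) return false;

  let text: string;
  if (typeof output === 'string') {
    text = output.trim();
  } else {
    // JSON.stringify can yield undefined at runtime (e.g. for functions).
    const json: string | undefined = JSON.stringify(output);
    if (json === undefined) return false;
    text = json;
  }

  if (EMPTY_LITERALS.includes(text)) return false; // empty, literal "null", empty shells
  if (hasPlaceholderPatternSketch(text)) return false;

  try {
    const parsed: unknown = JSON.parse(text);
    if (parsed === null) return false;
    if (Array.isArray(parsed)) return parsed.length > 0;
    if (typeof parsed === 'object') {
      const values = Object.values(parsed as Record<string, unknown>);
      if (values.length === 0) return false; // '{ }' with inner whitespace
      if (values.some((v) => v === null)) return false; // any null = retry
    }
  } catch {
    // Not JSON: treat as plain text, which already passed the checks above.
  }
  return true;
}
```

Under this sketch the hardened auto-complete branch would gate on `isMeaningfulRunScriptOutputSketch(lastRunScriptOutput)` instead of plain truthiness, which is what lets the literal string "null" fall through to recovery.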
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
Gen 8 same-day 5-rep: 29/50 = 58%
Gen 10 5-rep: 37/50 = 74%
Delta: +8 tasks (+16 percentage points)
Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)
Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)
Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}. Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.
Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
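The cost-per-pass and Wilson-interval figures can be reproduced with a short sketch. It assumes cost per pass means mean cost per run divided by pass rate, which matches the reported $0.029 and $0.037 to rounding; the function names are illustrative, and the Wilson formula is the standard score interval at z = 1.96.

```typescript
// Assumption: cost per pass = mean cost per run / pass rate.
function costPerPass(meanCostPerRun: number, passes: number, runs: number): number {
  return meanCostPerRun / (passes / runs);
}

// Standard Wilson score interval for a binomial proportion.
function wilsonInterval(passes: number, runs: number, z = 1.96): [number, number] {
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return [center - half, center + half];
}

// Two intervals overlap when each one starts before the other ends.
function intervalsOverlap(a: [number, number], b: [number, number]): boolean {
  return a[0] <= b[1] && b[0] <= a[1];
}

const gen8CostPerPass = costPerPass(0.017, 29, 50);
const gen10CostPerPass = costPerPass(0.027, 37, 50);
console.log(gen8CostPerPass.toFixed(3), gen10CostPerPass.toFixed(3));

// A 2/5 vs 3/5 task: with only 5 reps the intervals are wide and overlap.
console.log(intervalsOverlap(wilsonInterval(2, 5), wilsonInterval(3, 5)));
```

With five reps per task the per-task intervals are very wide, which is why a one-rep swing such as 2/5 to 3/5 is reported as variance rather than a win.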
Status: DRAFT — awaiting real-web gauntlet validation
This is the capability change Gen 9 was missing. Three coordinated changes ship together:
- extractWithIndex action: pick-by-content over pick-by-selector
- … <dl>/<dt>/<dd>/<code> and the old budgetSnapshot filter dropped them as "decorative"
- … cost_cap_exceeded abort
Why this is NOT another Gen 9
Gen 9 failed because it was a mechanism change without a capability change: more turns for the same LLM picking the same wrong selector. Gen 10 is a capability change — the LLM gets new information (a numbered, text-rich element index) that it didn't have before. Browser-use's per-action loop wins on the failing tasks because of exactly this mechanism.
What ships
A — extractWithIndex
    {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
    [0] <p> data-testid="downloads"
        selector: [data-testid="downloads"]
        text: Weekly downloads: 26,543,821
C — bigger snapshot
Cost cap
Tests
981 → 999 passing (+18 net new):
Gates
Lesson from Gen 9 baked in
```
Per CLAUDE.md rule #6: NO PROMOTION until 5-rep validation shows pass-rate
gain ≥+2 AND no per-task cost regression >2x AND no death-spiral runs.
```
This PR will NOT merge until the gauntlet data is in. If 3-rep doesn't show movement, the build still has value (the cost cap is a real safety net for any future iteration mechanism, the snapshot filter fix is independently useful, and extractWithIndex is a real primitive). Even in the worst case it's a strict superset of Gen 8 behavior — extractWithIndex never fires unless the LLM emits it.
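The promotion rule quoted above can be written as an explicit gate. This is a sketch under stated assumptions: "gain ≥+2" is read as at least two more passing runs in total, "per-task cost regression >2x" compares per-task mean cost, and a death spiral is counted per run by the caller. All type and function names are hypothetical.

```typescript
interface TaskComparison {
  task: string;
  baselinePasses: number;
  candidatePasses: number;
  baselineMeanCost: number; // USD per run
  candidateMeanCost: number; // USD per run
  candidateDeathSpiralRuns: number; // assumption: counted upstream, e.g. cap-aborted runs
}

interface GateVerdict {
  promote: boolean;
  reasons: string[];
}

function promotionGate(results: TaskComparison[]): GateVerdict {
  const reasons: string[] = [];

  // Assumed reading of "pass-rate gain >= +2": total passes across tasks.
  const gain = results.reduce((sum, r) => sum + (r.candidatePasses - r.baselinePasses), 0);
  if (gain < 2) reasons.push(`pass gain ${gain} is below +2`);

  for (const r of results) {
    if (r.baselineMeanCost > 0 && r.candidateMeanCost > 2 * r.baselineMeanCost) {
      reasons.push(`${r.task}: cost regression above 2x`);
    }
    if (r.candidateDeathSpiralRuns > 0) {
      reasons.push(`${r.task}: ${r.candidateDeathSpiralRuns} death-spiral run(s)`);
    }
  }

  return { promote: reasons.length === 0, reasons };
}
```

Returning the list of reasons rather than a bare boolean makes a blocked promotion self-explanatory in the progress log.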
What 3-rep validation will reveal
The 4 tasks I expect to move:
- … content now visible in the snapshot AND extractWithIndex with `contains:'flatMap'` returns the signature directly
- … content in the snapshot
The risk: tasks where extractWithIndex returns 80 matches and the LLM picks the wrong one.
Refs