Release: version packages by github-actions[bot] · Pull Request #61 · tangle-network/browser-agent-driver

github-actions · 2026-04-09T04:07:54Z

This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

Releases

@tangle-network/browser-agent-driver@0.23.0

Minor Changes

#60 a12e466 Thanks @drewstone! - Gen 10 — DOM index extraction (extractWithIndex) + bigger snapshot + content-line preservation + cost cap. +8 tasks (+16 pp) on the real-web gauntlet vs same-day Gen 8 baseline, validated at 5-rep per CLAUDE.md rules #3 and #6.

Honest 5-rep numbers (matched same-day baseline)

metric	Gen 8 same-day 5-rep	Gen 10 5-rep	Δ
pass rate	29/50 = 58%	37/50 = 74%	+8 tasks (+16 pp)
mean wall-time	9.4s	12.6s	+3.2s (+34%)
mean cost	$0.0171	$0.0272	+$0.010 (+59%)
cost per pass	$0.029	$0.037	+28%
death spirals	0	0	✓ cost cap held
peak run cost	$0.04	$0.16 (wikipedia recovery loop)	regression noted

Key wins (5-rep, same-day):

task	Gen 8	Gen 10	Δ
npm-package-downloads	0/5	5/5	+5 ⭐⭐⭐
w3c-html-spec-find-element	2/5	5/5	+3 ⭐⭐
github-pr-count	4/5	5/5	+1
stackoverflow-answer-count	2/5	3/5	+1
hn / mdn / reddit / python-docs	parity (5/5, 2/5, 5/5, 3/5)	parity	0
wikipedia / arxiv	3/5	2/5	-1 (Wilson 95% CI overlap, variance)

Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (Gen 9.1 had 3/5 at $0.25-$0.32 death spirals).

What ships

A — `extractWithIndex` action (the capability change)

New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'} returns a numbered list of every visible element matching query, each with full textContent + key attributes + a stable selector. The agent picks elements by index in the next turn.

This is the architectural fix Gen 9 was missing. Instead of asking the LLM to write a precise CSS selector for data it hasn't seen yet (the failure mode on npm/mdn/python-docs/w3c), the wide query finds candidates and the response shows actual textContent so the LLM picks by content match. Pick-by-content beats pick-by-selector on every page where the planner couldn't see the data at plan time.

Wired into:

src/types.ts — ExtractWithIndexAction type, added to Action union
src/brain/index.ts — validateAction parser, system prompt, planner prompt, data-extraction rule #25 explaining when to prefer extractWithIndex over runScript
src/drivers/extract-with-index.ts — browser-side query helper (visibility check, stable selector building, hidden-element skipping, 80-match cap)
src/drivers/playwright.ts — driver dispatch returns formatted output as data so executePlan can capture it
src/runner/runner.ts — per-action loop handler with feedback injection, executePlan capture into lastExtractOutput, plan-ends-with-extract fall-through to per-action loop with the match list as REPLAN context
src/supervisor/policy.ts — action signature for stuck-detection

C — Bigger snapshot + content-line preservation

src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/paragraph content lines (which previously got dropped as "decorative" by the interactive-only filter). These are exactly the lines that carry the data agents need on MDN/Python docs/W3C spec/arxiv pages.

Budgets raised:

Default budgetSnapshot cap: 16k → 24k chars
Decide() new-page snapshot: 16k → 24k
Planner snapshot: 12k → 24k (the planner is the most important caller for extraction tasks because it writes the runScript on the first observation)

Same-page snapshot stays at 8k (after the LLM has already seen the page).

Empirical verification: probed Playwright's locator.ariaSnapshot() output on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl> — confirmed Playwright DOES emit term/definition/code lines with text content. The bug was the filter dropping them, not the snapshot pipeline missing them.

Cost cap (mandatory safety net)

src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default 100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the top of every loop iteration (before the next LLM call) and returns success: false, reason: 'cost_cap_exceeded: ...' if exceeded.

Calibration:

Gen 8 real-web mean: ~6k tokens (well under 100k)
Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

100k = above any normal case observed, well below any death spiral. Result: zero cost cap hits in 50 runs. Reddit Gen 9.1 regression eliminated.

Cherry-picked Gen 9 helper (safe in Gen 10)

isMeaningfulRunScriptOutput() helper detects when a runScript output is too null/empty/placeholder to be a valid extraction. The original Gen 9 PR (#59) was closed because the LLM-iteration recovery loop didn't move pass rate AND introduced cost regressions. In Gen 10 the same code is safe because:

Cost cap (100k) bounds any death spiral
Per-action loop has extractWithIndex — when the deviation reason mentions "runScript returned no meaningful output", rule #25 directs the LLM to extractWithIndex instead of retrying the same wrong selector

The helper hardens the executePlan auto-complete branch (rejects "null", {x:null}, etc.) and gates a runScript-empty fall-through that points the per-action LLM at extractWithIndex.

Tests

993/993 passing (+12 net new vs Gen 8):

tests/budget-snapshot.test.ts — 6 (filter preservation, content lines, priority bucket, paragraph handling)
tests/extract-with-index.test.ts — 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful fail, stable selector, formatter, parser via Brain.parse)
tests/run-state.test.ts — 7 in 'Gen 10 cost cap' describe (default, env override, accumulator, exhaustion threshold)
tests/runner-execute-plan.test.ts — 14 new (extractWithIndex deviation with match list, cost cap exhaustion, plus 12 cherry-picked Gen 9 fall-through tests)

Gates

✅ TypeScript clean (pnpm exec tsc --noEmit)
✅ Boundaries clean (pnpm check:boundaries)
✅ Full test suite (pnpm test) — 993/993
✅ Tier1 deterministic gate PASSED
✅ 5-rep real-web gauntlet PASSED — +8 tasks vs same-day baseline
✅ Same-day matched baseline (rule #3)
✅ ≥5 reps for pass-rate claim (rule #6)
✅ Cost regression honestly noted (+28% per pass, +59% raw)

Honest assessment

What this PR is: a real architectural improvement that adds a new capability (DOM index extraction) and removes a known failure mode (recovery loop death spirals).

What it isn't: a free win. Cost is +59% raw / +28% per-pass. Wall-time is +34%. Some tasks still fail (wikipedia oracle compliance, mdn/arxiv variance).

What the data says: Gen 10 is unambiguously better than Gen 8 at the same model and same conditions. The +8 task gain is well outside Wilson 95% CI overlap. The architectural changes (extractWithIndex, bigger snapshot) deliver exactly the wins they were designed for (npm 0→5, w3c 2→5).

What Gen 10.1 should fix:

Wikipedia oracle compliance: prompt tweak to make the LLM emit {"year":1815} not '1815'
Supervisor extra-context bloat on stuck-detection turns (cap the directive size to ~5k tokens)
mdn / arxiv variance: investigate whether the contains-filter on extractWithIndex needs better prompting

github-actions bot force-pushed the changeset-release/main branch 22 times, most recently from d11ecc7 to fd7b5a3 Compare April 11, 2026 08:22

chore: version packages

f355a8c

github-actions bot force-pushed the changeset-release/main branch from fd7b5a3 to f355a8c Compare April 11, 2026 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release: version packages#61

Release: version packages#61
github-actions[bot] wants to merge 1 commit intomainfrom
changeset-release/main

github-actions bot commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Conversation

github-actions bot commented Apr 9, 2026

Releases

@tangle-network/browser-agent-driver@0.23.0

Minor Changes

Honest 5-rep numbers (matched same-day baseline)

What ships

A — extractWithIndex action (the capability change)

C — Bigger snapshot + content-line preservation

Cost cap (mandatory safety net)

Cherry-picked Gen 9 helper (safe in Gen 10)

Tests

Gates

Honest assessment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

A — `extractWithIndex` action (the capability change)