perf(tools): parallelize read-only tool calls and consolidate three loops #490

TatsuKo-Tsukimi wants to merge 5 commits into dataelement:main
Conversation
Replace naive `conversation[-ctx_size:]` slicing with a walker that treats `assistant.tool_calls` and its matching `role="tool"` messages as one atomic block. Naive slicing can leave an orphan `role="tool"` at the head when the cut lands mid-pair; OpenAI rejects this with "No tool call found for function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks the tail backward, identifies blocks, and includes each block whole or not at all. Orphan tool messages are silently dropped regardless of budget.

Replaces the head-only pop guard in:
- app/api/websocket.py (web chat)
- app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (the hardcoded `[-10:]` in call_agent_llm's short-reply path is intentional).

Tests: 15, covering empty input, budget edges, parallel tool_calls, multi-pair, head/mid orphans, and a realistic 60-message invariant.

Addresses the dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
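A minimal sketch of the atomic-block walker idea described above, assuming OpenAI-style message dicts. The function name matches the helper named in the commit, but the body here is illustrative, not the PR's actual implementation:

```python
def truncate_by_message_count(messages: list[dict], max_messages: int) -> list[dict]:
    """Walk the tail backward, keeping assistant.tool_calls and their
    role="tool" replies as one atomic block; drop orphan tool messages."""
    kept: list[dict] = []
    i = len(messages) - 1
    while i >= 0 and len(kept) < max_messages:
        msg = messages[i]
        if msg.get("role") == "tool":
            # Walk back over the contiguous run of tool replies to find
            # the assistant message that issued the tool_calls.
            j = i
            while j >= 0 and messages[j].get("role") == "tool":
                j -= 1
            if j < 0 or not messages[j].get("tool_calls"):
                i = j  # orphan tool replies: drop silently, regardless of budget
                continue
            block = messages[j:i + 1]
            if len(kept) + len(block) > max_messages:
                break  # block does not fit: exclude it whole and stop
            kept[:0] = block
            i = j - 1
        else:
            kept.insert(0, msg)
            i -= 1
    return kept
```

When the budget cut would land mid-pair, the whole assistant/tool block is excluded, so the window never begins with a dangling `role="tool"` message.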
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b49ab53947
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
# Run parallel reads concurrently
parallel_outcomes: list[Any] = []
if parallel_tcs:
    parallel_outcomes = await asyncio.gather(
        *[_run_one(tc, ctx) for tc in parallel_tcs],
```
Preserve tool execution order across mixed call types
Running all TOOLS_PARALLELIZABLE calls first and only then executing the serial bucket reorders side effects relative to the model-issued tool_calls sequence. In a mixed round like write_file followed by read_file, the read can now execute before the write and return stale/missing data even though the response messages are reassembled in original order. This is a behavior regression from the previous strictly sequential loop and can break tool chains that depend on in-round ordering.
```python
        plaza_posts_made += 1
elif tool_name == "plaza_add_comment":
    if plaza_comments_made >= 2:
        tool_result = "[BLOCKED] You have already made 2 comments this heartbeat. Do not comment again."
    else:
        tool_result = await execute_tool(tool_name, args, agent_id, agent_creator_id)
        plaza_comments_made += 1
else:
    tool_result = await execute_tool(tool_name, args, agent_id, agent_creator_id)

llm_messages.append(LLMMessage(
    role="tool",
    tool_call_id=tc["id"],
    content=str(tool_result),
))
```

```python
    return "[BLOCKED] You have already made 2 comments this heartbeat. Do not comment again."
plaza_comments_made += 1
```
Increment plaza quotas only after successful execution
The heartbeat plaza hook increments plaza_posts_made/plaza_comments_made before execute_tool runs, so a failed plaza_create_post or plaza_add_comment attempt still consumes quota and blocks retries in the same heartbeat tick. Previously counters were incremented only after a successful tool call, so this change can turn transient tool failures into hard false rate-limit blocks.
Force-pushed b49ab53 to 85ac090.
💡 Codex Review
Reviewed commit: 85ac09002c
```python
outs = await asyncio.gather(
    *[_run_one(tc, ctx) for tc in pending_parallel],
    return_exceptions=True,
```
Serialize callback dispatch for parallel tool batches
execute_tool_calls now runs read-only tools via asyncio.gather, so each _run_one emits on_tool_call events in completion order rather than issued order. In the current web clients (frontend/src/pages/Chat.tsx and frontend/src/pages/AgentDetail.tsx), tool_call updates are reconciled by tool name (not call_id), so two same-name calls in one round (e.g. multiple read_file) can have their results/statuses attached to the wrong pending entry when finishes are reversed. This is a regression introduced by parallel batching and will misreport tool outputs in real multi-read turns.
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.
Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
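A minimal sketch of the dual-bound idea, assuming the chars/3 estimate described below. The pair-atomic walker shared with `truncate_by_message_count` is elided here; all names are illustrative, not the PR's actual code:

```python
import json

def estimate_tokens_from_chars(text: str) -> int:
    # chars/3 deliberately overestimates token count, erring on the safe side
    return len(text) // 3 + 1

def truncate_by_token_budget(messages: list[dict], token_budget: int,
                             max_messages: int) -> list[dict]:
    """Keep the newest messages whose JSON-serialized estimated token cost
    fits the budget, with a message-count safety cap as a second bound."""
    kept: list[dict] = []
    spent = 0
    for msg in reversed(messages):
        cost = estimate_tokens_from_chars(json.dumps(msg, ensure_ascii=False))
        if spent + cost > token_budget or len(kept) >= max_messages:
            break
        kept.insert(0, msg)
        spent += cost
    return kept
```

A single 50KB tool result costs roughly 17k estimated tokens under this scheme, so it is dropped long before a hundred short user messages would be.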
Changes:
- models/agent.py: + context_window_tokens (Integer, default=50000)
+ DEFAULT_CONTEXT_WINDOW_TOKENS constant
- schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
- alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
- services/history_window.py: + truncate_by_token_budget, refactored
common walker, JSON-serialized char->token estimate via existing
estimate_tokens_from_chars (chars/3 — overestimates safely)
- api/websocket.py: pass tok_budget to helper, raise DB load to
max(ctx_size, 500) so helper has room to choose
- api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
- frontend: AgentDetail Settings slider + i18n + types
10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.
Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tool calls returning ~50 KB of search hits or PDF extracts used to enter
the in-context history verbatim, burning tokens for content the model
samples one slice of. New `services/tool_result_truncation.py` spills
oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt`
and replaces the in-context body with a head excerpt + a marker pointing
the model at `read_file` for retrieval.
Threshold: 4000 estimated tokens (~12 KB at chars/3) — soft-start to give
the model time to learn the read_file follow-up; can tighten to ~2000
once telemetry confirms reliable use of the marker.
Smart head:
- JSON-shape responses with a known array key (results/items/data/
entries/hits/documents) keep metadata + first 5 items + a synthetic
_truncated_<key>_count field, so search/list responses stay useful.
- Otherwise plain head-cut (preferring a line boundary when one is
near the cut point).
- Vision-injected list payloads (image_data markers) pass through.
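The JSON-shape branch above can be sketched as follows. This is an illustrative reconstruction under the stated rules (known array keys, first 5 items, synthetic count field), not the PR's actual `tool_result_truncation` code:

```python
import json

ARRAY_KEYS = ("results", "items", "data", "entries", "hits", "documents")
KEEP_ITEMS = 5  # illustrative cutoff

def smart_head(payload: str) -> str:
    """For JSON objects with a known array key, keep metadata plus the
    first few items and record how many items were dropped."""
    try:
        obj = json.loads(payload)
    except (ValueError, TypeError):
        return payload  # caller would fall back to a plain head-cut
    if not isinstance(obj, dict):
        return payload
    for key in ARRAY_KEYS:
        items = obj.get(key)
        if isinstance(items, list) and len(items) > KEEP_ITEMS:
            obj[key] = items[:KEEP_ITEMS]
            obj[f"_truncated_{key}_count"] = len(items) - KEEP_ITEMS
            return json.dumps(obj, ensure_ascii=False)
    return payload
```

A 50-result search response keeps its query metadata and first five hits, and the `_truncated_results_count` field tells the model how much more is waiting in the spill file.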
Sandbox: spill files live under each agent's existing AGENT_DATA_DIR
sandbox; cross-agent leak is prevented by the existing read_file boundary.
Applied at three sites to prevent drift:
- llm/caller.py _process_tool_call (call_llm path)
- llm/caller.py call_agent_llm_with_tools inner loop
- services/heartbeat.py tool loop
agent_context.py rule 3 augmented with a paragraph teaching the model to
recognize the marker and use read_file for full content rather than
fabricating from prior knowledge.
16 new unit tests covering pass-through, spill, utf-8 round-trip,
threshold boundary, JSON-shape preservation, write-failure fallback, and
a realistic 50-result jina_search end-to-end. 41/41 total pass.
Note: 3 tool loops still exist (caller.py:432, caller.py:763,
heartbeat.py:335). Drift prevention is partial here — PR 3 consolidates
them into a single execute_tool_calls helper.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oops

Add `services/tool_loop.py` as single source of truth for tool execution. `execute_tool_calls` runs whitelisted reads concurrently via asyncio.gather; everything else serializes in original order. Tool result messages come back in tool_calls[] order regardless of completion order (provider-strict).

Replaces three drifted loop copies:
- llm/caller.py:_process_tool_call (call_llm)
- llm/caller.py inner loop in call_agent_llm_with_tools
- services/heartbeat.py tool loop

heartbeat.py:_TOOLS_REQUIRING_ARGS was a strict superset of caller.py's (included web_search/jina_search/jina_read); now unified.

TOOLS_PARALLELIZABLE is conservative: only stateless reads (read_file, list_files, search_jina, web_fetch, etc.). A2A tools, write/delete, external messaging, trigger lifecycle, plaza writes, and execute_code stay serial, preserving relationship checks, dedup, and max_fires invariants.

Heartbeat plaza rate limiter (1 post / 2 comments per tick) becomes a pre_execute_hook closure. Plaza tools are non-parallelizable, so counters remain race-free.

14 new tests: static A2A-safety isdisjoint, reverse-latency order preservation, parallel wall-clock concurrency, mixed parallel+serial, exception isolation, empty-args guard, pre-execute hook short-circuit, on_tool_call events, callback exception robustness. 55/55 pass.

Performance: a 5-read research turn now completes in max(latencies) instead of sum(latencies), typically a 3-5x speedup on multi-tool rounds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
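The split-then-reassemble shape of `execute_tool_calls` can be sketched as below. The whitelist contents and the `run_one` callback signature are illustrative assumptions; the point is that results are indexed by original `tool_calls[]` position regardless of completion order:

```python
import asyncio

# Illustrative subset; the real whitelist is larger and lives in tool_loop
TOOLS_PARALLELIZABLE = frozenset({"read_file", "list_files", "web_fetch"})

async def execute_tool_calls(tool_calls: list[dict], run_one) -> list:
    """Run whitelisted reads concurrently, everything else serially in
    issued order; reassemble results in original tool_calls[] order."""
    results: list = [None] * len(tool_calls)
    parallel = [(i, tc) for i, tc in enumerate(tool_calls)
                if tc["name"] in TOOLS_PARALLELIZABLE]
    serial = [(i, tc) for i, tc in enumerate(tool_calls)
              if tc["name"] not in TOOLS_PARALLELIZABLE]
    if parallel:
        # gather preserves argument order in its result list, so each
        # outcome lands back at its original index even if a slow read
        # finishes last
        outs = await asyncio.gather(
            *[run_one(tc) for _, tc in parallel], return_exceptions=True
        )
        for (i, _), out in zip(parallel, outs):
            results[i] = out
    for i, tc in serial:
        results[i] = await run_one(tc)
    return results
```

Note that, as the Codex review above points out, batching all parallel reads before the serial bucket can still reorder side effects relative to the model-issued sequence in mixed rounds; only the response message order is preserved.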
Force-pushed 85ac090 to 3ebdaab.
💡 Codex Review
Reviewed commit: 3ebdaab0a7
```python
if ws_path is not None:
    from app.services.tool_result_truncation import maybe_truncate_tool_result
    tool_content = maybe_truncate_tool_result(
        tool_content, call_id=tc["id"], agent_workspace=ws_path,
    )
```
Bypass truncation when reading spill files
This unconditionally re-applies maybe_truncate_tool_result to all tool outputs, including read_file responses. When the model follows the truncation marker and calls read_file("_tool_results/<call_id>.txt"), large files are truncated again here and replaced with a new marker, so the promised “full output” is never actually returned in-context. This is especially problematic for single-line blobs (e.g., minified JSON), where read_file’s line-based paging cannot target subsections, so repeated reads can never recover the missing content.
The spill-marker design assumes the model can recover oversized tool output by calling read_file on the spill path. But read_file's own output was being passed back through maybe_truncate_tool_result, so a 60KB spill would be re-truncated, the spill file overwritten with its own marker, and the model could never get the original content back: an infinite read loop with no progress.

The bypass list (read_file, read_document) lives next to TOOLS_PARALLELIZABLE in tool_loop. Both tools already support offset/limit pagination, so an oversized retrieval is the model's responsibility to page, not ours to silently truncate.

Adds three regression tests:
- read_file output passes through unchanged, no spill written
- read_document: same
- jina_search (control) still spills and returns the marker

Reported by Codex auto-review on dataelement#490.
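The bypass check itself is tiny; a sketch under the assumption that the truncation step is injected as a callable (the real call sites pass `call_id` and the workspace path instead):

```python
# Tools whose output must never be re-truncated: they page via
# offset/limit, so recovering a spill file stays possible
TRUNCATION_BYPASS = frozenset({"read_file", "read_document"})

def apply_truncation(tool_name: str, content: str, truncate) -> str:
    """Skip the spill-and-truncate step for paginating retrieval tools."""
    if tool_name in TRUNCATION_BYPASS:
        return content
    return truncate(content)
```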
💡 Codex Review
Reviewed commit: 1979c02190
```python
"status": "done",
"result": result,
"reasoning_content": ctx.full_reasoning_content,
```
Emit truncated tool payload in done callback
This callback publishes the raw result even after tool_content may have been replaced with a [truncated ... _tool_results/<call_id>.txt] marker. In our websocket flow, the done payload is what gets persisted to chat history, so oversized tool outputs are later replayed as a clipped fragment without the spill-file pointer, which diverges from what the model saw in-turn and prevents later rounds from recovering the full saved output.
```python
spill_root.mkdir(parents=True, exist_ok=True)
full_path.write_text(tool_content, encoding="utf-8")
```
Add retention policy for spilled tool-result files
Each oversized tool response is written to a new _tool_results/<call_id>.txt file, and call IDs are per-invocation, so this path grows monotonically. There is no corresponding cleanup in this change, which means long-lived agents with frequent large tool outputs can steadily consume workspace disk until writes/read-modify tools start failing due to exhausted storage.
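One possible shape for the retention policy this review asks for; an age bound plus a file-count cap. This is purely illustrative (the function, thresholds, and call site are not in the PR):

```python
import time
from pathlib import Path

def sweep_spill_files(spill_root: Path, max_age_s: float = 7 * 86400,
                      max_files: int = 500) -> int:
    """Delete spill files older than max_age_s, then trim the oldest
    files beyond max_files. Returns the number of files removed."""
    files = sorted(spill_root.glob("*.txt"), key=lambda p: p.stat().st_mtime)
    removed = 0
    cutoff = time.time() - max_age_s
    for p in files:
        if p.stat().st_mtime < cutoff:
            p.unlink(missing_ok=True)
            removed += 1
    survivors = [p for p in files if p.exists()]
    # files is mtime-sorted, so the front of survivors is the oldest
    for p in survivors[:max(0, len(survivors) - max_files)]:
        p.unlink(missing_ok=True)
        removed += 1
    return removed
```

Running such a sweep at spill time (or on a heartbeat tick) would bound the `_tool_results/` directory without touching the rest of the agent workspace.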
Summary

Add `services/tool_loop.py` as single source of truth for tool execution. `execute_tool_calls` runs whitelisted reads concurrently via `asyncio.gather`; everything else serializes in original order. Tool result messages come back in `tool_calls[]` order regardless of completion order (provider-strict).

Replaces three drifted loop copies:
- `llm/caller.py:_process_tool_call` (call_llm)
- `llm/caller.py` inner loop in `call_agent_llm_with_tools`
- `services/heartbeat.py` tool loop

`heartbeat.py:_TOOLS_REQUIRING_ARGS` was a strict superset of `caller.py`'s (included `web_search`/`jina_search`/`jina_read`); now unified.

Safety: A2A serial preserved

`TOOLS_PARALLELIZABLE` is conservative: only stateless reads (`read_file`, `list_files`, `read_document`, `search_jina`/`jina_search`/`jina_read`, `web_fetch`/`web_search`, `extract_pdf`, `list_triggers`).

A2A tools (`send_message_to_agent`, `send_file_to_agent`), filesystem writes (`write_file`/`delete_file`/`edit_file`), external messaging, trigger lifecycle, plaza writes, and `execute_code` all stay serial, preserving relationship checks, dedup invariants, and max_fires guarantees. A static `isdisjoint` test enforces this.

Heartbeat plaza rate limiter (1 post / 2 comments per tick) becomes a `pre_execute_hook` closure. Plaza tools are non-parallelizable, so counters remain race-free.

Performance

A 5-read research turn now completes in `max(latencies)` instead of `sum(latencies)`, typically a 3-5x speedup on multi-tool rounds.

Tests

14 new tests.

Stack

PR 4/4 of the context-budget hygiene series. This is the last PR in the series. The cumulative diff against `main` shows all 4 PRs' commits; review only the latest commit's delta until #487-#489 merge.

Test plan

- `pytest backend/tests/test_tool_loop.py`: 14/14 pass
- Confirm the `TOOLS_PARALLELIZABLE` whitelist is conservative enough: any read-shaped tool with hidden side effects we've missed?
- `on_tool_call` may fire from concurrent coroutines; verify the frontend handles out-of-order events keyed by `call_id`

Lines net: caller.py -130, heartbeat.py -29. New helper file (~250 lines) + tests (~280 lines). Despite adding the helper, this removes three drifted copies.
🤖 Generated with Claude Code