
perf(tools): parallelize read-only tool calls and consolidate three loops#490

Open
TatsuKo-Tsukimi wants to merge 5 commits into dataelement:main from TatsuKo-Tsukimi:perf/parallel-tool-execution

Conversation

@TatsuKo-Tsukimi

Summary

Add `services/tool_loop.py` as single source of truth for tool execution. `execute_tool_calls` runs whitelisted reads concurrently via `asyncio.gather`; everything else serializes in original order. Tool result messages come back in `tool_calls[]` order regardless of completion order (provider-strict).
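A minimal sketch of that split, assuming the OpenAI tool-call message shape; `_run_one` and `ctx` follow the diff excerpts below, while the whitelist contents and dispatcher body here are illustrative stand-ins, not the merged code:

```python
import asyncio
from typing import Any

# Illustrative subset; the full whitelist is listed in the Safety section.
TOOLS_PARALLELIZABLE = {"read_file", "list_files", "web_fetch"}

async def _run_one(tc: dict, ctx: Any) -> str:
    """Stand-in for the real dispatcher that calls execute_tool."""
    return f"result of {tc['function']['name']}"

async def execute_tool_calls(tool_calls: list[dict], ctx: Any = None) -> list[dict]:
    parallel = [tc for tc in tool_calls if tc["function"]["name"] in TOOLS_PARALLELIZABLE]
    serial = [tc for tc in tool_calls if tc["function"]["name"] not in TOOLS_PARALLELIZABLE]

    # Whitelisted reads overlap; return_exceptions keeps one failure
    # from cancelling its siblings.
    results = await asyncio.gather(
        *[_run_one(tc, ctx) for tc in parallel], return_exceptions=True
    )
    outcomes = {tc["id"]: r for tc, r in zip(parallel, results)}

    # Everything else runs strictly in the order the model issued it.
    for tc in serial:
        outcomes[tc["id"]] = await _run_one(tc, ctx)

    # Reassemble role="tool" messages in tool_calls[] order (provider-strict).
    return [
        {"role": "tool", "tool_call_id": tc["id"], "content": str(outcomes[tc["id"]])}
        for tc in tool_calls
    ]
```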

Replaces three drifted loop copies:

  • llm/caller.py:_process_tool_call (call_llm)
  • llm/caller.py inner loop in call_agent_llm_with_tools
  • services/heartbeat.py tool loop

heartbeat.py:_TOOLS_REQUIRING_ARGS was a strict superset of caller.py's (included web_search/jina_search/jina_read) — now unified.

Safety: A2A serial preserved

TOOLS_PARALLELIZABLE is conservative — only stateless reads (read_file, list_files, read_document, search_jina/jina_search/jina_read, web_fetch/web_search, extract_pdf, list_triggers).

A2A tools (send_message_to_agent, send_file_to_agent), filesystem writes (write_file/delete_file/edit_file), external messaging, trigger lifecycle, plaza writes, execute_code all stay serial — preserves relationship checks, dedup invariants, max_fires guarantees. Static isdisjoint test enforces this.
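A minimal sketch of that static guard; the serial set below is a subset for illustration, and the real test lives in backend/tests/test_tool_loop.py:

```python
TOOLS_PARALLELIZABLE = {
    "read_file", "list_files", "read_document", "search_jina",
    "jina_search", "jina_read", "web_fetch", "web_search",
    "extract_pdf", "list_triggers",
}
# Tools that must stay serial (subset shown for illustration).
TOOLS_SERIAL_ONLY = {
    "send_message_to_agent", "send_file_to_agent",  # A2A relationship checks
    "write_file", "delete_file", "edit_file",       # filesystem writes
    "execute_code",                                 # arbitrary side effects
}

def test_a2a_and_writes_never_parallelized():
    # Fails at CI time if a side-effecting tool sneaks into the whitelist.
    assert TOOLS_PARALLELIZABLE.isdisjoint(TOOLS_SERIAL_ONLY)
```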

Heartbeat plaza rate limiter (1 post / 2 comments per tick) becomes a pre_execute_hook closure. Plaza tools are non-parallelizable so counters remain race-free.
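A sketch of the closure shape, assuming a hook that returns a blocking string to short-circuit the call and None to proceed (matching the short-circuit test listed below); the post-limit message is an assumption:

```python
from typing import Optional

def make_plaza_rate_limit_hook():
    plaza_posts_made = 0
    plaza_comments_made = 0

    async def pre_execute_hook(tool_name: str, args: dict) -> Optional[str]:
        nonlocal plaza_posts_made, plaza_comments_made
        if tool_name == "plaza_create_post":
            if plaza_posts_made >= 1:
                return "[BLOCKED] You have already posted this heartbeat."
            plaza_posts_made += 1
        elif tool_name == "plaza_add_comment":
            if plaza_comments_made >= 2:
                return ("[BLOCKED] You have already made 2 comments this "
                        "heartbeat. Do not comment again.")
            plaza_comments_made += 1
        return None  # not rate-limited: let the tool run

    return pre_execute_hook
```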

Performance

A 5-read research turn now completes in max(latencies) instead of sum(latencies) — typically 3-5x speedup on multi-tool rounds.
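(Illustration: five reads at ~800 ms each take ~4 s serially but finish in ~0.8 s when overlapped, bounded by the slowest call.)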

Tests

14 new tests:

  • Static A2A safety isdisjoint check
  • Reverse-latency order preservation (tools complete in reverse but output order matches input)
  • Parallel wall-clock concurrency (5 reads × 80ms ≤ 200ms; sketched after this list)
  • Mixed parallel + serial
  • Exception isolation (one tool raises, siblings unaffected)
  • Empty-args guard
  • Pre-execute hook short-circuit (heartbeat plaza rate limiter)
  • on_tool_call running+done events
  • Callback exception robustness
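A sketch of the wall-clock test referenced above, assuming pytest-asyncio; the fake tool and plumbing are stand-ins for the repo's fixtures:

```python
import asyncio
import time

import pytest

@pytest.mark.asyncio
async def test_parallel_reads_overlap():
    async def fake_read(_tc: object, _ctx: object) -> str:
        await asyncio.sleep(0.08)  # each read costs 80 ms of latency
        return "ok"

    start = time.monotonic()
    await asyncio.gather(*[fake_read(None, None) for _ in range(5)])
    elapsed = time.monotonic() - start

    # Serial execution would need ~0.4 s; overlap keeps it near one latency.
    assert elapsed <= 0.2
```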

Stack

PR 4/4 of context-budget hygiene series.

This is the last PR in the series. Cumulative diff against main shows all 4 PRs' commits; review only the latest commit's delta until #487-#489 merge.

Test plan

  • pytest backend/tests/test_tool_loop.py — 14/14 pass
  • All hygiene unit tests — 56/56 pass (16 + 26 + 16 + 14 across 4 PRs)
  • Reviewer: confirm TOOLS_PARALLELIZABLE whitelist is conservative enough — any read-shaped tool with hidden side effects we've missed?
  • Reviewer: callback re-entrancy — on_tool_call may fire from concurrent coroutines, verify frontend handles out-of-order events keyed by call_id
  • Reviewer: heartbeat plaza counter increments before execute (instead of after) — slightly more conservative; acceptable?

Net lines: caller.py -130, heartbeat.py -29. New helper file (~250 lines) + tests (~280 lines). Despite adding the helper, the change removes three drifted loop copies.

🤖 Generated with Claude Code

Replace naive `conversation[-ctx_size:]` slicing with a walker that treats
assistant.tool_calls and its matching role="tool" messages as one atomic
block. Naive slicing can leave an orphan role="tool" at the head when the
cut lands mid-pair — OpenAI rejects this with "No tool call found for
function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks tail
backward, identifies blocks, and includes blocks whole or not at all.
Orphan tools are silently dropped regardless of budget.
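A minimal sketch of that backward walk, assuming dict-shaped messages with OpenAI field names; the real helper may differ in detail:

```python
def truncate_by_message_count(messages: list[dict], budget: int) -> list[dict]:
    kept: list[dict] = []
    i = len(messages) - 1
    while i >= 0 and len(kept) < budget:
        if messages[i].get("role") != "tool":
            kept.insert(0, messages[i])
            i -= 1
            continue
        # Walk back over the contiguous tool-result run to the assistant
        # message that issued it; the whole block is atomic.
        j = i
        while j >= 0 and messages[j].get("role") == "tool":
            j -= 1
        if j < 0 or not messages[j].get("tool_calls"):
            i = j  # orphan tool results: drop them regardless of budget
            continue
        block = messages[j:i + 1]
        if len(kept) + len(block) > budget:
            break  # block doesn't fit whole: stop rather than split the pair
        kept[:0] = block
        i = j - 1
    return kept
```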

Replaces head-only pop guard in:
  - app/api/websocket.py (web chat)
  - app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply
path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls,
multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b49ab53947


Comment thread: backend/app/services/tool_loop.py (outdated)
Comment on lines +137 to +141

```python
    # Run parallel reads concurrently
    parallel_outcomes: list[Any] = []
    if parallel_tcs:
        parallel_outcomes = await asyncio.gather(
            *[_run_one(tc, ctx) for tc in parallel_tcs],
```

P1: Preserve tool execution order across mixed call types

Running all TOOLS_PARALLELIZABLE calls first and only then executing the serial bucket reorders side effects relative to the model-issued tool_calls sequence. In a mixed round like write_file followed by read_file, the read can now execute before the write and return stale/missing data even though the response messages are reassembled in original order. This is a behavior regression from the previous strictly sequential loop and can break tool chains that depend on in-round ordering.


Comment on lines +335 to +339

```python
            plaza_posts_made += 1
        elif tool_name == "plaza_add_comment":
            if plaza_comments_made >= 2:
                return "[BLOCKED] You have already made 2 comments this heartbeat. Do not comment again."
            plaza_comments_made += 1
```

P2: Increment plaza quotas only after successful execution

The heartbeat plaza hook increments plaza_posts_made/plaza_comments_made before execute_tool runs, so a failed plaza_create_post or plaza_add_comment attempt still consumes quota and blocks retries in the same heartbeat tick. Previously counters were incremented only after a successful tool call, so this change can turn transient tool failures into hard false rate-limit blocks.


@TatsuKo-Tsukimi force-pushed the perf/parallel-tool-execution branch from b49ab53 to 85ac090 on April 27, 2026 08:47

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85ac09002c


Comment on lines +140 to +142

```python
    outs = await asyncio.gather(
        *[_run_one(tc, ctx) for tc in pending_parallel],
        return_exceptions=True,
```

P2: Serialize callback dispatch for parallel tool batches

execute_tool_calls now runs read-only tools via asyncio.gather, so each _run_one emits on_tool_call events in completion order rather than issued order. In the current web clients (frontend/src/pages/Chat.tsx and frontend/src/pages/AgentDetail.tsx), tool_call updates are reconciled by tool name (not call_id), so two same-name calls in one round (e.g. multiple read_file) can have their results/statuses attached to the wrong pending entry when finishes are reversed. This is a regression introduced by parallel batching and will misreport tool outputs in real multi-read turns.


TatsuKo-Tsukimi and others added 3 commits April 27, 2026 17:51
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.

Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
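A sketch of the dual-bound idea under those assumptions; for brevity it walks single messages, whereas the real helper reuses the pair-preserving block walker, and the defaults here just mirror the numbers in this commit message:

```python
import json

def estimate_tokens_from_chars(chars: int) -> int:
    return chars // 3  # chars/3: deliberately overestimates token cost

def truncate_by_token_budget(messages: list[dict],
                             token_budget: int = 50_000,
                             max_messages: int = 500) -> list[dict]:
    kept: list[dict] = []
    spent = 0
    # Walk the tail backward; token budget is primary, message count a cap.
    for msg in reversed(messages):
        cost = estimate_tokens_from_chars(len(json.dumps(msg, ensure_ascii=False)))
        if spent + cost > token_budget or len(kept) >= max_messages:
            break
        kept.insert(0, msg)
        spent += cost
    return kept
```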

Changes:
  - models/agent.py: + context_window_tokens (Integer, default=50000)
    + DEFAULT_CONTEXT_WINDOW_TOKENS constant
  - schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
  - alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
  - services/history_window.py: + truncate_by_token_budget, refactored
    common walker, JSON-serialized char->token estimate via existing
    estimate_tokens_from_chars (chars/3 — overestimates safely)
  - api/websocket.py: pass tok_budget to helper, raise DB load to
    max(ctx_size, 500) so helper has room to choose
  - api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
  - frontend: AgentDetail Settings slider + i18n + types

10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.

Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tool calls returning ~50 KB of search hits or PDF extracts used to enter
the in-context history verbatim, burning tokens for content the model
samples one slice of. New `services/tool_result_truncation.py` spills
oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt`
and replaces the in-context body with a head excerpt + a marker pointing
the model at `read_file` for retrieval.
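A sketch of that spill-and-marker flow; the threshold math, paths, and marker wording here are illustrative, following the description in this commit message:

```python
from pathlib import Path

SPILL_THRESHOLD_TOKENS = 4000  # ~12 KB at chars/3

def maybe_truncate_tool_result(tool_content: str, call_id: str,
                               agent_workspace: Path) -> str:
    if len(tool_content) // 3 <= SPILL_THRESHOLD_TOKENS:
        return tool_content  # small enough: pass through verbatim
    spill_root = agent_workspace / "_tool_results"
    spill_root.mkdir(parents=True, exist_ok=True)
    spill_path = spill_root / f"{call_id}.txt"
    try:
        spill_path.write_text(tool_content, encoding="utf-8")
    except OSError:
        return tool_content[:12_000]  # write failed: fall back to a plain head-cut
    head = tool_content[:2_000]
    return (f"{head}\n[truncated: full output saved to "
            f"_tool_results/{call_id}.txt; use read_file to retrieve it]")
```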

Threshold: 4000 estimated tokens (~12 KB at chars/3) — soft-start to give
the model time to learn the read_file follow-up; can tighten to ~2000
once telemetry confirms reliable use of the marker.

Smart head:
  - JSON-shape responses with a known array key (results/items/data/
    entries/hits/documents) keep metadata + first 5 items + a synthetic
    _truncated_<key>_count field, so search/list responses stay useful
    (sketched after this list).
  - Otherwise plain head-cut (preferring a line boundary when one is
    near the cut point).
  - Vision-injected list payloads (image_data markers) pass through.
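A sketch of the JSON-shape head referenced in the list above; `json_smart_head` is a hypothetical name, and the caller is assumed to fall back to a plain head-cut when it returns None:

```python
import json
from typing import Optional

ARRAY_KEYS = ("results", "items", "data", "entries", "hits", "documents")

def json_smart_head(tool_content: str, keep: int = 5) -> Optional[str]:
    try:
        payload = json.loads(tool_content)
    except json.JSONDecodeError:
        return None  # not JSON-shaped: caller uses a plain head-cut
    if not isinstance(payload, dict):
        return None
    for key in ARRAY_KEYS:
        if isinstance(payload.get(key), list):
            dropped = len(payload[key]) - keep
            if dropped <= 0:
                return None  # already small enough to keep whole
            # Keep metadata plus the first few items, and record the cut.
            payload[key] = payload[key][:keep]
            payload[f"_truncated_{key}_count"] = dropped
            return json.dumps(payload, ensure_ascii=False)
    return None
```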

Sandbox: spill files live under each agent's existing AGENT_DATA_DIR
sandbox; cross-agent leak is prevented by the existing read_file boundary.

Applied at three sites to prevent drift:
  - llm/caller.py _process_tool_call (call_llm path)
  - llm/caller.py call_agent_llm_with_tools inner loop
  - services/heartbeat.py tool loop

agent_context.py rule 3 augmented with a paragraph teaching the model to
recognize the marker and use read_file for full content rather than
fabricating from prior knowledge.

16 new unit tests covering pass-through, spill, utf-8 round-trip,
threshold boundary, JSON-shape preservation, write-failure fallback, and
a realistic 50-result jina_search end-to-end. 41/41 total pass.

Note: 3 tool loops still exist (caller.py:432, caller.py:763,
heartbeat.py:335). Drift prevention is partial here — PR 3 consolidates
them into a single execute_tool_calls helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
perf(tools): parallelize read-only tool calls and consolidate three loops

Add `services/tool_loop.py` as single source of truth for tool execution.
`execute_tool_calls` runs whitelisted reads concurrently via asyncio.gather;
everything else serializes in original order. Tool result messages come
back in tool_calls[] order regardless of completion order (provider-strict).

Replaces three drifted loop copies:
  - llm/caller.py:_process_tool_call (call_llm)
  - llm/caller.py inner loop in call_agent_llm_with_tools
  - services/heartbeat.py tool loop

heartbeat.py:_TOOLS_REQUIRING_ARGS was a strict superset of caller.py's
(included web_search/jina_search/jina_read) — now unified.

TOOLS_PARALLELIZABLE is conservative — only stateless reads (read_file,
list_files, search_jina, web_fetch, etc.). A2A tools, write/delete,
external messaging, trigger lifecycle, plaza writes, execute_code stay
serial, preserving relationship checks, dedup, max_fires invariants.

Heartbeat plaza rate limiter (1 post / 2 comments per tick) becomes a
pre_execute_hook closure. Plaza tools are non-parallelizable so counters
remain race-free.

14 new tests: static A2A safety isdisjoint, reverse-latency order
preservation, parallel wall-clock concurrency, mixed parallel+serial,
exception isolation, empty-args guard, pre-execute hook short-circuit,
on_tool_call events, callback exception robustness. 55/55 pass.

Performance: a 5-read research turn now completes in max(latencies)
instead of sum(latencies) — typically 3-5x speedup on multi-tool rounds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TatsuKo-Tsukimi force-pushed the perf/parallel-tool-execution branch from 85ac090 to 3ebdaab on April 27, 2026 09:51

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ebdaab0a7


Comment thread: backend/app/services/tool_loop.py (outdated)
Comment on lines +268 to +272

```python
    if ws_path is not None:
        from app.services.tool_result_truncation import maybe_truncate_tool_result
        tool_content = maybe_truncate_tool_result(
            tool_content, call_id=tc["id"], agent_workspace=ws_path,
        )
```

P2: Bypass truncation when reading spill files

This unconditionally re-applies maybe_truncate_tool_result to all tool outputs, including read_file responses. When the model follows the truncation marker and calls read_file("_tool_results/<call_id>.txt"), large files are truncated again here and replaced with a new marker, so the promised “full output” is never actually returned in-context. This is especially problematic for single-line blobs (e.g., minified JSON), where read_file’s line-based paging cannot target subsections, so repeated reads can never recover the missing content.


The spill-marker design assumes the model can recover oversized tool
output by calling read_file on the spill path. But read_file's own
output was being passed back through maybe_truncate_tool_result, so a
60KB spill would be re-truncated, the spill file overwritten with its
own marker, and the model could never get the original content back —
infinite read loop with no progress.

Bypass list (read_file, read_document) lives next to TOOLS_PARALLELIZABLE
in tool_loop. Both already support offset/limit pagination, so an
oversized retrieval is the model's responsibility to page, not ours to
silently truncate.
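A sketch of the bypass shape; `TRUNCATION_BYPASS_TOOLS` and `finalize_tool_content` are illustrative names, while the import matches the module quoted in the review thread:

```python
from pathlib import Path

from app.services.tool_result_truncation import maybe_truncate_tool_result

TRUNCATION_BYPASS_TOOLS = {"read_file", "read_document"}

def finalize_tool_content(tool_name: str, tool_content: str,
                          call_id: str, agent_workspace: Path) -> str:
    # Paginated readers must return spill files verbatim; otherwise the
    # model re-truncates its own spill and loops forever without progress.
    if tool_name in TRUNCATION_BYPASS_TOOLS:
        return tool_content
    return maybe_truncate_tool_result(tool_content, call_id=call_id,
                                      agent_workspace=agent_workspace)
```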

Adds three regression tests:
- read_file output passes through unchanged, no spill written
- read_document same
- jina_search (control) still spills and returns the marker

Reported by Codex auto-review on dataelement#490.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1979c02190


Comment on lines +299 to +301

```python
    "status": "done",
    "result": result,
    "reasoning_content": ctx.full_reasoning_content,
```

P2: Emit truncated tool payload in done callback

This callback publishes the raw result even after tool_content may have been replaced with a [truncated ... _tool_results/<call_id>.txt] marker. In our websocket flow, the done payload is what gets persisted to chat history, so oversized tool outputs are later replayed as a clipped fragment without the spill-file pointer, which diverges from what the model saw in-turn and prevents later rounds from recovering the full saved output.


Comment on lines +134 to +135

```python
    spill_root.mkdir(parents=True, exist_ok=True)
    full_path.write_text(tool_content, encoding="utf-8")
```

P2: Add retention policy for spilled tool-result files

Each oversized tool response is written to a new _tool_results/<call_id>.txt file, and call IDs are per-invocation, so this path grows monotonically. There is no corresponding cleanup in this change, which means long-lived agents with frequent large tool outputs can steadily consume workspace disk until writes/read-modify tools start failing due to exhausted storage.

