feat(agent): token-aware history truncation alongside message cap #488

TatsuKo-Tsukimi wants to merge 2 commits into dataelement:main
Conversation
Replace naive `conversation[-ctx_size:]` slicing with a walker that treats assistant.tool_calls and its matching role="tool" messages as one atomic block. Naive slicing can leave an orphan role="tool" at the head when the cut lands mid-pair — OpenAI rejects this with "No tool call found for function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks the tail backward, identifies blocks, and includes blocks whole or not at all. Orphan tools are silently dropped regardless of budget.

Replaces the head-only pop guard in:

- app/api/websocket.py (web chat)
- app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls, multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses the dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
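For concreteness, a minimal sketch of the block-atomic backward walk this describes (behavior inferred from the commit message above; the actual `services/history_window` implementation may differ in details):

```python
def truncate_by_message_count(messages: list[dict], limit: int) -> list[dict]:
    """Keep at most `limit` most-recent messages without ever splitting an
    assistant.tool_calls message from its matching role="tool" replies."""
    kept: list[dict] = []
    i = len(messages) - 1
    while i >= 0:
        if messages[i].get("role") == "tool":
            # Walk back over consecutive tool results to the assistant
            # message that issued the calls; that whole span is one block.
            start = i
            while start >= 0 and messages[start].get("role") == "tool":
                start -= 1
            if start < 0 or not messages[start].get("tool_calls"):
                i = start  # orphan tool results: drop silently, keep walking
                continue
            block = messages[start : i + 1]
            if len(kept) + len(block) > limit:
                break  # a block is included whole or not at all
            kept[:0] = block
            i = start - 1
        else:
            if len(kept) + 1 > limit:
                break
            kept.insert(0, messages[i])
            i -= 1
    return kept
```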
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 01aabbd906
```python
_truncated = truncate_by_token_budget(
    conversation, tok_budget, message_cap=ctx_size,
)
```
Keep latest user turn outside history token cap
This truncates `conversation` after appending the current user message, so the token budget is applied to the newest turn as well as history. When the incoming turn is large (for example a vision message containing `[image_data:...base64,...]`), `truncate_by_token_budget` can reject the first block and return `[]`, causing `call_llm_with_failover` to run without the user's prompt. That is a functional regression: the model can respond off-context or fail for oversized user inputs, even though this setting is described and implemented elsewhere as a history budget.
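If that reading is right, one way to address it is to budget the stored history only and append the live turn afterwards (a sketch using the variable names from the diff above; `current_user_message` is an invented name for illustration):

```python
# Apply the budget to history alone, then add the live turn, so an
# oversized user message is sent rather than silently truncated away.
history = truncate_by_token_budget(
    conversation, tok_budget, message_cap=ctx_size,
)
history.append(current_user_message)  # hypothetical name for the new turn
```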
Force-pushed 01aabbd to 433a359.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 433a359689
```python
ctx_size = agent.context_window_size or 100
logger.info(f"[WS] Agent: {agent_name}, type: {agent_type}, model_id: {agent.primary_model_id}, ctx: {ctx_size}")
from app.models.agent import DEFAULT_CONTEXT_WINDOW_TOKENS
tok_budget = getattr(agent, "context_window_tokens", None) or DEFAULT_CONTEXT_WINDOW_TOKENS
```
Re-read token budget for each websocket turn
`websocket_chat` reloads agent/model config on each incoming message, but `tok_budget` is only set once at connection setup, so active sockets keep using the old `context_window_tokens` value for all later turns. This makes the new setting effectively inert until reconnect (for example, lowering the budget during an incident will not constrain prompt size on existing sessions), which defeats the runtime tuning added in this change.
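A sketch of the suggested fix, assuming the handler's receive loop already re-reads the agent row each turn (the loop shape and the refresh call are illustrative, not the repo's actual code):

```python
while True:
    incoming = await websocket.receive_json()
    await db.refresh(agent)  # illustrative: re-read agent config each turn
    tok_budget = (
        getattr(agent, "context_window_tokens", None)
        or DEFAULT_CONTEXT_WINDOW_TOKENS
    )
    # ... build conversation, truncate with the fresh tok_budget, call LLM
```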
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.
Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
Changes:
- models/agent.py: + context_window_tokens (Integer, default=50000)
+ DEFAULT_CONTEXT_WINDOW_TOKENS constant
- schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
- alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
- services/history_window.py: + truncate_by_token_budget, refactored
common walker, JSON-serialized char->token estimate via existing
estimate_tokens_from_chars (chars/3 — overestimates safely)
- api/websocket.py: pass tok_budget to helper, raise DB load to
max(ctx_size, 500) so helper has room to choose
- api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
- frontend: AgentDetail Settings slider + i18n + types
10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.
Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
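For illustration, a hedged sketch of one of the token-budget tests described above (fixture values invented; the helper name and parameter order are taken from this PR's diff):

```python
from app.services.history_window import truncate_by_token_budget

def test_huge_tool_block_dropped_whole():
    # A ~50KB tool result blows the token budget even though the message
    # cap alone would keep it; the atomic pair must vanish together.
    pair = [
        {"role": "assistant", "content": None,
         "tool_calls": [{"id": "c1", "type": "function",
                         "function": {"name": "search", "arguments": "{}"}}]},
        {"role": "tool", "tool_call_id": "c1", "content": "x" * 50_000},
    ]
    tail = [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ]
    out = truncate_by_token_budget(pair + tail, 1_000, message_cap=10)
    assert out == tail  # no orphan role="tool" may survive the cut
```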
Force-pushed 433a359 to 2a0040f.
Summary
Add `agents.context_window_tokens` field (default `50000`) and a new `truncate_by_token_budget` helper that bounds in-context history by both estimated token cost (primary) and message count (safety cap), preserving `assistant.tool_calls` ↔ `role=tool` pairs intact via the same walker as `truncate_by_message_count`.

Why: message-count alone is a wildly variable proxy for token cost — one 50KB tool result eats more context than 100 short user messages. Token budget gives predictable behavior across heterogeneous traffic; message cap remains as a safety net against pathological tiny-message floods.
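A behavioral sketch of such a dual-bound truncator (the real helper shares its walker with `truncate_by_message_count`; `iter_atomic_blocks` here is an invented stand-in for that shared walker, orphan handling is omitted for brevity, and `chars // 3` mirrors the described estimator):

```python
import json

def estimate_tokens_from_chars(chars: int) -> int:
    # chars/3 deliberately overestimates English text (~4 chars/token),
    # so estimation errors land on the safe side of the budget.
    return chars // 3

def iter_atomic_blocks(messages):
    """Yield message groups so an assistant.tool_calls message and its
    trailing role="tool" replies form one indivisible block."""
    i = 0
    while i < len(messages):
        j = i + 1
        if messages[i].get("tool_calls"):
            while j < len(messages) and messages[j].get("role") == "tool":
                j += 1
        yield messages[i:j]
        i = j

def truncate_by_token_budget(messages, token_budget, message_cap=None):
    kept, spent = [], 0
    for block in reversed(list(iter_atomic_blocks(messages))):
        cost = estimate_tokens_from_chars(len(json.dumps(block, ensure_ascii=False)))
        if spent + cost > token_budget:
            break  # blocks are kept whole or not at all
        if message_cap is not None and len(kept) + len(block) > message_cap:
            break  # message count remains a safety cap
        kept[:0] = block
        spent += cost
    return kept
```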
Changes
- `models/agent.py`: + `context_window_tokens` (Integer, default=50000) + `DEFAULT_CONTEXT_WINDOW_TOKENS` constant
- `schemas/schemas.py`: AgentOut, AgentUpdate (1000 ≤ tokens ≤ 500000)
- `alembic/versions/add_context_window_tokens.py`: idempotent `ADD COLUMN IF NOT EXISTS`, default 50000 backfills existing agents
- `services/history_window.py`: + `truncate_by_token_budget`, refactored common walker, JSON-serialized char→token estimate via existing `services/token_tracker.estimate_tokens_from_chars` (chars/3 — overestimates safely)
- `api/websocket.py`: pass tok_budget to helper, raise DB load to `max(ctx_size, 500)` so helper has room to choose
- `api/feishu.py`: same pattern at 2 sites (web chat + IM channel paths)
- 10 new tests covering token-budget mode (huge-message dropped, both-bounds interaction, atomic pair preservation, orphan defense). 26/26 history_window tests pass.
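The migration shape the alembic bullet implies might look roughly like this (a sketch assuming PostgreSQL, where `ADD COLUMN IF NOT EXISTS` is available; revision identifiers are invented):

```python
from alembic import op

revision = "add_context_window_tokens"  # illustrative identifier
down_revision = None  # the real chain depends on the repo's migration history

def upgrade():
    # Idempotent: safe to re-run; the DEFAULT backfills existing agent rows.
    op.execute(
        "ALTER TABLE agents "
        "ADD COLUMN IF NOT EXISTS context_window_tokens INTEGER NOT NULL DEFAULT 50000"
    )

def downgrade():
    op.execute("ALTER TABLE agents DROP COLUMN IF EXISTS context_window_tokens")
```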
Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use DB-level message-count limit only — they don't get token awareness in this PR but won't crash. Migrating them is follow-up scope.
Stack
PR 2/4 of context-budget hygiene series.
Diff against `main` shows commits from #487 too (cumulative). Review the delta by looking at the latest commit only. Once #487 merges, this PR's diff naturally shrinks.

Test plan
- `pytest backend/tests/test_history_window.py` — 26/26 pass (16 from #487, "fix(history): atomic pair-aware truncation for tool_call blocks", + 10 new)
- `alembic upgrade head` SQL shape verified (idempotent IF NOT EXISTS)

Open questions:

- `chars/3` token estimator over-counts for English, under-counts for CJK — acceptable risk?
- Confirm `max(ctx_size, 500)` won't cause memory issues for tenants with very long conversation history.

🤖 Generated with Claude Code