
feat(agent): token-aware history truncation alongside message cap#488

Open
TatsuKo-Tsukimi wants to merge 2 commits into dataelement:main from TatsuKo-Tsukimi:feat/context-window-tokens

Conversation

@TatsuKo-Tsukimi

Summary

Add agents.context_window_tokens field (default 50000) and a new truncate_by_token_budget helper that bounds in-context history by both estimated token cost (primary) and message count (safety cap), preserving assistant.tool_calls ↔ role=tool pairs intact via the same walker as truncate_by_message_count.

Why: message-count alone is a wildly variable proxy for token cost — one 50KB tool result eats more context than 100 short user messages. Token budget gives predictable behavior across heterogeneous traffic; message cap remains as a safety net against pathological tiny-message floods.
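
A minimal sketch of the intended behavior (illustrative only; the real helper lives in services/history_window.py and the estimator in services/token_tracker; names and exact signatures here are assumptions):

import json

def estimate_tokens_from_chars(n_chars: int) -> int:
    # chars/3 heuristic per the PR: deliberately overestimates English text
    return n_chars // 3 + 1

def truncate_by_token_budget(messages, token_budget, message_cap):
    """Walk the tail backward, keeping whole atomic blocks (an assistant
    message carrying tool_calls plus its role="tool" results) until either
    the token budget or the message cap would be exceeded."""
    kept, spent = [], 0
    i = len(messages)
    while i > 0:
        j = i
        while j > 0 and messages[j - 1].get("role") == "tool":
            j -= 1  # collect the run of tool results at the tail of a block
        if j < i:
            if j == 0 or not messages[j - 1].get("tool_calls"):
                i = j  # orphan tool results: silently dropped
                continue
            j -= 1  # include the assistant.tool_calls message in the block
        else:
            j = i - 1  # plain single-message block
        block = messages[j:i]
        cost = estimate_tokens_from_chars(len(json.dumps(block, ensure_ascii=False)))
        if spent + cost > token_budget or len(kept) + len(block) > message_cap:
            break  # blocks are kept whole or not at all
        kept[:0] = block
        spent += cost
        i = j
    return kept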

Changes

  • models/agent.py: + context_window_tokens (Integer, default=50000) + DEFAULT_CONTEXT_WINDOW_TOKENS constant
  • schemas/schemas.py: AgentOut, AgentUpdate (1000 ≤ tokens ≤ 500000)
  • alembic/versions/add_context_window_tokens.py: idempotent ADD COLUMN IF NOT EXISTS, default 50000 backfills existing agents (see the migration sketch after this list)
  • services/history_window.py: + truncate_by_token_budget, refactored common walker, JSON-serialized char→token estimate via existing services/token_tracker.estimate_tokens_from_chars (chars/3 — overestimates safely)
  • api/websocket.py: pass tok_budget to helper, raise DB load to max(ctx_size, 500) so helper has room to choose
  • api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
  • frontend: AgentDetail Settings slider + i18n + types
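
The migration bullet above could look roughly like this (a sketch, not the shipped file: revision IDs are placeholders and ADD COLUMN IF NOT EXISTS assumes PostgreSQL):

from alembic import op

revision = "add_context_window_tokens"  # placeholder id
down_revision = None  # placeholder; chains to the real prior revision

def upgrade() -> None:
    # IF NOT EXISTS keeps re-runs idempotent; the server default
    # backfills existing agent rows with 50000.
    op.execute(
        "ALTER TABLE agents "
        "ADD COLUMN IF NOT EXISTS context_window_tokens INTEGER "
        "NOT NULL DEFAULT 50000"
    )

def downgrade() -> None:
    op.execute("ALTER TABLE agents DROP COLUMN IF EXISTS context_window_tokens")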

10 new tests covering token-budget mode (huge-message dropped, both-bounds interaction, atomic pair preservation, orphan defense). 26/26 history_window tests pass.

Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use DB-level message-count limit only — they don't get token awareness in this PR but won't crash. Migrating them is follow-up scope.

Stack

PR 2/4 of context-budget hygiene series.

Diff against main shows commits from #487 too (cumulative). Review the delta by looking at the latest commit only. Once #487 merges, this PR's diff naturally shrinks.

Test plan

  • pytest backend/tests/test_history_window.py — 26/26 pass (16 from #487 "fix(history): atomic pair-aware truncation for tool_call blocks" + 10 new)
  • alembic upgrade head SQL shape verified (idempotent IF NOT EXISTS)
  • Reviewer: default 50000 tokens — too aggressive / too conservative? (covers ~10-15K system prompt + ~35K conversation on 128K context models)
  • Reviewer: chars/3 token estimator over-counts for English, under-counts for CJK — acceptable risk? (see the sketch after this list)
  • Reviewer: confirm raising DB-level limit to max(ctx_size, 500) won't cause memory issues for tenants with very long conversation history
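
To make the chars/3 question concrete, a toy demonstration (the real estimator is services/token_tracker.estimate_tokens_from_chars; the implementation below is an assumed stand-in):

def estimate_tokens_from_chars(n_chars: int) -> int:
    return n_chars // 3 + 1  # assumed shape of the existing helper

english = "The quick brown fox jumps over the lazy dog."  # ~11 BPE tokens
cjk = "敏捷的棕色狐狸跳过了懒狗。"  # CJK often tokenizes near 1 token/char

print(estimate_tokens_from_chars(len(english)))  # 15: over-counts English
print(estimate_tokens_from_chars(len(cjk)))  # 5: under-counts CJK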

🤖 Generated with Claude Code

Replace naive `conversation[-ctx_size:]` slicing with a walker that treats
assistant.tool_calls and its matching role="tool" messages as one atomic
block. Naive slicing can leave an orphan role="tool" at the head when the
cut lands mid-pair — OpenAI rejects this with "No tool call found for
function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks tail
backward, identifies blocks, and includes blocks whole or not at all.
Orphan tools are silently dropped regardless of budget.

Replaces head-only pop guard in:
  - app/api/websocket.py (web chat)
  - app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply
path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls,
multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
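
At the two call sites the commit lists, the swap would look something like this (hypothetical sketch; the import path follows the repo layout named above):

# before: history = conversation[-ctx_size:]  # can strand a role="tool" head
from app.services.history_window import truncate_by_message_count

history = truncate_by_message_count(conversation, ctx_size)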

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01aabbd906


Comment on lines +677 to +679
_truncated = truncate_by_token_budget(
    conversation, tok_budget, message_cap=ctx_size,
)


P1: Keep latest user turn outside history token cap

This truncates conversation after appending the current user message, so the token budget is applied to the newest turn as well as history. When the incoming turn is large (for example a vision message containing [image_data:...base64,...]), truncate_by_token_budget can reject the first block and return [], causing call_llm_with_failover to run without the user’s prompt. That is a functional regression: the model can respond off-context or fail for oversized user inputs, even though this setting is described and implemented elsewhere as a history budget.
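
One way to address this, sketched under the reviewer's reading (variable names are illustrative):

# Truncate stored history first, then append the incoming turn, so the
# newest user message can never be dropped by the budget walk.
history = truncate_by_token_budget(conversation, tok_budget, message_cap=ctx_size)
messages = history + [current_user_message]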


@TatsuKo-Tsukimi force-pushed the feat/context-window-tokens branch from 01aabbd to 433a359 on April 27, 2026 at 08:44

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 433a359689


ctx_size = agent.context_window_size or 100
logger.info(f"[WS] Agent: {agent_name}, type: {agent_type}, model_id: {agent.primary_model_id}, ctx: {ctx_size}")
from app.models.agent import DEFAULT_CONTEXT_WINDOW_TOKENS
tok_budget = getattr(agent, "context_window_tokens", None) or DEFAULT_CONTEXT_WINDOW_TOKENS


P2: Re-read token budget for each websocket turn

websocket_chat reloads agent/model config on each incoming message, but tok_budget is only set once at connection setup, so active sockets keep using the old context_window_tokens value for all later turns. This makes the new setting effectively inert until reconnect (for example, lowering the budget during an incident will not constrain prompt size on existing sessions), which defeats the runtime tuning added in this change.
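
A sketch of the suggested fix, assuming the per-turn reload loop the reviewer describes (load_agent and the loop shape are illustrative):

while True:
    data = await websocket.receive_json()
    agent = load_agent(db, agent_id)  # config is already reloaded per turn
    # resolve the budget here, once per turn, instead of at connect time
    tok_budget = getattr(agent, "context_window_tokens", None) \
        or DEFAULT_CONTEXT_WINDOW_TOKENS
    ...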


Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.

Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.

Changes:
  - models/agent.py: + context_window_tokens (Integer, default=50000)
    + DEFAULT_CONTEXT_WINDOW_TOKENS constant
  - schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
  - alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
  - services/history_window.py: + truncate_by_token_budget, refactored
    common walker, JSON-serialized char->token estimate via existing
    estimate_tokens_from_chars (chars/3 — overestimates safely)
  - api/websocket.py: pass tok_budget to helper, raise DB load to
    max(ctx_size, 500) so helper has room to choose
  - api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
  - frontend: AgentDetail Settings slider + i18n + types

10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.

Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TatsuKo-Tsukimi force-pushed the feat/context-window-tokens branch from 433a359 to 2a0040f on April 27, 2026 at 09:51
