Disagg KV transfer hardening (rebased onto v1.3.0rc13) #13661

Draft
yifjiang wants to merge 4 commits into NVIDIA:main from yifjiang:pr13495-on-rc13


Summary

Rebase of #13495 onto the v1.3.0rc13 release tag (b9ce4b69d, 2026-04-26). Same 4 commits as the original PR:

  1. Harden disagg transceiver request lifetime
  2. Document disagg KV transfer hardening follow-ups
  3. Harden disagg transfer cleanup paths
  4. Add NIXL transfer release cancellation hook

This rebase pairs with ai-dynamo/dynamo#8870, which bumps dynamo's stock pin from rc11 to rc13, the first published rc tag that contains #12976's admission fix. Together they enable a dynamo image that has both the admission fix from PR #12976 (in rc13) and the NIXL release cancellation hook from PR #13495 (this branch), without needing to chase TRT-LLM main HEAD.

A separate version of this PR rebased onto current TRT-LLM main (3b7af1c21) is at #13655.

Conflict resolution

Only 1 file conflicted vs rc13: tensorrt_llm/_torch/pyexecutor/py_executor.py. Two conflict regions, both took --ours (rc13 base):

  • rc13 already includes the is_disagg_context_complete_state short-circuit, requests_finished_by_transfer stats tracking, and the force_terminate_for_partial_reuse flag (these landed independently between rc11 and rc13). The PR's pass-only handling of the same state would lose the stats tracking, so taking the rc13 side keeps strictly more functionality.
  • The error-budget / drain-queue fatal-error path on rc13 supersedes the PR's _has_inflight_generation_transfers / _fail_closed_for_inflight_generation_transfer helpers, which were not folded in.

The runtime fix in tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py (def release(self): self.handle.release(), with the Failed to release NIXL transfer handle log line) is preserved verbatim; this was verified by extracting the file from the built wheel.
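
For readers without the diff open, the shape of such a release hook can be sketched as below. This is an illustration only, not the code in _agent_py.py: the class name TransferStatusSketch, the released flag, the idempotency guard, and the choice to log rather than re-raise are all assumptions. Only the self.handle.release() call and the log message come from the description above.

```python
import logging

logger = logging.getLogger("nixl_release_sketch")  # hypothetical logger name


class TransferStatusSketch:
    """Hypothetical stand-in for the object that owns a NIXL transfer handle."""

    def __init__(self, handle):
        self.handle = handle
        self.released = False  # assumption: track release so repeat calls are no-ops

    def release(self):
        if self.released:
            return
        try:
            self.handle.release()
        except Exception:
            # Assumption: a failed release is logged instead of propagated,
            # so cleanup paths can keep going.
            logger.exception("Failed to release NIXL transfer handle")
        finally:
            # Assumption: never retry a release, even after a failure.
            self.released = True
```

In this sketch a cancellation or cleanup path can call release() unconditionally, because a second call does nothing and a failing handle does not raise into the caller.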

Net diff vs rc13: +321 / -34 across 11 files.

Validation (in progress)

  • Build TRT-LLM wheel from this branch (ARM in progress; x86 in progress)
  • Build dynamo container image with the wheel + rc13 dynamo source
  • Multihost disagg e2e (1P:1D, Qwen3-Coder-30B-A3B, dlcluster GB200)
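
Once a wheel is built, the "extract from the built wheel" check mentioned under Conflict resolution could be scripted roughly as follows. This is a minimal sketch that relies only on a wheel being a zip archive. The function name wheel_has_release_hook is hypothetical, and the module path and log string are copied from the description above.

```python
import zipfile

# Both values are taken from the PR description.
AGENT_PATH = "tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py"
LOG_LINE = "Failed to release NIXL transfer handle"


def wheel_has_release_hook(wheel_path, member=AGENT_PATH, marker=LOG_LINE):
    """Return True if the wheel ships `member` with a release() method and the log line."""
    with zipfile.ZipFile(wheel_path) as wheel:
        if member not in wheel.namelist():
            return False
        source = wheel.read(member).decode("utf-8")
    return "def release(self)" in source and marker in source
```

A call such as wheel_has_release_hook("path/to/built.whl") then gives a yes/no answer that can gate the container build step. The wheel path here is a placeholder.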

🤖 Generated with Claude Code

svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Apr 30, 2026.