Disagg KV transfer hardening (rebased onto v1.3.0rc14) #13655

Draft
yifjiang wants to merge 4 commits into NVIDIA:main from yifjiang:pr13495-on-rc14

@yifjiang
Contributor

Summary

Rebase of #13495 onto upstream main HEAD 3b7af1c21 (2026-04-29). Same 4 commits as the original PR, conflict-resolved against current main:

  1. Harden disagg transceiver request lifetime
  2. Document disagg KV transfer hardening follow-ups
  3. Harden disagg transfer cleanup paths
  4. Add NIXL transfer release cancellation hook

Conflict resolution

Two files conflicted with current main. In both cases main's version was kept because it already addresses the same concern with newer, more refined logic:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py: main already has the is_disagg_context_complete_state short-circuit, requests_finished_by_transfer stats tracking, the force_terminate_for_partial_reuse flag, and the error-budget / drain-queue fatal-error path. Took main's structure. The PR's _has_inflight_generation_transfers, _fail_closed_for_inflight_generation_transfer, and _can_terminate_request_now helpers were not folded in, because main already covers those concerns.
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp: main has the newer addSequenceBatch API and the *llmRequest dereference. Took main's version; the PR's addSequence call with its raw shared_ptr signature would no longer compile.

The runtime fix in tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py (def release(self): self.handle.release(), together with the "Failed to release NIXL transfer handle" log) is preserved.
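
For readers unfamiliar with the release hook, here is a minimal sketch of the pattern described above: a release() method that forwards to the underlying handle and logs "Failed to release NIXL transfer handle" if that call raises. Only the release() / self.handle.release() shape and the log text come from this PR. The class names FakeHandle and TransferStatus, the logger name, the boolean return value, and the choice to swallow the exception rather than re-raise it are all assumptions made for illustration, not the actual code in _agent_py.py.

```python
import logging

# Hypothetical logger name; the real module's logger is not shown in the PR.
logger = logging.getLogger("nixl_release_sketch")


class FakeHandle:
    """Hypothetical stand-in for a NIXL transfer handle."""

    def __init__(self, fail=False):
        self.fail = fail
        self.released = False

    def release(self):
        if self.fail:
            raise RuntimeError("backend rejected release")
        self.released = True


class TransferStatus:
    """Hypothetical wrapper; only release() mirrors the PR description."""

    def __init__(self, handle):
        self.handle = handle

    def release(self):
        # Forward to the handle so an in-flight transfer can be cancelled.
        # Assumption: a failure is logged and reported, not re-raised, so
        # cleanup paths can continue with the remaining requests.
        try:
            self.handle.release()
            return True
        except Exception:
            logger.exception("Failed to release NIXL transfer handle")
            return False
```

In this sketch a cleanup path could call release() on every outstanding transfer and keep going when one fails. Whether the real implementation swallows or propagates the error is not stated in the PR text.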

Validation (in progress)

  • Build TRT-LLM rc14 wheel from this branch (ARM + x86)
  • Build dynamo-trtllm container image with this wheel
  • Multihost disagg e2e (1P:1D, Qwen3-Coder-30B-A3B, dlcluster GB200)

🤖 Generated with Claude Code
