[None][chore] Refactor attention forward context #13662
yuxianq wants to merge 7 commits into NVIDIA:main from
Conversation
|
/bot run --disable-fail-fast |
|
PR_Github #46362 [ run ] triggered by Bot. Commit: |
|
PR_Github #46362 [ run ] completed with state
|
65fbeef to
46cd18f
Compare
|
/bot run --disable-fail-fast |
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
Details
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
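For example (subcommand usage inferred from the help above; only the run form appears verbatim in this thread):
/bot run --disable-fail-fast
/bot run --disable-fail-fast --add-multi-gpu-test
/bot kill
/bot reuse-pipeline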
|
PR_Github #46376 [ run ] triggered by Bot. Commit: |
|
PR_Github #46376 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46479 [ run ] triggered by Bot. Commit: |
|
PR_Github #46479 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46562 [ run ] triggered by Bot. Commit: |
|
PR_Github #46562 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46573 [ run ] triggered by Bot. Commit: |
|
PR_Github #46573 [ run ] completed with state |
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46593 [ run ] triggered by Bot. Commit: |
|
PR_Github #46593 [ run ] completed with state
|
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
e2d1938 to
81d33fe
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46718 [ run ] triggered by Bot. Commit: |
📝 Walkthrough
This PR introduces an AttentionForwardContext-based refactoring of attention forwarding.
Changes
Attention Forward Context Refactoring
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
1942-1993: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Grow the RoPE table before invoking the DSA MLA append kernel.
This override now reads self.rotary_cos_sin directly, but unlike TrtllmAttention.mla_rope_append_paged_kv_assign_q() it never calls _ensure_rope_table_size(). Once max_seq_len grows past the constructor-time table size, this path can launch the kernel with an undersized cos/sin buffer.
Suggested fix

     def mla_rope_append_paged_kv_assign_q(
         self,
         q: torch.Tensor,
         latent_cache: torch.Tensor,
         metadata: DSAtrtllmAttentionMetadata,
         is_generation: bool = False,
         **kwargs,
     ) -> None:
         """Apply RoPE, append latent cache to paged KV, and assign query for MLA."""
         if is_generation:
             cached_token_indptr = metadata.gen_cached_token_indptr
             kv_indptr = metadata.gen_kv_indptr
             num_seqs = metadata.num_generations
             max_seq_len = metadata.max_gen_seq_len
             block_offsets = metadata.kv_cache_block_offsets[:, metadata.num_contexts:]
         else:
             cached_token_indptr = metadata.ctx_cached_token_indptr
             kv_indptr = metadata.ctx_kv_indptr
             num_seqs = metadata.num_contexts
             max_seq_len = metadata.max_ctx_seq_len
             block_offsets = metadata.kv_cache_block_offsets

         assert self.is_mla_enable and self.mla_params is not None
         assert metadata.kv_cache_manager is not None
+        self._ensure_rope_table_size(metadata.kv_cache_manager.max_seq_len)

         sink_token_length = 0
         beam_width = 1

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In tensorrt_llm/_torch/attention_backend/sparse/dsa.py around lines 1942-1993: the MLA RoPE path in mla_rope_append_paged_kv_assign_q reads self.rotary_cos_sin directly and can launch the DSA kernel with an undersized cos/sin buffer when max_seq_len has grown; before calling torch.ops.trtllm.mla_rope_append_paged_kv_assign_q, call the existing rope table grow helper (e.g. self._ensure_rope_table_size(max_seq_len) or the equivalent method used by TrtllmAttention.mla_rope_append_paged_kv_assign_q) using the computed max_seq_len so self.rotary_cos_sin is large enough, keeping the rest of the call and parameters unchanged.

tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1586-1599: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Treat out_scale_sf as a quantized-output request too.
When output is absent, auto-allocation only enters the quantized path if ctx.out_scale is set. The NVFP4 path consumes ctx.out_scale_sf instead, so callers that rely on auto-allocation can silently fall back to a dense output buffer and skip the NVFP4 output path.
Suggested fix

     if output is None:
         # Output is not provided.
         is_gen_only = ctx.attention_input_type == AttentionInputType.generation_only
         outputs = self.create_output(
             q,
-            is_quantize_output=ctx.out_scale is not None,
+            is_quantize_output=(ctx.out_scale is not None
+                                or ctx.out_scale_sf is not None),
             metadata=metadata,
             attention_mask=ctx.attention_mask,
             use_paged_context_fmha=use_paged_context_fmha,
             is_mla_enable=self.is_mla_enable,
             is_gen_only=is_gen_only,

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In tensorrt_llm/_torch/attention_backend/trtllm.py around lines 1586-1599: the current check for auto-allocation only treats the presence of ctx.out_scale as an indicator for quantized output, but it should also consider ctx.out_scale_sf to correctly handle NVFP4 paths. In the code section around the output allocation logic using self.create_output, update the condition for is_quantize_output to check if either ctx.out_scale or ctx.out_scale_sf is not None. This ensures the NVFP4 quantized output path is properly triggered during auto-allocation.

tensorrt_llm/_torch/attention_backend/vanilla.py (1)
308-318: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Pass ctx.attention_window_size through the no-cache path.
The cached branch honors the context field via _single_request_forward() (line 429), but no_kv_cache_forward() (line 308) does not accept it. Line 397 only passes attention_mask, dropping ctx.attention_window_size. Consequently, flash_attn_varlen_func() runs with the commented-out infinite window (window_size=(-1, -1)) instead of the configured window size, causing sliding-window no-cache requests to produce full-causal outputs instead of local attention.
To fix, add attention_window_size: Optional[int] = None to the no_kv_cache_forward() signature, pass it from forward() (line 397), and use it in the flash_attn_varlen_func() call (line 359) with window_size=(attention_window_size - 1, 0) when set.
Also applies to: 383–397
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In tensorrt_llm/_torch/attention_backend/vanilla.py around lines 308-318: the no-cache path must accept and propagate the context window size: add an optional parameter attention_window_size: Optional[int] = None to no_kv_cache_forward and update forward to pass ctx.attention_window_size into that call (matching how _single_request_forward uses it); then, in the flash_attn_varlen_func invocation inside no_kv_cache_forward, replace the current default/infinite window with window_size=(attention_window_size - 1, 0) when attention_window_size is set (otherwise keep the existing fallback), ensuring flash_attn_varlen_func receives the correct sliding-window size for local attention.
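As a companion to the comment above, a minimal sketch of the suggested change, not the backend's actual code: the real no_kv_cache_forward() takes more parameters, and the flash_attn import is an assumption about what the vanilla backend already depends on.

```python
# Minimal sketch of the suggested fix; the parameter list is reduced for
# clarity and the flash_attn dependency is assumed, not confirmed by this PR.
from typing import Optional

import torch
from flash_attn import flash_attn_varlen_func


def no_kv_cache_forward(q: torch.Tensor,
                        k: torch.Tensor,
                        v: torch.Tensor,
                        cu_seqlens: torch.Tensor,
                        max_seqlen: int,
                        attention_window_size: Optional[int] = None) -> torch.Tensor:
    # Restrict attention to the last `attention_window_size` tokens when a
    # sliding window is configured; otherwise keep the infinite window.
    window_size = ((attention_window_size - 1, 0)
                   if attention_window_size is not None else (-1, -1))
    return flash_attn_varlen_func(q, k, v,
                                  cu_seqlens_q=cu_seqlens,
                                  cu_seqlens_k=cu_seqlens,
                                  max_seqlen_q=max_seqlen,
                                  max_seqlen_k=max_seqlen,
                                  causal=True,
                                  window_size=window_size)
```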
🧹 Nitpick comments (1)
tests/unittest/_torch/attention/sparse/test_rocketkv.py (1)
263-266: ⚡ Quick win
Add one assertion for a non-default context path.
Both new calls pass AttentionForwardContext() with all defaults, so they only verify signature plumbing. They would still pass if a non-default field were dropped during propagation, or if merge_attention_forward_context(...) regressed on its error paths. A small backend-level test that goes through TrtllmAttention.forward (or the merge helper directly) with one non-default field and one mixed ctx + legacy-kwargs rejection case would cover the risky part of this refactor much better.
As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."
Also applies to: 529-531
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/attention/sparse/test_rocketkv.py` around lines 263 - 266, Add a test that exercises a non-default AttentionForwardContext and the mixed-ctx/legacy kwargs rejection: construct an AttentionForwardContext with at least one non-default field set (e.g., a non-empty torch device/flag or an explicit timestamp/seed field used by merge_attention_forward_context), call trtllm_attn.sparse_kv_predict (or TrtllmAttention.forward) with that ctx and assert the propagated/merged values are preserved, then add a second case that passes both a ctx and a legacy kwarg (to trigger merge_attention_forward_context rejection) and assert it raises the expected error; reference AttentionForwardContext, trtllm_attn.sparse_kv_predict, TrtllmAttention.forward, and merge_attention_forward_context to locate where to modify tests.
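A rough sketch of the kind of test the nitpick above asks for; the import path, the AttentionForwardContext field used, and the exception type raised for unknown kwargs are assumptions, not the PR's actual API.

```python
# Hypothetical test sketch; import path, field names, and the exception type
# for unknown kwargs are assumptions based on the review discussion above.
import pytest

from tensorrt_llm._torch.attention_backend.interface import (
    AttentionForwardContext, merge_attention_forward_context)


def test_merge_preserves_non_default_field():
    ctx = AttentionForwardContext(attention_window_size=128)
    merged = merge_attention_forward_context(ctx)
    assert merged.attention_window_size == 128


def test_merge_rejects_unknown_legacy_kwarg():
    with pytest.raises((TypeError, ValueError)):
        merge_attention_forward_context(AttentionForwardContext(),
                                        not_a_real_forward_kwarg=1)
```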
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/attention_backend/flashinfer.py`:
- Around line 756-759: Replace the assert with an explicit runtime check that
raises a ValueError when ctx.attention_mask == CustomAttentionMask.CUSTOM and
attention_mask_data is None: in the block around attention_mask_data,
ctx.attention_mask, CustomAttentionMask.CUSTOM, and attention_mask_type (set to
int(AttentionMaskType.custom_mask)), change the assert to an if-statement that
raises ValueError("attention_mask_data is required for custom attention mask.")
so validation is preserved even under -O optimization.
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1132-1137: Ensure the cached RoPE max-position metadata is updated
when growing the table: in _ensure_rope_table_size (after updating
self.rope_params.max_positions and recreating
self.rotary_inv_freq/self.rotary_cos_sin via
self.rope_params.create_rope_const_params()) also assign the new max value to
self.rotary_embedding_max_positions so _run() will forward the correct, in-sync
max-position metadata; update any related code paths that rely on
rotary_embedding_max_positions to use this refreshed value.
---
Outside diff comments:
In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1942-1993: The MLA RoPE path in mla_rope_append_paged_kv_assign_q
reads self.rotary_cos_sin directly and can launch the DSA kernel with an
undersized cos/sin buffer when max_seq_len has grown; before calling
torch.ops.trtllm.mla_rope_append_paged_kv_assign_q, call the existing rope table
grow helper (e.g. self._ensure_rope_table_size(max_seq_len) or the equivalent
method used by TrtllmAttention.mla_rope_append_paged_kv_assign_q) using the
computed max_seq_len so self.rotary_cos_sin is large enough, keeping the rest of
the call and parameters unchanged.
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1586-1599: The current check for auto-allocation only treats the
presence of ctx.out_scale as an indicator for quantized output, but it should
also consider ctx.out_scale_sf to correctly handle NVFP4 paths. In the code
section around the output allocation logic using self.create_output, update the
condition for is_quantize_output to check if either ctx.out_scale or
ctx.out_scale_sf is not None. This ensures the NVFP4 quantized output path is
properly triggered during auto-allocation.
In `@tensorrt_llm/_torch/attention_backend/vanilla.py`:
- Around line 308-318: The no-cache path must accept and propagate the context
window size: add an optional parameter attention_window_size: Optional[int] =
None to no_kv_cache_forward and update forward to pass ctx.attention_window_size
into that call (matching how _single_request_forward uses it); then, in the
flash_attn_varlen_func invocation inside no_kv_cache_forward, replace the
current default/infinite window with window_size=(attention_window_size - 1, 0)
when attention_window_size is set (otherwise keep the existing fallback),
ensuring flash_attn_varlen_func receives the correct sliding-window size for
local attention.
---
Nitpick comments:
In `@tests/unittest/_torch/attention/sparse/test_rocketkv.py`:
- Around line 263-266: Add a test that exercises a non-default
AttentionForwardContext and the mixed-ctx/legacy kwargs rejection: construct an
AttentionForwardContext with at least one non-default field set (e.g., a
non-empty torch device/flag or an explicit timestamp/seed field used by
merge_attention_forward_context), call trtllm_attn.sparse_kv_predict (or
TrtllmAttention.forward) with that ctx and assert the propagated/merged values
are preserved, then add a second case that passes both a ctx and a legacy kwarg
(to trigger merge_attention_forward_context rejection) and assert it raises the
expected error; reference AttentionForwardContext,
trtllm_attn.sparse_kv_predict, TrtllmAttention.forward, and
merge_attention_forward_context to locate where to modify tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9148f771-7dc2-4722-9d12-4598321aa4e0
📒 Files selected for processing (11)
tensorrt_llm/_torch/attention_backend/__init__.py
tensorrt_llm/_torch/attention_backend/flashinfer.py
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
tensorrt_llm/_torch/attention_backend/sparse/rocket.py
tensorrt_llm/_torch/attention_backend/star_flashinfer.py
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/attention_backend/vanilla.py
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/attention/sparse/test_rocketkv.py
tests/unittest/_torch/attention/sparse/test_sparse_attention.py
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46776 [ run ] triggered by Bot. Commit: |
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
|
/bot run --disable-fail-fast --add-multi-gpu-test |
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46786 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46952 [ run ] triggered by Bot. Commit: |
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
32d6532 to
71c0dd4
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #46956 [ run ] triggered by Bot. Commit: |
|
PR_Github #46952 [ run ] completed with state |
|
PR_Github #46956 [ run ] completed with state
|
|
/bot run --disable-fail-fast --add-multi-gpu-test |
1 similar comment
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #47037 [ run ] triggered by Bot. Commit: |
Summary by CodeRabbit
Refactor
Description
Refactor PyTorch attention backend forwarding around an explicit AttentionForwardContext. This removes the TRTLLM attention wrapper, moves the former plan/run data flow into TrtllmAttention.forward and _run, and makes vanilla, FlashInfer, StarAttention, TRTLLM, DSA, and Rocket use the same context merge path.
The merge helper now returns only AttentionForwardContext and rejects unknown forward kwargs. The sparse TRTLLM hooks also accept the context object directly.
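For readers outside the attention backends, a minimal sketch of the intended shape of the context and merge path; the field list and helper signature below are illustrative (drawn from the review comments above), not the exact definitions in interface.py.

```python
# Illustrative sketch only; the real AttentionForwardContext in this PR may
# have a different field list, and the merge helper a different signature.
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class AttentionForwardContext:
    attention_mask: Any = None
    attention_window_size: Optional[int] = None
    out_scale: Any = None
    out_scale_sf: Any = None


def merge_attention_forward_context(
        ctx: Optional[AttentionForwardContext] = None,
        **kwargs: Any) -> AttentionForwardContext:
    """Fold legacy forward kwargs into a single context; reject unknown ones."""
    ctx = ctx if ctx is not None else AttentionForwardContext()
    valid = {f.name for f in fields(AttentionForwardContext)}
    unknown = set(kwargs) - valid
    if unknown:
        raise ValueError(f"Unknown attention forward kwargs: {sorted(unknown)}")
    for name, value in kwargs.items():
        if value is not None:
            setattr(ctx, name, value)
    return ctx
```

Under this shape, every backend forward (vanilla, FlashInfer, StarAttention, TRTLLM, DSA, Rocket) consumes the merged ctx instead of a grab-bag of keyword arguments, and stale call sites fail loudly rather than having their kwargs silently dropped.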
Design Documents
Test Coverage
pre-commit run isort --files ...
pre-commit run yapf --files ...
pre-commit run ruff-legacy --files ...
python -m py_compile tensorrt_llm/_torch/attention_backend/interface.py tensorrt_llm/_torch/attention_backend/trtllm.py tensorrt_llm/_torch/attention_backend/vanilla.py tensorrt_llm/_torch/attention_backend/flashinfer.py tensorrt_llm/_torch/attention_backend/star_flashinfer.py tensorrt_llm/_torch/attention_backend/sparse/dsa.py tensorrt_llm/_torch/attention_backend/sparse/rocket.py
git diff --check
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=True-enable_chunked_prefill=True] passed, and accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype passed.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.