[None][chore] Refactor attention forward context #13662

Open
yuxianq wants to merge 7 commits into NVIDIA:main from yuxianq:attention-forward-context

Conversation

@yuxianq
Collaborator

@yuxianq yuxianq commented Apr 30, 2026

Summary by CodeRabbit

Refactor

  • Unified attention backend configuration across multiple attention implementations (FlashInfer, Vanilla, TRT-LLM, and sparse attention methods) through a centralized context interface.
  • Streamlined attention forward pass API by consolidating multiple configuration parameters (masks, window sizes, output buffers, and other settings) into a single context object for improved consistency and easier maintenance across the codebase.

Description

Refactor the PyTorch attention backend forward path around an explicit AttentionForwardContext. This removes the TRTLLM attention wrapper, moves the former plan/run data flow into TrtllmAttention.forward and _run, and makes the vanilla, FlashInfer, StarAttention, TRTLLM, DSA, and Rocket backends use the same context merge path.

The merge helper now returns only AttentionForwardContext and rejects unknown forward kwargs. The sparse TRTLLM hooks also accept the context object directly.
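
A minimal sketch of the intended merge behavior (illustrative only; the real dataclass in interface.py carries the full set of per-forward fields):

    # Illustrative sketch, not the exact TRT-LLM implementation. It mirrors the
    # behavior described above: the helper returns an AttentionForwardContext and
    # rejects forward kwargs it does not recognize. Field names are a small subset.
    from dataclasses import dataclass, fields
    from typing import Any, Dict, Optional

    @dataclass
    class AttentionForwardContext:
        attention_mask: Optional[Any] = None
        attention_window_size: Optional[int] = None
        out_scale: Optional[Any] = None

    def merge_attention_forward_context(
            ctx: Optional[AttentionForwardContext],
            kwargs: Dict[str, Any]) -> AttentionForwardContext:
        """Resolve the effective context from an explicit ctx plus legacy kwargs."""
        known = {f.name for f in fields(AttentionForwardContext)}
        unknown = set(kwargs) - known
        if unknown:
            raise ValueError(f"Unknown forward kwargs: {sorted(unknown)}")
        ctx = ctx if ctx is not None else AttentionForwardContext()
        for name, value in kwargs.items():
            if value is not None:
                setattr(ctx, name, value)
        return ctx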

Design Documents

Test Coverage

  • pre-commit run isort --files ...
  • pre-commit run yapf --files ...
  • pre-commit run ruff-legacy --files ...
  • python -m py_compile tensorrt_llm/_torch/attention_backend/interface.py tensorrt_llm/_torch/attention_backend/trtllm.py tensorrt_llm/_torch/attention_backend/vanilla.py tensorrt_llm/_torch/attention_backend/flashinfer.py tensorrt_llm/_torch/attention_backend/star_flashinfer.py tensorrt_llm/_torch/attention_backend/sparse/dsa.py tensorrt_llm/_torch/attention_backend/sparse/rocket.py
  • git diff --check
  • B200 validation before the final cleanup: build passed, accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=True-enable_chunked_prefill=True] passed, and accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_auto_dtype passed.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yuxianq
Collaborator Author

yuxianq commented Apr 30, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46362 [ run ] triggered by Bot. Commit: 65fbeef Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46362 [ run ] completed with state SUCCESS. Commit: 65fbeef
/LLM/main/L0_MergeRequest_PR pipeline #36448 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq yuxianq force-pushed the attention-forward-context branch from 65fbeef to 46cd18f Compare April 30, 2026 13:41
@yuxianq
Collaborator Author

yuxianq commented Apr 30, 2026

/bot run --disable-fail-fast

@yuxianq
Collaborator Author

yuxianq commented Apr 30, 2026

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@tensorrt-cicd
Collaborator

PR_Github #46376 [ run ] triggered by Bot. Commit: 46cd18f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46376 [ run ] completed with state SUCCESS. Commit: 46cd18f
/LLM/main/L0_MergeRequest_PR pipeline #36458 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46479 [ run ] triggered by Bot. Commit: e2d1938 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46479 [ run ] completed with state SUCCESS. Commit: e2d1938
/LLM/main/L0_MergeRequest_PR pipeline #36542 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 2, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46562 [ run ] triggered by Bot. Commit: e2d1938 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46562 [ run ] completed with state SUCCESS. Commit: e2d1938
/LLM/main/L0_MergeRequest_PR pipeline #36616 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 3, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46573 [ run ] triggered by Bot. Commit: e2d1938 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46573 [ run ] completed with state SUCCESS. Commit: e2d1938
/LLM/main/L0_MergeRequest_PR pipeline #36624 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 3, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #46593 [ run ] triggered by Bot. Commit: e2d1938 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46593 [ run ] completed with state SUCCESS. Commit: e2d1938
/LLM/main/L0_MergeRequest_PR pipeline #36640 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq added 2 commits May 5, 2026 01:44
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq force-pushed the attention-forward-context branch from e2d1938 to 81d33fe Compare May 5, 2026 01:44
@yuxianq
Collaborator Author

yuxianq commented May 5, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@yuxianq yuxianq marked this pull request as ready for review May 5, 2026 01:52
@yuxianq yuxianq requested a review from a team as a code owner May 5, 2026 01:52
@yuxianq yuxianq requested review from QiJune and brb-nv May 5, 2026 01:52
@yuxianq yuxianq requested a review from yihwang-nv May 5, 2026 01:53
@tensorrt-cicd
Collaborator

PR_Github #46718 [ run ] triggered by Bot. Commit: 81d33fe Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

This PR introduces AttentionForwardContext to centralize per-forward attention configuration across all attention backends, replacing scattered keyword arguments. The interface definition is exported and integrated into FlashInfer, Vanilla, StarAttention, DSA, Rocket, and TRT-LLM backends. TRT-LLM undergoes major consolidation by removing the wrapper class and moving its logic into the main attention implementation. Callers in the module layer and test files are updated to construct and pass the context object.

Changes

Attention Forward Context Refactoring

  • Interface Definition — tensorrt_llm/_torch/attention_backend/interface.py: Introduces the AttentionForwardContext dataclass (with the merge_attention_forward_context() helper) to encapsulate optional per-forward attention configuration (masks, windows, outputs, scales, RoPE/mRoPE, quantization, generation flags). Updates the AttentionBackend.forward signature to accept ctx: Optional[AttentionForwardContext] and removes the explicit attention_mask parameter. (A caller-side sketch follows this list.)
  • Module Exports — tensorrt_llm/_torch/attention_backend/__init__.py: Exports AttentionForwardContext and reformats the .interface import block.
  • Backend Implementations — tensorrt_llm/_torch/attention_backend/flashinfer.py, vanilla.py, star_flashinfer.py, sparse/dsa.py, sparse/rocket.py: Each backend now accepts ctx: Optional[AttentionForwardContext], calls merge_attention_forward_context(ctx, kwargs) to resolve the effective context, and reads configuration from context fields instead of direct parameters.
  • TRT-LLM Attention Consolidation — tensorrt_llm/_torch/attention_backend/trtllm.py: Removes the TrtllmAttentionWrapper class entirely and moves its functionality into TrtllmAttention. Refactors initialization to set MLA, RoPE, and quantization state directly. Introduces internal helpers (_run(), create_output(), _ensure_rope_table_size(), _is_nvfp4_output_kernel_available()) to replace the wrapper methods. Replaces the plan/run pattern with a single _run() execution path that validates inputs, selects the backend (trtllm_gen vs thop), manages spec-decoding/Blackwell buffers, and handles RoPE/mask/output setup. Updates forward() to derive execution parameters from AttentionForwardContext.
  • Caller Integration — tensorrt_llm/_torch/modules/attention.py: Updates Attention._attn_impl and the MLA attention forward calls to pass configuration via AttentionForwardContext(...) instead of scattered keyword arguments and **kwargs, including the CP-Helix, cached-KV, and chunked-prefill paths.
  • Tests — tests/unittest/_torch/attention/sparse/test_rocketkv.py, test_sparse_attention.py: Imports AttentionForwardContext and updates test methods to construct and pass a context when calling sparse_kv_predict() and sparse_attn_predict().
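
To make the caller-side change concrete, here is a hedged sketch of how a module-layer call site could bundle its options; only AttentionForwardContext, the field names cited in this review, and the ctx parameter come from the PR, while the surrounding argument names are assumptions:

    # Hypothetical call-site sketch; backend.forward's positional arguments are
    # assumed, not taken from the actual Attention._attn_impl implementation.
    from typing import Any, Optional

    from tensorrt_llm._torch.attention_backend import AttentionForwardContext

    def run_attention(backend: Any, q, k, v, metadata,
                      attention_mask: Optional[Any] = None,
                      attention_window_size: Optional[int] = None,
                      out_scale: Optional[Any] = None):
        """Bundle per-forward options into one context instead of loose kwargs."""
        ctx = AttentionForwardContext(
            attention_mask=attention_mask,
            attention_window_size=attention_window_size,
            out_scale=out_scale,
        )
        return backend.forward(q, k, v, metadata, ctx=ctx)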

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 34.78%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: The title accurately describes the main change: a refactoring of the attention forward context mechanism across multiple attention backend implementations.
  • Description check — ✅ Passed: The description includes the required sections (Description, Test Coverage, PR Checklist) and clearly explains the refactoring objectives, design documents, test coverage, and verification of the PR checklist items.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)

1942-1993: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Grow the RoPE table before invoking the DSA MLA append kernel.

This override now reads self.rotary_cos_sin directly, but unlike TrtllmAttention.mla_rope_append_paged_kv_assign_q() it never calls _ensure_rope_table_size(). Once max_seq_len grows past the constructor-time table size, this path can launch the kernel with an undersized cos/sin buffer.

Suggested fix
     def mla_rope_append_paged_kv_assign_q(
         self,
         q: torch.Tensor,
         latent_cache: torch.Tensor,
         metadata: DSAtrtllmAttentionMetadata,
         is_generation: bool = False,
         **kwargs,
     ) -> None:
         """Apply RoPE, append latent cache to paged KV, and assign query for MLA."""
         if is_generation:
             cached_token_indptr = metadata.gen_cached_token_indptr
             kv_indptr = metadata.gen_kv_indptr
             num_seqs = metadata.num_generations
             max_seq_len = metadata.max_gen_seq_len
             block_offsets = metadata.kv_cache_block_offsets[:, metadata.
                                                             num_contexts:]
         else:
             cached_token_indptr = metadata.ctx_cached_token_indptr
             kv_indptr = metadata.ctx_kv_indptr
             num_seqs = metadata.num_contexts
             max_seq_len = metadata.max_ctx_seq_len
             block_offsets = metadata.kv_cache_block_offsets
         assert self.is_mla_enable and self.mla_params is not None
         assert metadata.kv_cache_manager is not None
+        self._ensure_rope_table_size(metadata.kv_cache_manager.max_seq_len)
 
         sink_token_length = 0
         beam_width = 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py` around lines 1942 -
1993, The MLA RoPE path in mla_rope_append_paged_kv_assign_q reads
self.rotary_cos_sin directly and can launch the DSA kernel with an undersized
cos/sin buffer when max_seq_len has grown; before calling
torch.ops.trtllm.mla_rope_append_paged_kv_assign_q, call the existing rope table
grow helper (e.g. self._ensure_rope_table_size(max_seq_len) or the equivalent
method used by TrtllmAttention.mla_rope_append_paged_kv_assign_q) using the
computed max_seq_len so self.rotary_cos_sin is large enough, keeping the rest of
the call and parameters unchanged.
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1586-1599: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat out_scale_sf as a quantized-output request too.

When output is absent, auto-allocation only enters the quantized path if ctx.out_scale is set. The NVFP4 path consumes ctx.out_scale_sf instead, so callers that rely on auto-allocation can silently fall back to a dense output buffer and skip the NVFP4 output path.

Suggested fix
         if output is None:
             # Output is not provided.
             is_gen_only = ctx.attention_input_type == AttentionInputType.generation_only
             outputs = self.create_output(
                 q,
-                is_quantize_output=ctx.out_scale is not None,
+                is_quantize_output=(ctx.out_scale is not None
+                                    or ctx.out_scale_sf is not None),
                 metadata=metadata,
                 attention_mask=ctx.attention_mask,
                 use_paged_context_fmha=use_paged_context_fmha,
                 is_mla_enable=self.is_mla_enable,
                 is_gen_only=is_gen_only,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1586 - 1599,
The current check for auto-allocation only treats the presence of ctx.out_scale
as an indicator for quantized output, but it should also consider
ctx.out_scale_sf to correctly handle NVFP4 paths. In the code section around the
output allocation logic using self.create_output, update the condition for
is_quantize_output to check if either ctx.out_scale or ctx.out_scale_sf is not
None. This ensures the NVFP4 quantized output path is properly triggered during
auto-allocation.
tensorrt_llm/_torch/attention_backend/vanilla.py (1)

308-318: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass ctx.attention_window_size through the no-cache path.

The cached branch honors the context field via _single_request_forward() (line 429), but no_kv_cache_forward() (line 308) does not accept it. Line 397 only passes attention_mask, dropping ctx.attention_window_size. Consequently, flash_attn_varlen_func() runs with the commented-out infinite window (window_size=(-1, -1)) instead of the configured window size, causing sliding-window no-cache requests to produce full-causal outputs instead of local attention.

To fix, add attention_window_size: Optional[int] = None to no_kv_cache_forward() signature, pass it from forward() (line 397), and use it in the flash_attn_varlen_func() call (line 359) with window_size=(attention_window_size - 1, 0) when set.

Also applies to: 383–397
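
A minimal sketch of the window plumbing described here (the no_kv_cache_forward signature and the flash_attn_varlen_func call site are assumptions based on this comment):

    # Helper form of the suggested fix; maps ctx.attention_window_size to the
    # (left, right) window tuple used by flash-attn. (-1, -1) means an
    # unlimited window; (w - 1, 0) keeps a causal sliding window of w tokens.
    from typing import Optional, Tuple

    def resolve_flash_attn_window(
            attention_window_size: Optional[int]) -> Tuple[int, int]:
        if attention_window_size is None:
            return (-1, -1)
        return (attention_window_size - 1, 0)

    # Inside no_kv_cache_forward(..., attention_window_size=None), roughly:
    #   out = flash_attn_varlen_func(
    #       q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
    #       causal=True,
    #       window_size=resolve_flash_attn_window(attention_window_size))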

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/vanilla.py` around lines 308 - 318, The
no-cache path must accept and propagate the context window size: add an optional
parameter attention_window_size: Optional[int] = None to no_kv_cache_forward and
update forward to pass ctx.attention_window_size into that call (matching how
_single_request_forward uses it); then, in the flash_attn_varlen_func invocation
inside no_kv_cache_forward, replace the current default/infinite window with
window_size=(attention_window_size - 1, 0) when attention_window_size is set
(otherwise keep the existing fallback), ensuring flash_attn_varlen_func receives
the correct sliding-window size for local attention.
🧹 Nitpick comments (1)
tests/unittest/_torch/attention/sparse/test_rocketkv.py (1)

263-266: ⚡ Quick win

Add one assertion for a non-default context path.

Both new calls pass AttentionForwardContext() with all defaults, so they only verify signature plumbing. They would still pass if a non-default field were dropped during propagation, or if merge_attention_forward_context(...) regressed on its error paths. A small backend-level test that goes through TrtllmAttention.forward (or the merge helper directly) with one non-default field and one mixed ctx + legacy-kwargs rejection case would cover the risky part of this refactor much better.

As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

Also applies to: 529-531
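
A rough shape for such a test (the exception type and exact merge semantics are assumptions; adjust to the real merge_attention_forward_context contract):

    # Hypothetical test sketch; it assumes merge_attention_forward_context takes a
    # plain kwargs dict and that attention_window_size is a context field.
    import pytest

    from tensorrt_llm._torch.attention_backend.interface import (
        AttentionForwardContext, merge_attention_forward_context)

    def test_merge_preserves_non_default_field():
        ctx = AttentionForwardContext(attention_window_size=128)
        merged = merge_attention_forward_context(ctx, {})
        assert merged.attention_window_size == 128

    def test_merge_rejects_unknown_kwargs():
        with pytest.raises(Exception):  # use the helper's actual error type
            merge_attention_forward_context(AttentionForwardContext(),
                                            {"not_a_real_field": 1})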

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention/sparse/test_rocketkv.py` around lines 263 -
266, Add a test that exercises a non-default AttentionForwardContext and the
mixed-ctx/legacy kwargs rejection: construct an AttentionForwardContext with at
least one non-default field set (e.g., a non-empty torch device/flag or an
explicit timestamp/seed field used by merge_attention_forward_context), call
trtllm_attn.sparse_kv_predict (or TrtllmAttention.forward) with that ctx and
assert the propagated/merged values are preserved, then add a second case that
passes both a ctx and a legacy kwarg (to trigger merge_attention_forward_context
rejection) and assert it raises the expected error; reference
AttentionForwardContext, trtllm_attn.sparse_kv_predict, TrtllmAttention.forward,
and merge_attention_forward_context to locate where to modify tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/attention_backend/flashinfer.py`:
- Around line 756-759: Replace the assert with an explicit runtime check that
raises a ValueError when ctx.attention_mask == CustomAttentionMask.CUSTOM and
attention_mask_data is None: in the block around attention_mask_data,
ctx.attention_mask, CustomAttentionMask.CUSTOM, and attention_mask_type (set to
int(AttentionMaskType.custom_mask)), change the assert to an if-statement that
raises ValueError("attention_mask_data is required for custom attention mask.")
so validation is preserved even under -O optimization.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1132-1137: Ensure the cached RoPE max-position metadata is updated
when growing the table: in _ensure_rope_table_size (after updating
self.rope_params.max_positions and recreating
self.rotary_inv_freq/self.rotary_cos_sin via
self.rope_params.create_rope_const_params()) also assign the new max value to
self.rotary_embedding_max_positions so _run() will forward the correct, in-sync
max-position metadata; update any related code paths that rely on
rotary_embedding_max_positions to use this refreshed value.

---

Outside diff comments:
In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1942-1993: The MLA RoPE path in mla_rope_append_paged_kv_assign_q
reads self.rotary_cos_sin directly and can launch the DSA kernel with an
undersized cos/sin buffer when max_seq_len has grown; before calling
torch.ops.trtllm.mla_rope_append_paged_kv_assign_q, call the existing rope table
grow helper (e.g. self._ensure_rope_table_size(max_seq_len) or the equivalent
method used by TrtllmAttention.mla_rope_append_paged_kv_assign_q) using the
computed max_seq_len so self.rotary_cos_sin is large enough, keeping the rest of
the call and parameters unchanged.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1586-1599: The current check for auto-allocation only treats the
presence of ctx.out_scale as an indicator for quantized output, but it should
also consider ctx.out_scale_sf to correctly handle NVFP4 paths. In the code
section around the output allocation logic using self.create_output, update the
condition for is_quantize_output to check if either ctx.out_scale or
ctx.out_scale_sf is not None. This ensures the NVFP4 quantized output path is
properly triggered during auto-allocation.

In `@tensorrt_llm/_torch/attention_backend/vanilla.py`:
- Around line 308-318: The no-cache path must accept and propagate the context
window size: add an optional parameter attention_window_size: Optional[int] =
None to no_kv_cache_forward and update forward to pass ctx.attention_window_size
into that call (matching how _single_request_forward uses it); then, in the
flash_attn_varlen_func invocation inside no_kv_cache_forward, replace the
current default/infinite window with window_size=(attention_window_size - 1, 0)
when attention_window_size is set (otherwise keep the existing fallback),
ensuring flash_attn_varlen_func receives the correct sliding-window size for
local attention.

---

Nitpick comments:
In `@tests/unittest/_torch/attention/sparse/test_rocketkv.py`:
- Around line 263-266: Add a test that exercises a non-default
AttentionForwardContext and the mixed-ctx/legacy kwargs rejection: construct an
AttentionForwardContext with at least one non-default field set (e.g., a
non-empty torch device/flag or an explicit timestamp/seed field used by
merge_attention_forward_context), call trtllm_attn.sparse_kv_predict (or
TrtllmAttention.forward) with that ctx and assert the propagated/merged values
are preserved, then add a second case that passes both a ctx and a legacy kwarg
(to trigger merge_attention_forward_context rejection) and assert it raises the
expected error; reference AttentionForwardContext,
trtllm_attn.sparse_kv_predict, TrtllmAttention.forward, and
merge_attention_forward_context to locate where to modify tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9148f771-7dc2-4722-9d12-4598321aa4e0

📥 Commits

Reviewing files that changed from the base of the PR and between ad2fc22 and 81d33fe.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/attention_backend/__init__.py
  • tensorrt_llm/_torch/attention_backend/flashinfer.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/attention_backend/sparse/rocket.py
  • tensorrt_llm/_torch/attention_backend/star_flashinfer.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/vanilla.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/attention/sparse/test_rocketkv.py
  • tests/unittest/_torch/attention/sparse/test_sparse_attention.py

Comment thread tensorrt_llm/_torch/attention_backend/flashinfer.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/trtllm.py
@yuxianq yuxianq requested a review from juney-nvidia May 5, 2026 07:38
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq
Collaborator Author

yuxianq commented May 5, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

Collaborator

@yihwang-nv yihwang-nv left a comment

Thanks, LGTM!

@tensorrt-cicd
Collaborator

PR_Github #46776 [ run ] triggered by Bot. Commit: 39b591c Link to invocation

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq requested review from a team as code owners May 5, 2026 09:07
@yuxianq
Collaborator Author

yuxianq commented May 5, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

yuxianq added 2 commits May 5, 2026 09:14
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq
Collaborator Author

yuxianq commented May 5, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #46786 [ run ] triggered by Bot. Commit: d399538 Link to invocation

Comment thread tensorrt_llm/_torch/attention_backend/interface.py
Collaborator

@QiJune QiJune left a comment

LGTM

Collaborator Author

yuxianq commented May 6, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #46952 [ run ] triggered by Bot. Commit: 32d6532 Link to invocation

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq force-pushed the attention-forward-context branch from 32d6532 to 71c0dd4 Compare May 6, 2026 08:18
Collaborator Author

yuxianq commented May 6, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #46956 [ run ] triggered by Bot. Commit: 71c0dd4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46952 [ run ] completed with state ABORTED. Commit: 32d6532

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46956 [ run ] completed with state SUCCESS. Commit: 71c0dd4
/LLM/main/L0_MergeRequest_PR pipeline #36950 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 6, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

1 similar comment
@tburt-nv
Collaborator

tburt-nv commented May 6, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #47037 [ run ] triggered by Bot. Commit: 71c0dd4 Link to invocation
