[None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion by tcherckez-nvidia · Pull Request #14361 · NVIDIA/TensorRT-LLM

tcherckez-nvidia · 2026-05-20T12:02:02Z

Summary by CodeRabbit

New Features
- Added NVFP4 quantization support for model deployment
- Introduced fused RMSNorm with quantization operators
- Added distributed all-reduce fusion with normalization and quantization
- Enhanced multi-stream utilities for complex tensor outputs

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-20T12:09:17Z

📝 Walkthrough

This PR extends TensorRT-LLM's AutoDeploy system with NVFP4 quantization fusion capabilities. It adds custom kernel wrappers for fused RMSNorm + NVFP4 quantization (in normalization and distributed contexts), a graph transformation pass to fuse compatible patterns, supporting utilities for dtype-cast and nested-tensor handling, and comprehensive test coverage. The existing FuseRMSNormQuantFP8 transform is refactored to use shared utilities.
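For readers unfamiliar with what the fused ops compute, the following is a minimal, unfused reference sketch in plain Python: RMSNorm over one row, followed by block-wise quantization onto the 4-bit E2M1 value grid. It is not the kernel from this PR. The function names are hypothetical, the block size of 16 is assumed, and the scale is kept as a Python float; the real NVFP4 path stores FP8 block scales, applies a global input scale, and packs two values per byte, none of which is modeled here.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of elements sharing one scale factor


def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def quantize_block(block):
    """Return (scale, codes) with codes snapped to the signed E2M1 grid."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        # Pick the grid magnitude closest to the scaled absolute value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def rms_norm_quant(x, weight, eps=1e-6):
    """Unfused reference: normalize, then quantize each BLOCK-sized chunk."""
    y = rms_norm(x, weight, eps)
    assert len(y) % BLOCK == 0, "hidden size must be a multiple of BLOCK"
    return [quantize_block(y[i:i + BLOCK]) for i in range(0, len(y), BLOCK)]


def dequantize(blocks):
    """Expand (scale, codes) pairs back into approximate float values."""
    return [scale * c for scale, codes in blocks for c in codes]
```

A fused kernel performs both steps in one pass so the normalized activations never round-trip through memory in higher precision, which is the motivation for fusing norm and quantization ahead of an NVFP4 GEMM.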

Changes

NVFP4 Quantization Fusion

| Layer / File(s) | Summary |
| --- | --- |
| Node and stream utility enhancements `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`, `tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py` | Added `is_dtype_cast_op`, extended `is_trivial_passthrough_user` and `collect_terminal_users_through_passthrough` with `allow_dtype_cast` flag, introduced `unwrap_input_through_passthrough` for backtracking through view/dtype-cast chains. Generalized `begin_aux_stream_passthrough` to handle nested tensor structures via new `_record_stream_for_tensor_outputs` recursive helper. |
| FuseRMSNormQuantFP8 refactoring `tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_fp8.py` | Updated transform to use `unwrap_input_through_passthrough` instead of removed view-unwrapping helpers (`_unwrap_post_norm_nodes`, `_is_view_like`); applied in grouped-consumer collection, norm derivation, and FP8 linear rewiring. |
| NVFP4 normalization custom ops `tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py` | Added `trtllm_fused_gated_rmsnorm_quant_nvfp4`, `trtllm_fused_add_rmsnorm_quant_nvfp4`, and `trtllm_fused_add_rmsnorm_out_quant_nvfp4` with dtype selection, 2D reshaping, kernel dispatch, and fake/meta implementations. Shared helper `_run_trtllm_fused_add_rmsnorm_quant_nvfp4` handles residual addition, dtype alignment, and NVFP4 output reshaping. |
| NVFP4 distributed custom ops `tensorrt_llm/_torch/auto_deploy/custom_ops/distributed/trtllm_dist.py` | Added `trtllm_fused_allreduce_residual_rmsnorm_quant_nvfp4` and `trtllm_fused_allreduce_residual_rmsnorm_out_quant_nvfp4` MPI-mode fused kernels with AllReduceParams/AllReduceFusionOp wiring and corresponding fake implementations using `fp4_utils.get_fp4_shape`. |
| NVFP4 fusion transformation `tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_nvfp4.py` | Implemented `FuseRMSNormQuantNVFP4` transform that detects NVFP4 linear patterns, groups users by scale/producer constraints, fuses allreduce/residual/add/gated RMSNorm patterns into corresponding fused quant ops (handling tuple outputs), replaces NVFP4 linear consumers with prequant GEMM nodes, reapplies post-norm view/reshape operations, and recompiles graphs. |
| Configuration `tensorrt_llm/_torch/auto_deploy/config/default.yaml`, `examples/auto_deploy/model_registry/configs/ultra_v3.yaml` | Registered `fuse_rmsnorm_quant_nvfp4` in default post_load_fusion pipeline (disabled by default, requires shape propagation); enabled in ultra_v3.yaml model config. |
| NVFP4 fusion test coverage `tests/unittest/auto_deploy/singlegpu/transformations/library/test_quant_fusion.py` | Added 12 test cases validating NVFP4 gated/add/allreduce RMSNorm fusion: CUDA correctness, basic/dtype-cast/reshape rewrites, invalid-shape skipping, mixed-consumer norm retention. Includes reference module `TinyGatedRMSNormQuantNVFP4Linear` with input-scale calibration, graph root builder, and op-count utilities. |
| Tuple-fork MoE test `tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py` | Added mock tuple-producing custom op and `MockTupleForkNemotronHMoELayer` validating that shared-expert transform and `begin_aux_stream_passthrough` correctly handle nested tensor outputs from tuple-fork patterns. |
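The nested-output handling mentioned for `begin_aux_stream_passthrough` can be illustrated with a small recursive walker. This is only a sketch of the idea, not the code in `multi_stream_utils.py`: `FakeTensor` is a hypothetical stand-in for `torch.Tensor` so the example runs without PyTorch, and `record_stream_for_outputs` is an illustrative name rather than the PR's `_record_stream_for_tensor_outputs`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; only tracks recorded streams."""
    name: str
    streams: list = field(default_factory=list)

    def record_stream(self, stream):
        self.streams.append(stream)


def record_stream_for_outputs(output, stream):
    """Call record_stream on every tensor leaf in `output`; return the count."""
    if isinstance(output, FakeTensor):
        output.record_stream(stream)
        return 1
    if isinstance(output, (tuple, list)):
        return sum(record_stream_for_outputs(o, stream) for o in output)
    if isinstance(output, dict):
        return sum(record_stream_for_outputs(o, stream) for o in output.values())
    return 0  # non-tensor leaves (None, ints, ...) are ignored
```

With a tuple-producing op, the output might look like `(hidden, [residual, None])`; the walker reaches both tensors regardless of nesting depth, whereas a helper that assumed a single tensor output would miss them.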

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13658: Adds Nemotron Ultra V3 AutoDeploy accuracy coverage via ultra_v3.yaml config, which directly overlaps with this PR's model-registry config updates enabling the new fusion transform.

Suggested labels

AutoDeploy

Suggested reviewers

MrGeva
galagam
govind-ramnarayan
nv-guomingz

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is entirely a template with no actual content provided: all sections (Description, Test Coverage, PR Checklist items) contain only placeholder comments with no substantive information. | Fill in the Description section explaining what NVFP4 RMSNorm quant fusion does and why it was added. Document which test cases validate the changes under Test Coverage. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.74%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding NVFP4 RMSNorm quantization fusion to AutoDeploy, which aligns with the extensive file changes implementing this feature. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py (1)
303-317: ⚡ Quick win

Centralize the NVFP4 fake-shape calculation.

The same sf_size math is encoded three times here, while the distributed NVFP4 fake ops already use a shared fp4_utils.get_fp4_shape(...) helper. Pulling this into one helper keeps the fake/meta contract from drifting when the packing layout changes.

Also applies to: 374-391, 413-431
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py` around
lines 303 - 317, The fake NVFP4 ops duplicate the sf_size packing math; extract
that calculation into the shared helper (e.g. fp4_utils.get_fp4_shape) and call
it from each fake/meta function instead of reimplementing the formula. Replace
the inline sf_size computation inside
_trtllm_fused_gated_rmsnorm_quant_nvfp4_fake (and the analogous functions at the
other ranges) with a call that passes the same inputs used to compute sf_size (m
and x.shape[-1] or the full tensor shape as the helper expects) and use the
helper's return for the second returned tensor shape so all NVFP4 fake-shape
logic is centralized and consistent with existing distributed NVFP4 ops.
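One way to picture the suggested centralization is a single shape helper that every fake/meta function calls. The sketch below is hypothetical: `nvfp4_fake_shapes` and `pad_up` are illustrative names, not the signature of `fp4_utils.get_fp4_shape`, and the packing rules (two FP4 values per byte, one scale per 16 elements, rows padded to 128 and scale columns padded to 4 in the swizzled layout) are stated assumptions that should be checked against the real helper.

```python
def pad_up(value, multiple):
    """Round `value` up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def nvfp4_fake_shapes(shape, sf_vec_size=16, swizzled=True):
    """Return (packed_shape, scale_numel) for an input of `shape`.

    Assumptions: two FP4 values per byte, one scale per `sf_vec_size`
    elements, and (when swizzled) rows padded to 128 and scale columns to 4.
    """
    *lead, hidden = shape
    if hidden % sf_vec_size != 0:
        raise ValueError("last dim must be a multiple of sf_vec_size")
    m = 1
    for dim in lead:
        m *= dim  # flatten all leading dims into a row count
    sf_cols = hidden // sf_vec_size
    scale_numel = pad_up(m, 128) * pad_up(sf_cols, 4) if swizzled else m * sf_cols
    return (*lead, hidden // 2), scale_numel
```

If each fake implementation derives both output shapes from one such function, a change to the packing layout only has to be made in one place.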

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 716-737: The unwrap_input_through_passthrough function only
recognizes call_function view ops via is_any_view_op, so method-style views
(call_method nodes for 'view', 'reshape', 'transpose', 'permute', 'contiguous')
stop the backward walk. Update the loop to also treat call_method view ops as
passthroughs: extend the condition to allow nodes with current.op ==
"call_method" whose current.args[0] is a Node and whose current.target (the
method name) is one of the view methods, or add a helper such as
is_call_method_view_op that performs those checks. Keep the existing
allow_dtype_cast handling, append such call_method nodes to post_nodes, and
continue walking into current.args[0] so the returned (current, post_nodes)
ordering remains from consumer back toward the producer.
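The backward walk described in this comment can be sketched on a toy graph. Everything here is illustrative: `Node` is a minimal stand-in for `torch.fx.Node` (where `op` is the node kind and `target` is the callee or method name), the target strings in `VIEW_FUNCTIONS` and `CAST_FUNCTIONS` are hypothetical, and `unwrap_through_passthrough` is not the PR's implementation.

```python
from dataclasses import dataclass

VIEW_FUNCTIONS = {"aten.view", "aten.reshape"}  # hypothetical function targets
VIEW_METHODS = {"view", "reshape", "transpose", "permute", "contiguous"}
CAST_FUNCTIONS = {"aten.to.dtype"}  # hypothetical dtype-cast target


@dataclass(eq=False)
class Node:
    """Toy stand-in for torch.fx.Node: `op` is the kind, `target` the callee."""
    op: str
    target: str
    args: tuple = ()


def is_passthrough(node, allow_dtype_cast):
    """True if `node` only reshapes or casts its first argument."""
    if not node.args or not isinstance(node.args[0], Node):
        return False
    if node.op == "call_function":
        return node.target in VIEW_FUNCTIONS or (
            allow_dtype_cast and node.target in CAST_FUNCTIONS)
    # Method-style views: op is "call_method", target is the method name.
    return node.op == "call_method" and node.target in VIEW_METHODS


def unwrap_through_passthrough(node, allow_dtype_cast=False):
    """Walk from a consumer input back to its producer, collecting the chain."""
    current, post_nodes = node, []
    while is_passthrough(current, allow_dtype_cast):
        post_nodes.append(current)
        current = current.args[0]
    return current, post_nodes
```

The returned list is ordered from the consumer side back toward the producer, so a transform that needs to reapply the views after a fused op would replay it in reverse.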

In `@tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py`:
- Around line 691-697: The test collects begin_nodes by filtering gm.graph.nodes
for op "call_function" with target begin_aux_stream_passthrough, but then
indexes begin_nodes[0] without checking that at least one node exists. Add an
explicit assertion (e.g., assert len(begin_nodes) == 1, or >= 1 if several are
allowed) before accessing begin_nodes[0] so the test fails loudly with a clear
message if the expected passthrough node is missing. Keep the following check
that begin_nodes[0].args[0].target is
torch.ops.auto_deploy.mock_tuple_fork_moe_test.default, then call
_assert_numerical_correctness(gm, model, torch.randn(...)) as before.
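The assert-before-index pattern requested here can be wrapped in a tiny helper. This is a hedged sketch with a hypothetical name (`single_node`), not code from the test file; it works on any iterable, so it would apply equally to `gm.graph.nodes`.

```python
def single_node(nodes, predicate, what):
    """Return the only node matching `predicate`, failing with a clear message."""
    matches = [n for n in nodes if predicate(n)]
    assert len(matches) == 1, f"expected exactly one {what}, found {len(matches)}"
    return matches[0]
```

Compared with indexing `matches[0]` directly, a missing node then surfaces as an assertion naming what was expected rather than as a bare `IndexError`.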



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 88999f4f-29e5-4726-ba5c-ce32009797d1

📥 Commits

Reviewing files that changed from the base of the PR and between a173761 and 85ae477.

📒 Files selected for processing (10)

examples/auto_deploy/model_registry/configs/ultra_v3.yaml
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/distributed/trtllm_dist.py
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_fp8.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_nvfp4.py
tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py
tests/unittest/auto_deploy/singlegpu/transformations/library/test_quant_fusion.py

tcherckez-nvidia · 2026-05-20T12:15:06Z

/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-05-20T12:21:42Z

PR_Github #49429 [ run ] triggered by Bot. Commit: 69c74fa Link to invocation

MrGeva

codex found some stuff worth fixing

tensorrt-cicd · 2026-05-20T15:27:09Z

PR_Github #49429 [ run ] completed with state SUCCESS. Commit: 69c74fa
/LLM/main/L0_MergeRequest_PR pipeline #39073 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>

Signed-off-by: Tal Cherckez <tcherckez@nvidia.com>

tcherckez-nvidia · 2026-05-24T09:48:48Z

/bot run

tcherckez-nvidia · 2026-05-24T09:49:47Z

/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

tensorrt-cicd · 2026-05-24T09:54:10Z

PR_Github #50093 [ run ] triggered by Bot. Commit: 302b48d Link to invocation

tensorrt-cicd · 2026-05-24T09:55:38Z

PR_Github #50094 [ run ] triggered by Bot. Commit: 302b48d Link to invocation

tensorrt-cicd · 2026-05-24T13:09:20Z

PR_Github #50094 [ run ] completed with state SUCCESS. Commit: 302b48d
/LLM/main/L0_MergeRequest_PR pipeline #39647 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

galagam

Overall looks good, please see comments.

Move view and dtype passthrough handling into shared node utilities, including method-style view ops used around RMSNorm outputs. Set packed NVFP4 output and scale metadata explicitly so the transform no longer needs its own pre-shape-prop pass. Split the fusion dispatcher into allreduce, add, and gated RMSNorm handlers and extend quant-fusion tests for passthrough traversal and metadata.

Signed-off-by: Tal Cherckez <tcherckez@nvidia.com>

tcherckez-nvidia · 2026-05-27T09:13:54Z

/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

tensorrt-cicd · 2026-05-27T09:20:37Z

PR_Github #50513 [ run ] triggered by Bot. Commit: e28e206 Link to invocation

tensorrt-cicd · 2026-05-27T11:51:39Z

PR_Github #50513 [ run ] completed with state SUCCESS. Commit: e28e206
/LLM/main/L0_MergeRequest_PR pipeline #40021 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

tcherckez-nvidia · 2026-05-27T13:06:00Z

/bot run

tensorrt-cicd · 2026-05-27T13:11:56Z

PR_Github #50540 [ run ] triggered by Bot. Commit: e28e206 Link to invocation

tensorrt-cicd · 2026-05-27T16:40:56Z

PR_Github #50540 [ run ] completed with state SUCCESS. Commit: e28e206
/LLM/main/L0_MergeRequest_PR pipeline #40045 completed with status: 'SUCCESS'

CI Report

Link to invocation

tcherckez-nvidia requested a review from a team as a code owner May 20, 2026 12:02

tcherckez-nvidia requested a review from MrGeva May 20, 2026 12:02

github-actions Bot assigned tcherckez-nvidia May 20, 2026

tcherckez-nvidia force-pushed the ad-nvfp4-rmsnorm-quant-fusion branch from 85ae477 to 18bd572 Compare May 20, 2026 12:08

coderabbitai Bot reviewed May 20, 2026

Comment thread tensorrt_llm/_torch/auto_deploy/utils/node_utils.py Outdated

Comment thread tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py

tcherckez-nvidia force-pushed the ad-nvfp4-rmsnorm-quant-fusion branch from 18bd572 to 69c74fa Compare May 20, 2026 12:13

MrGeva reviewed May 20, 2026

tcherckez-nvidia and others added 2 commits May 23, 2026 23:58

[None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion

3f5de20

Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>

Fix NVFP4 RMSNorm quant fusion

302b48d

Signed-off-by: Tal Cherckez <tcherckez@nvidia.com>

tcherckez-nvidia force-pushed the ad-nvfp4-rmsnorm-quant-fusion branch from 69c74fa to 302b48d Compare May 24, 2026 09:45

galagam approved these changes May 24, 2026

tcherckez-nvidia enabled auto-merge (squash) May 27, 2026 13:06

tcherckez-nvidia merged commit 56be625 into NVIDIA:main May 28, 2026
8 checks passed
