[None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion #14361
85ae477 to 18bd572
📝 Walkthrough
This PR extends TensorRT-LLM's AutoDeploy system with NVFP4 quantization fusion capabilities. It adds custom kernel wrappers for fused RMSNorm + NVFP4 quantization (in normalization and distributed contexts), a graph transformation pass to fuse compatible patterns, supporting utilities for dtype-cast and nested-tensor handling, and comprehensive test coverage. The existing FuseRMSNormQuantFP8 transform is refactored to use shared utilities.
Changes
NVFP4 Quantization Fusion
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py (1)
303-317: ⚡ Quick win: Centralize the NVFP4 fake-shape calculation.
The same `sf_size` math is encoded three times here, while the distributed NVFP4 fake ops already use a shared `fp4_utils.get_fp4_shape(...)` helper. Pulling this into one helper keeps the fake/meta contract from drifting when the packing layout changes.
Also applies to: 374-391, 413-431
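A minimal sketch of what a single shared fake-shape helper for the nitpick above could look like. The function name `nvfp4_fake_shapes`, the 16-element scale block, the two-values-per-byte packing, and the 128-row / 4-column swizzle padding are all assumptions made for illustration; the review does not state the actual formula, and the real `fp4_utils.get_fp4_shape(...)` may differ.

```python
import math
from typing import List, Sequence, Tuple


def pad_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def nvfp4_fake_shapes(
    input_shape: Sequence[int],
    sf_vec_size: int = 16,   # assumed scale-block size
    swizzled: bool = True,   # assumed padded scale-factor layout
) -> Tuple[List[int], int]:
    """Hypothetical helper: packed output shape and flat scale-factor size."""
    if not input_shape:
        raise ValueError("input_shape must have at least one dimension")
    *leading, hidden = input_shape
    if hidden % sf_vec_size != 0:
        raise ValueError("last dimension must be divisible by sf_vec_size")

    # Number of rows after flattening all leading dimensions.
    m = math.prod(leading)
    # Assumption: two 4-bit values are packed into each output byte.
    packed_shape = [*leading, hidden // 2]
    sf_cols = hidden // sf_vec_size
    if swizzled:
        # Assumption: rows padded to 128 and scale columns padded to 4.
        sf_size = pad_up(m, 128) * pad_up(sf_cols, 4)
    else:
        sf_size = m * sf_cols
    return packed_shape, sf_size
```

Each fake/meta function would then call the one helper instead of repeating the arithmetic, so a layout change touches a single place.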
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 716-737: The unwrap_input_through_passthrough function only
recognizes call_function view ops via is_any_view_op, so method-style views
(call_method nodes for 'view','reshape','transpose','permute','contiguous') stop
the backward walk; update the loop to also treat call_method view ops as
passthroughs by extending the condition to allow nodes with current.op ==
"call_method" whose current.args[0] is a Node and whose method name
(current.target) is one of the view methods (or add a helper like
is_call_method_view_op that checks current.op, current.target, and
current.args[0]), keep the existing allow_dtype_cast handling, append such
call_method nodes to post_nodes and continue walking into current.args[0] so the
returned (current, post_nodes) ordering remains from consumer back toward the
producer.
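The method-style branch of that backward walk can be sketched as follows. To stay runnable without PyTorch, the `Node` class here is a hypothetical stand-in that copies only the `op`, `target`, and `args` fields of `torch.fx.Node`; `unwrap_through_method_views` is likewise an illustrative name, and the sketch leaves out the existing `is_any_view_op` and `allow_dtype_cast` handling.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

VIEW_METHODS = {"view", "reshape", "transpose", "permute", "contiguous"}


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for torch.fx.Node (op, target, args only)."""
    op: str
    target: Any
    args: Tuple[Any, ...] = ()


def is_call_method_view_op(node: Node) -> bool:
    """True for call_method nodes whose method name is a known view method."""
    return (
        node.op == "call_method"
        and isinstance(node.target, str)
        and node.target in VIEW_METHODS
        and len(node.args) > 0
        and isinstance(node.args[0], Node)
    )


def unwrap_through_method_views(node: Node) -> Tuple[Node, List[Node]]:
    """Walk from a consumer input back to its producer through view methods.

    Returns the producer and the skipped nodes, ordered consumer to producer.
    """
    post_nodes: List[Node] = []
    current = node
    while is_call_method_view_op(current):
        post_nodes.append(current)
        current = current.args[0]  # args[0] is the tensor the method is called on
    return current, post_nodes


# Usage: producer -> .view(1, -1) -> .contiguous()
producer = Node("call_function", "rms_norm")
viewed = Node("call_method", "view", (producer, 1, -1))
contig = Node("call_method", "contiguous", (viewed,))
source, skipped = unwrap_through_method_views(contig)
```

In a real FX graph the method name lives in `node.target` and the node kind in `node.op`, which is why the check reads both fields.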
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py`:
- Around line 691-697: The test collects begin_nodes by filtering gm.graph.nodes
for op "call_function" with target begin_aux_stream_passthrough but then indexes
begin_nodes[0] without checking there is at least one node; add an explicit
assertion (e.g., assert len(begin_nodes) == 1 or >= 1) before accessing
begin_nodes[0] to fail loudly with a clear message if the expected passthrough
node is missing; keep the following check that begin_nodes[0].args[0].target is
torch.ops.auto_deploy.mock_tuple_fork_moe_test.default and then call
_assert_numerical_correctness(gm, model, torch.randn(...)) as before.
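One way to make the lookup fail loudly is a small guard that asserts on the match count before returning the element. This is only a sketch: `find_exactly_one` is a hypothetical helper, not part of the test suite, and it takes the stricter `== 1` option of the two the comment offers.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def find_exactly_one(
    items: Iterable[T], predicate: Callable[[T], bool], description: str
) -> T:
    """Return the single item matching predicate, or fail with a clear message."""
    matches = [item for item in items if predicate(item)]
    assert len(matches) == 1, (
        f"expected exactly one {description}, found {len(matches)}"
    )
    return matches[0]
```

In the test, the predicate would be the existing `call_function` / `begin_aux_stream_passthrough` filter over `gm.graph.nodes`, and the returned node replaces `begin_nodes[0]`.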
---
Nitpick comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py`:
- Around line 303-317: The fake NVFP4 ops duplicate the sf_size packing math;
extract that calculation into the shared helper (e.g. fp4_utils.get_fp4_shape)
and call it from each fake/meta function instead of reimplementing the formula.
Replace the inline sf_size computation inside
_trtllm_fused_gated_rmsnorm_quant_nvfp4_fake (and the analogous functions at the
other ranges) with a call that passes the same inputs used to compute sf_size (m
and x.shape[-1] or the full tensor shape as the helper expects) and use the
helper's return for the second returned tensor shape so all NVFP4 fake-shape
logic is centralized and consistent with existing distributed NVFP4 ops.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 88999f4f-29e5-4726-ba5c-ce32009797d1
📒 Files selected for processing (10)
- examples/auto_deploy/model_registry/configs/ultra_v3.yaml
- tensorrt_llm/_torch/auto_deploy/config/default.yaml
- tensorrt_llm/_torch/auto_deploy/custom_ops/distributed/trtllm_dist.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_fp8.py
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rmsnorm_quant_nvfp4.py
- tensorrt_llm/_torch/auto_deploy/utils/multi_stream_utils.py
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/test_multi_stream_moe.py
- tests/unittest/auto_deploy/singlegpu/transformations/library/test_quant_fusion.py
18bd572 to 69c74fa
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #49429 [ run ] triggered by Bot. Commit:
MrGeva left a comment:
codex found some stuff worth fixing
PR_Github #49429 [ run ] completed with state
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
Signed-off-by: Tal Cherckez <tcherckez@nvidia.com>
69c74fa to 302b48d
/bot run
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #50093 [ run ] triggered by Bot. Commit:
PR_Github #50094 [ run ] triggered by Bot. Commit:
PR_Github #50094 [ run ] completed with state
galagam left a comment:
Overall looks good, please see comments.
Move view and dtype passthrough handling into shared node utilities, including method-style view ops used around RMSNorm outputs. Set packed NVFP4 output and scale metadata explicitly so the transform no longer needs its own pre-shape-prop pass. Split the fusion dispatcher into allreduce, add, and gated RMSNorm handlers and extend quant-fusion tests for passthrough traversal and metadata.

Signed-off-by: Tal Cherckez <tcherckez@nvidia.com>
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #50513 [ run ] triggered by Bot. Commit:
PR_Github #50513 [ run ] completed with state
/bot run
PR_Github #50540 [ run ] triggered by Bot. Commit:
PR_Github #50540 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.