[None][perf] Improve TRTLLM MoE autotune in DEP #13667
rosenrodt wants to merge 1 commit into NVIDIA:main
Conversation
/bot run

PR_Github #46394 [ run ] triggered by Bot. Commit:

PR_Github #46394 [ run ] completed with state
Improve TRTLLM MoE autotune in DEP by localizing the expert distribution to local experts during autotune.

use_dp=True — mimics load-balanced DEP (Data + Expert Parallel + Attention DP). The system globally holds `ep_size * runtime_max_tokens_per_rank` tokens, which A2A perfectly distributes across the ep_size ranks, so each rank's autotune dummy carries `runtime_max_tokens_per_rank` rows whose top_k slots all live in the local expert shard `[local_expert_offset, local_expert_offset + local_num_experts)`. m_local per local expert = `num_tokens * top_k / local_num_experts`.

use_dp=False — pure EP (no Attention DP). The system never holds `ep_size * runtime_max_tokens_per_rank` tokens; each rank sees its own `runtime_max_tokens_per_rank` rows whose top_k slots span the global experts, so only ~1/ep_size of the slots hit local experts. m_local per local expert = `num_tokens * top_k / num_experts`.

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
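A minimal sketch of the two dummy-routing regimes described above. The function names and signatures here are assumptions for illustration, not the PR's actual helpers (the real entry points appear in the walkthrough below):

```python
import torch

def make_balanced_dummy_topk(num_tokens: int, top_k: int, local_num_experts: int,
                             local_expert_offset: int,
                             device: str = "cuda") -> torch.Tensor:
    # use_dp=True case: spread all num_tokens * top_k slots evenly over the
    # local expert shard [local_expert_offset, local_expert_offset + local_num_experts),
    # giving m_local = num_tokens * top_k / local_num_experts per local expert.
    slots = torch.arange(num_tokens * top_k, device=device) % local_num_experts
    return (slots + local_expert_offset).view(num_tokens, top_k).to(torch.int32)

def make_global_dummy_topk(num_tokens: int, top_k: int, num_experts: int,
                           device: str = "cuda") -> torch.Tensor:
    # use_dp=False case: slots span all experts, so only ~1/ep_size of them
    # land on the local shard and m_local = num_tokens * top_k / num_experts.
    return torch.randint(0, num_experts, (num_tokens, top_k), device=device,
                         dtype=torch.int32)
```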
Force-pushed from 7ebca22 to 665506c.
/bot run (previously blocked by a test that has since been waived in #13685)

PR_Github #46586 [ run ] triggered by Bot. Commit:
📝 Walkthrough

The PR extends the MoE autotuning infrastructure to support configurable dummy top-k generation modes (balanced or random) via an environment variable, and propagates the new tune_max_num_tokens and use_dp parameters through the MoE runners.

Changes: MoE Autotuning with Configurable Dummy Distributions
Sequence Diagram

```mermaid
sequenceDiagram
participant MoEModule as TRTLLMGenFusedMoE
participant Backend as MoEOpBackend
participant Runner as MoERunner
participant Autotuner as prepare_dummy_topk_and_hook
participant EnvConfig as Env / Distribution Mode
MoEModule->>Backend: run_fp8/fp4_block_scale_moe(tune_max_num_tokens, use_dp)
Backend->>Runner: construct with tune_max_num_tokens, use_dp
Runner->>Runner: get_dynamic_tensor_specs(use_dp) → adjusted buckets
Runner->>Runner: get_tuning_config(tune_max_num_tokens, use_dp)
Runner->>Autotuner: prepare_dummy_topk_and_hook(local_num_experts, use_dp, ...)
Autotuner->>EnvConfig: read TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION
EnvConfig-->>Autotuner: balanced | random
alt Balanced Mode
Autotuner->>Autotuner: make_balanced_dummy_topk(local_num_experts, offset)
else Random Mode
Autotuner->>Autotuner: random dummy top-k (local or global)
end
Autotuner-->>Runner: dummy_topk, hook callback
Runner-->>Backend: tuned parameters
Backend->>Backend: use_dp → forward to torch.ops.trtllm
```
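A hedged sketch of the dispatch the diagram describes, assuming prepare_dummy_topk_and_hook boils down to an env-var read plus one of the two generators. The env-var name comes from the walkthrough; the body and default mode are illustrative assumptions, not the PR's code:

```python
import os
import torch

def prepare_dummy_topk(num_tokens: int, top_k: int, num_experts: int,
                       local_num_experts: int, local_expert_offset: int,
                       use_dp: bool) -> torch.Tensor:
    # Env-var name taken from the walkthrough; "balanced" assumed as default.
    mode = os.environ.get("TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION", "balanced")
    lo = local_expert_offset if use_dp else 0
    n = local_num_experts if use_dp else num_experts
    if mode == "balanced":
        # Deterministic, perfectly even slot assignment over the chosen range.
        slots = torch.arange(num_tokens * top_k, device="cuda") % n
    else:
        # Random assignment over the same range (local for DP, global otherwise).
        slots = torch.randint(0, n, (num_tokens * top_k,), device="cuda")
    return (slots + lo).view(num_tokens, top_k).to(torch.int32)
```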
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
1-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA SPDX header to this modified file.
This Python source file is missing the repository-standard copyright/license block, so it will fail the header requirement applied to modified source files.
As per coding guidelines,
All source files (.cpp, .h, .cu, .py) should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` around lines 1 - 18, Add the standard NVIDIA SPDX copyright/license header (including the current year and Apache-2.0 notice) to the top of this Python module so it meets the repository header requirement; insert the header above the existing imports in tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (the file defining symbols like _MOE_AUTOTUNE_DUMMY_DISTRIBUTION_ENV, ActType_TrtllmGen, Fp4QuantizedTensor, AutoTuner, etc.), ensuring the header matches the project’s canonical format used for .py files and includes the SPDX identifier and full Apache 2.0 license notice.
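For illustration only — the canonical header text should be copied verbatim from an existing file in the repo. A typical SPDX form of the header the finding asks for looks like:

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```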
421-430: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include the new tuning regime in the autotuner cache key.

Line 426 says `unique_id()` feeds the autotuner cache key, but these runners now tune against different workloads based on use_dp and tune_max_num_tokens. With the current key, a tactic warmed up in pure-EP can be reused for DEP (or vice versa) and skip the retune this PR is trying to introduce.

Also applies to: 832-842, 1163-1179, 1504-1520, 1827-1843, 2142-2153
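One way to fold the new regime into the cache key, assuming `unique_id()` currently returns a per-runner string (the class shape and field names here are assumptions, not the actual MoERunner API):

```python
class MoERunner:
    def __init__(self, base_id: str, tune_max_num_tokens: int, use_dp: bool):
        self._base_id = base_id
        self.tune_max_num_tokens = tune_max_num_tokens
        self.use_dp = use_dp

    def unique_id(self) -> str:
        # Fold the tuning regime into the autotuner cache key so a tactic
        # warmed up for pure-EP is never reused for DEP (or vice versa), and
        # different token caps get separate cache entries.
        return f"{self._base_id}_dp{int(self.use_dp)}_t{self.tune_max_num_tokens}"
```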
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
625-679: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread the new autotune args through the remaining TRTLLM custom-op branches.

The backend-mediated fp8/fp4 paths now pass tune_max_num_tokens and use_dp, but the direct custom-op calls at Line 696 and Line 735 still rely on the default values. That leaves the has_w4a8_nvfp4_fp8 and has_w4a8_mxfp4_fp8 paths tuning against the old 8192/non-DP assumptions.

Suggested fix:
```diff
 outputs = torch.ops.trtllm.fp8_fp4_block_scale_moe_runner(
     router_logits,
     routing_bias,
     x,
@@
     act_type=0,
     topk_weights=token_final_scales,
     topk_ids=token_selected_experts,
     output=moe_output,
+    tune_max_num_tokens=self.max_num_tokens,
+    use_dp=self.use_dp,
 )
@@
 result = torch.ops.trtllm.e4m3_mxe2m1_block_scale_moe_runner(
     router_logits,
     routing_bias,
     x,
@@
     0,  # act_type
     token_final_scales,
     token_selected_experts,
     output=moe_output,
+    tune_max_num_tokens=self.max_num_tokens,
+    use_dp=self.use_dp,
 )
```

Also applies to: 696-766
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines 625 - 679, The fp4/fp8 custom-op branches for has_w4a8_nvfp4_fp8 and has_w4a8_mxfp4_fp8 need to pass the new autotuner args; in the call sites that invoke the TRTLLM custom ops (the calls analogous to op_backend.run_fp4_block_scale_moe used in the has_nvfp4/... branch), add tune_max_num_tokens=self.max_num_tokens and use_dp=self.use_dp to the argument list so those paths use the same token limit and DP flag as the backend-mediated paths; locate the calls inside the has_w4a8_nvfp4_fp8 and has_w4a8_mxfp4_fp8 branches and append those two named parameters to each op_backend.run_* invocation.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py
PR_Github #46586 [ run ] completed with state

/bot run

PR_Github #46595 [ run ] triggered by Bot. Commit:

PR_Github #46595 [ run ] completed with state

Waived in #13717.
Summary by CodeRabbit
New Features
Description
This PR improves TRTLLM MoE perf in the DEP case. With DP (Attention DP + MoE EP), there are ep_size * runtime_max_tokens_per_rank total tokens in the system. In an ideal world, each rank gets runtime_max_tokens_per_rank of them. Therefore, during autotune, we must ensure that all runtime_max_tokens_per_rank tokens after top-k expansion are local to each rank, so we profile with the right amount of workload. We do that by generating the expert distribution in the range [0, num_local_experts) and offsetting it by ep_rank * num_local_experts.

Without DP, the system processes max_num_tokens tokens across all EP ranks, so each rank would get max_num_tokens / ep_size tokens in an ideal world. To simulate the reduced workload, each rank simply generates the expert distribution across all experts and lets the MoE op itself filter out the non-local part of the distribution.
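To make the m_local arithmetic concrete, here is a worked example with illustrative DS-R1-like numbers (all values are assumptions for the sake of the calculation, not measured figures from this PR):

```python
# Illustrative shape: 256 routed experts, top_k=8, 8-way EP.
num_experts, top_k, ep_size = 256, 8, 8
runtime_max_tokens_per_rank = 256
local_num_experts = num_experts // ep_size  # 32

# use_dp=True: every top-k slot lands on the local shard.
m_local_dp = runtime_max_tokens_per_rank * top_k / local_num_experts  # 64.0

# use_dp=False: slots span all experts; only ~1/ep_size are local.
m_local_ep = runtime_max_tokens_per_rank * top_k / num_experts        # 8.0
```

The 8x gap between the two values is exactly why autotuning with the wrong distribution profiles the wrong per-expert workload.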
TPS/gpu of Qwen3-235B-A22B NVFP4 DEP2 1k/2k

TPS/gpu of DS-R1 NVFP4 DEP8 1k/2k
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.