
[None][perf] Improve TRTLLM MoE autotune in DEP #13667

Open
rosenrodt wants to merge 1 commit into NVIDIA:main from rosenrodt:perf-trtllm-moe-dep

Conversation

@rosenrodt
Collaborator

@rosenrodt rosenrodt commented Apr 30, 2026

Summary by CodeRabbit

New Features

  • Added configurable MoE autotuning through environment variable support, enabling selection between balanced or random dummy expert distributions
  • Enabled data parallelism mode support for Mixture of Experts operations with automatic tensor specification bucketing adjustments
  • Introduced maximum token configuration capability for enhanced MoE autotuning control and optimization

Description

This PR improves TRTLLM MoE performance in the DEP case. With DP (Attention DP + MoE EP), there are `ep_size * runtime_max_tokens_per_rank` total tokens in the system. In an ideal world, each rank gets `runtime_max_tokens_per_rank` of them. Therefore, during autotune we must ensure that all `runtime_max_tokens_per_rank` tokens, after top-k expansion, stay local to each rank so that we profile with the right amount of workload. We do that by generating the expert distribution in the range `[0, num_local_experts)` and offsetting it by `ep_rank * num_local_experts`.

Without DP, the system processes `max_num_tokens` tokens across all EP ranks, so in an ideal world each rank gets `max_num_tokens / ep_size` tokens. To simulate this reduced workload, each rank simply generates the expert distribution across all experts and lets the MoE op itself filter out the non-local part of the distribution.
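To make this concrete, below is a minimal sketch of how a balanced, rank-local dummy top-k could be built for the DP case; the function and variable names are illustrative, not the actual implementation in this PR.

```python
import torch

def make_local_dummy_topk(num_tokens: int, top_k: int, num_local_experts: int,
                          ep_rank: int) -> torch.Tensor:
    """Illustrative sketch: spread num_tokens * top_k dummy slots evenly across the
    local expert shard, then shift the IDs into this rank's global expert range."""
    slots = num_tokens * top_k
    # Balanced over local experts: each local expert receives ~slots / num_local_experts slots.
    local_ids = torch.arange(slots, dtype=torch.int32) % num_local_experts
    # Offset so the IDs land in [ep_rank * num_local_experts, (ep_rank + 1) * num_local_experts).
    return (local_ids + ep_rank * num_local_experts).reshape(num_tokens, top_k)

# Example: 4 tokens, top_k=2, 8 local experts on EP rank 1 -> expert IDs in [8, 16).
dummy_topk = make_local_dummy_topk(num_tokens=4, top_k=2, num_local_experts=8, ep_rank=1)
```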

TPS/gpu of Qwen3-235B-A22B NVFP4 DEP2 1k/2k

| bsz | baseline | new (this PR + #12440) | Δ |
|---|---|---|---|
| 8 | 340.9 | 347.5 | +1.94% |
| 32 | 1186.1 | 1186.2 | +0.01% |
| 64 | 2017.8 | 2069.1 | +2.54% |
| 128 | 3335.6 | 3404.6 | +2.07% |
| 256 | 5335.9 | 5361.7 | +0.48% |
| 512 | 7157.5 | 8151.9 | +13.89% |
| 1024 | 3724.8 | 3655.2 | −1.87% |
| 2048 | 3869.3 | 3615.9 | −6.55% |

TPS/gpu of DS-R1 NVFP4 DEP8 1k/2k

| bsz | baseline | new (this PR + #12440) | Δ |
|---|---|---|---|
| 8 | 74.93 | 77.85 | +3.90% |
| 64 | 504.81 | 505.61 | +0.16% |
| 128 | 905.26 | 907.41 | +0.24% |
| 512 | 2700.5 | 2661.4 | −1.45% |
| 1024 | 3450.0 | 4213.5 | +22.13% |
| 2048 | 5244.4 | 5725.4 | +9.17% |
| 4096 | 3375.3 | 3429.1 | +1.59% |

Test Coverage

```
python -m pytest -s -v tests/unittest/_torch/modules/moe/test_moe_backend.py -k TRTLLM
```

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@rosenrodt rosenrodt requested review from a team as code owners April 30, 2026 14:50
@rosenrodt rosenrodt requested review from hyukn and yuxianq April 30, 2026 14:50
@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46394 [ run ] triggered by Bot. Commit: 7ebca22 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46394 [ run ] completed with state SUCCESS. Commit: 7ebca22
/LLM/main/L0_MergeRequest_PR pipeline #36473 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

by localizing expert distribution to local experts during autotune

use_dp=True — mimics load-balanced DEP (Data + Expert Parallel + Attention DP). The system globally holds `ep_size * runtime_max_tokens_per_rank` tokens which A2A perfectly distributes across the ep_size ranks, so each rank's autotune dummy
carries `runtime_max_tokens_per_rank` rows whose top_k slots all live in the local expert shard `[local_expert_offset, local_expert_offset + local_num_experts)`. m_local per local expert = `num_tokens * top_k / local_num_experts`.

use_dp=False — pure EP (no Attention DP). The system never holds `ep_size * runtime_max_tokens_per_rank` tokens; each rank sees its own `runtime_max_tokens_per_rank` rows whose top_k slots span global experts, so only ~1/ep_size of slots hit local experts. m_local per local expert = `num_tokens * top_k / num_experts`.
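As a quick back-of-envelope check of the two per-expert load formulas above (all numbers below are made up purely for illustration):

```python
# Illustrative numbers only.
num_tokens = 512                              # runtime_max_tokens_per_rank
top_k = 8
num_experts = 256                             # global expert count
ep_size = 8
local_num_experts = num_experts // ep_size    # 32 experts per rank

# use_dp=True: every top_k slot lands on the local expert shard.
m_local_dp = num_tokens * top_k / local_num_experts   # 4096 / 32 = 128 rows per local expert

# use_dp=False: slots span all experts, so only ~1/ep_size hit local ones.
m_local_ep = num_tokens * top_k / num_experts         # 4096 / 256 = 16 rows per local expert
```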

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
@rosenrodt rosenrodt force-pushed the perf-trtllm-moe-dep branch from 7ebca22 to 665506c on May 3, 2026 07:42
@rosenrodt
Collaborator Author

rosenrodt commented May 3, 2026

/bot run

(was blocked by a test that has since been waived in #13685)

@tensorrt-cicd
Collaborator

PR_Github #46586 [ run ] triggered by Bot. Commit: 665506c Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented May 3, 2026

📝 Walkthrough

The PR extends the MoE autotuning infrastructure to support configurable dummy top-k generation modes (balanced or random) via an environment variable, and propagates the new tune_max_num_tokens and use_dp parameters through runner classes, custom ops, backend interfaces, and module-level integration to enable data-parallel and token-size-aware tuning configuration.

Changes

MoE Autotuning with Configurable Dummy Distributions

| Layer / File(s) | Summary |
|---|---|
| **Configuration & Constants**<br>`tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` (lines 17–53) | Environment variable `TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION` added with supported modes (balanced, random); module-level comments document how the distribution shape depends on `use_dp`. |
| **Autotuning Core Logic**<br>`tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` (lines 55–325) | `prepare_dummy_topk_and_hook` refactored to accept `local_num_experts`, `local_expert_offset`, and `use_dp`; new balanced and random dummy top-k generation functions; autotuner helper recreates dummy tensors using the selected regime on token-shape changes. |
| **Runner Classes & Custom Ops**<br>`tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` (lines 394–2421) | Six MoE runner classes (`FP4BlockScaleMoERunner`, `FP8BlockScaleMoERunner`, `MxE4m3MxE2m1BlockScaleMoERunner`, `E4m3MxE2m1BlockScaleMoERunner`, `Bf16MxE2m1BlockScaleMoERunner`, `FP8FP4BlockScaleMoERunner`) extended to accept `tune_max_num_tokens` and `use_dp`; `get_dynamic_tensor_specs` and `get_tuning_config` updated to handle `use_dp`-dependent deflation and extra buckets; custom-op signatures and call sites updated to pass these parameters and invoke `prepare_dummy_topk_and_hook` with local/expert/distribution details. |
| **Backend Interface**<br>`tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py` | `MoEOpBackend`, `TRTLLMOpBackend`, and `FlashinferOpBackend` method signatures updated to accept `use_dp: bool = False` in `run_fp8_block_scale_moe` and `run_fp4_block_scale_moe`; `TRTLLMOpBackend` forwards `use_dp` to the corresponding `torch.ops.trtllm.*` runner calls. |
| **Module-Level Integration**<br>`tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` | `TRTLLMGenFusedMoE.__init__` caches `self.max_num_tokens` from the model config; `run_moe()` passes `tune_max_num_tokens=self.max_num_tokens` and `use_dp=self.use_dp` into both `run_fp8_block_scale_moe` and `run_fp4_block_scale_moe` backend calls. |
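As a side note, a minimal sketch of how the new environment variable could be consumed is shown below; the variable name and its two modes come from the table above, while the default and validation shown here are assumptions, not the actual code.

```python
import os

# Selects the dummy expert-distribution regime used during MoE autotune.
# Supported modes per this PR: "balanced" and "random" (a "balanced" default is assumed here).
_mode = os.environ.get("TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION", "balanced")
if _mode not in ("balanced", "random"):
    raise ValueError(
        f"Unsupported TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION={_mode!r}; "
        "expected 'balanced' or 'random'")
```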

Sequence Diagram

```mermaid
sequenceDiagram
    participant MoEModule as TRTLLMGenFusedMoE
    participant Backend as MoEOpBackend
    participant Runner as MoERunner
    participant Autotuner as prepare_dummy_topk_and_hook
    participant EnvConfig as Env / Distribution Mode

    MoEModule->>Backend: run_fp8/fp4_block_scale_moe(tune_max_num_tokens, use_dp)
    Backend->>Runner: construct with tune_max_num_tokens, use_dp
    Runner->>Runner: get_dynamic_tensor_specs(use_dp) → adjusted buckets
    Runner->>Runner: get_tuning_config(tune_max_num_tokens, use_dp)
    Runner->>Autotuner: prepare_dummy_topk_and_hook(local_num_experts, use_dp, ...)
    Autotuner->>EnvConfig: read TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION
    EnvConfig-->>Autotuner: balanced | random
    alt Balanced Mode
        Autotuner->>Autotuner: make_balanced_dummy_topk(local_num_experts, offset)
    else Random Mode
        Autotuner->>Autotuner: random dummy top-k (local or global)
    end
    Autotuner-->>Runner: dummy_topk, hook callback
    Runner-->>Backend: tuned parameters
    Backend->>Backend: use_dp → forward to torch.ops.trtllm
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.96%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: improving TRTLLM MoE autotuning specifically for DEP (Data + Expert Parallel) scenarios, which aligns with the core objective of the PR. |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation of the issue, solution approach, performance results with detailed metrics, test coverage recommendations, and confirms the checklist. All required sections are present and well-documented. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)

1-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA SPDX header to this modified file.

This Python source file is missing the repository-standard copyright/license block, so it will fail the header requirement applied to modified source files.

As per coding guidelines, All source files (.cpp, .h, .cu, .py) should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice.
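For reference, a generic SPDX-style header of the kind requested is sketched below; the repository's canonical wording and year may differ, so treat it as a placeholder.

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```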

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` around lines 1 - 18,
Add the standard NVIDIA SPDX copyright/license header (including the current
year and Apache-2.0 notice) to the top of this Python module so it meets the
repository header requirement; insert the header above the existing imports in
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (the file defining
symbols like _MOE_AUTOTUNE_DUMMY_DISTRIBUTION_ENV, ActType_TrtllmGen,
Fp4QuantizedTensor, AutoTuner, etc.), ensuring the header matches the project’s
canonical format used for .py files and includes the SPDX identifier and full
Apache 2.0 license notice.

421-430: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include the new tuning regime in the autotuner cache key.

Line 426 says unique_id() feeds the autotuner cache key, but these runners now tune against different workloads based on use_dp and tune_max_num_tokens. With the current key, a tactic warmed up in pure-EP can be reused for DEP (or vice versa) and skip the retune this PR is trying to introduce.

Also applies to: 832-842, 1163-1179, 1504-1520, 1827-1843, 2142-2153
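A hypothetical illustration of the point above (the attribute names are placeholders; the real runners' `unique_id()` contents are not shown in this review): fold the new tuning knobs into the cache key so a tactic warmed up in pure-EP is not reused for DEP.

```python
# Hypothetical sketch only -- actual runner attributes may differ.
class SomeBlockScaleMoERunner:
    def __init__(self, top_k: int, tune_max_num_tokens: int, use_dp: bool):
        self.top_k = top_k                              # stand-in for whatever already feeds the key
        self.tune_max_num_tokens = tune_max_num_tokens
        self.use_dp = use_dp

    def unique_id(self):
        # Include the tuning regime so the autotuner cache keeps DEP and pure-EP tactics separate.
        return (self.top_k, self.use_dp, self.tune_max_num_tokens)
```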

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

625-679: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread the new autotune args through the remaining TRTLLM custom-op branches.

The backend-mediated fp8/fp4 paths now pass tune_max_num_tokens and use_dp, but the direct custom-op calls at Line 696 and Line 735 still rely on the default values. That leaves the has_w4a8_nvfp4_fp8 and has_w4a8_mxfp4_fp8 paths tuning against the old 8192/non-DP assumptions.

Suggested fix
```diff
             outputs = torch.ops.trtllm.fp8_fp4_block_scale_moe_runner(
                 router_logits,
                 routing_bias,
                 x,
@@
                 act_type=0,
                 topk_weights=token_final_scales,
                 topk_ids=token_selected_experts,
                 output=moe_output,
+                tune_max_num_tokens=self.max_num_tokens,
+                use_dp=self.use_dp,
             )
@@
             result = torch.ops.trtllm.e4m3_mxe2m1_block_scale_moe_runner(
                 router_logits,
                 routing_bias,
                 x,
@@
                 0,  # act_type
                 token_final_scales,
                 token_selected_experts,
                 output=moe_output,
+                tune_max_num_tokens=self.max_num_tokens,
+                use_dp=self.use_dp,
             )
```

Also applies to: 696-766

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
625 - 679, The fp4/fp8 custom-op branches for has_w4a8_nvfp4_fp8 and
has_w4a8_mxfp4_fp8 need to pass the new autotuner args; in the call sites that
invoke the TRTLLM custom ops (the calls analogous to
op_backend.run_fp4_block_scale_moe used in the has_nvfp4/... branch), add
tune_max_num_tokens=self.max_num_tokens and use_dp=self.use_dp to the argument
list so those paths use the same token limit and DP flag as the backend-mediated
paths; locate the calls inside the has_w4a8_nvfp4_fp8 and has_w4a8_mxfp4_fp8
branches and append those two named parameters to each op_backend.run_*
invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c61ea722-8ace-4bea-8bd0-44d94f70a233

📥 Commits

Reviewing files that changed from the base of the PR and between 3721673 and 665506c.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py

@tensorrt-cicd
Collaborator

PR_Github #46586 [ run ] completed with state SUCCESS. Commit: 665506c
/LLM/main/L0_MergeRequest_PR pipeline #36635 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46595 [ run ] triggered by Bot. Commit: 665506c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46595 [ run ] completed with state SUCCESS. Commit: 665506c
/LLM/main/L0_MergeRequest_PR pipeline #36642 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@SimengLiu-nv
Collaborator

kv_cache/test_prefix_aware_scheduling.py::TestServePrefixAwareScheduling::test_multi_round_qa_shared_prefix[swa-chunked] is a new test I added. Trying to understand if the failure is related to the code changes or flaky. Will report back with findings.

@SimengLiu-nv
Collaborator

> kv_cache/test_prefix_aware_scheduling.py::TestServePrefixAwareScheduling::test_multi_round_qa_shared_prefix[swa-chunked] is a new test I added. Trying to understand if the failure is related to the code changes or flaky. Will report back with findings.

Waived in #13717.

