[ROCm] Allow bf16/bf16/fp32 in nvte_multi_tensor_gemm dispatcher #573

Open

lizamd wants to merge 1 commit into ROCm:dev from lizamd:fix/ck-grouped-gemm-bf16-fp32-output


Conversation

@lizamd

@lizamd lizamd commented May 4, 2026

The is_supported_dtype check in nvte_multi_tensor_gemm previously required A==B==D for the fp16/bf16 path, which rejected the common bf16/bf16/fp32 case where the GEMM output is fp32 for gradient accumulation. This forced a fallback to multi_stream_cublas_gemm (a per-expert hipblaslt loop), bypassing the CK grouped GEMM kernel entirely on ROCm.

The CK FP16 dispatcher (ck_tile_grouped_gemm_fp16_dispatch) already supports independent D dtype via TRANSFORMER_ENGINE_TYPE_SWITCH_NON_FP8ONLY (fp32, fp16, bf16). The wrapper check is the only thing that prevents it from being reached.

Relaxed to require A==B in fp16/bf16 and D in {fp32, fp16, bf16}, which matches what the CK dispatcher actually accepts. Verified on Qwen3-30B-A3B MoE training on MI355X (gfx950): the fallback warning rate drops from ~1040/step (every GEMM) to ~28/step (the remaining ~3% are shapes that the CK kernel itself rejects via Kernel::IsSupportedArgument). Throughput is essentially unchanged in this workload because hipblaslt's per-shape autotuning happens to be competitive with the hardcoded CK tile configs for these MoE shapes; the benefit will materialize once the CK dispatcher gains more tile configs (or shape-aware tile selection by aggregate M).
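
A rough sketch of the relaxed acceptance rule (hypothetical function and enum names, not the actual TransformerEngine code; the real guard lives in the is_supported_dtype logic of nvte_multi_tensor_gemm):

// Sketch only: the old rule required A == B == D in {fp16, bf16};
// the new rule requires A == B in {fp16, bf16} and lets D be fp32, fp16, or bf16.
enum class DType { kFloat16, kBFloat16, kFloat32 };

bool ck_fp16_path_supported(DType a, DType b, DType d) {
  const bool inputs_ok =
      (a == b) && (a == DType::kFloat16 || a == DType::kBFloat16);
  const bool output_ok =
      (d == DType::kFloat32) || (d == DType::kFloat16) || (d == DType::kBFloat16);
  return inputs_ok && output_ok;
}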

This is a CUDA path file; the same patch applies to the AMD path via hipify. No CUDA-side behavior change since cuBLAS/cutlass dispatch on NVIDIA still requires A==B==D in the cutlass fast-path pre-conditions.

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@lizamd lizamd force-pushed the fix/ck-grouped-gemm-bf16-fp32-output branch 2 times, most recently from 764cb65 to ff19241 on May 5, 2026 at 00:02
@matthiasdiener matthiasdiener added the ci-level 1 (CI test level 1) label on May 5, 2026
@wenchenvincent
Collaborator

@matthiasdiener @aris134 Could you review this PR?

Collaborator

@wangye805 wangye805 left a comment


Can you edit an existing test or add a new test showing that, with your change, bf16/fp16 inputs with fp32 output now go through the CK flow correctly? Also, please paste some benchmarking data into this ticket for future reference.

Comment on lines +1166 to +1171
// CK FP16/BF16 grouped GEMM dispatcher (ck_tile_grouped_gemm_fp16_dispatch)
// already supports independent D dtype via TRANSFORMER_ENGINE_TYPE_SWITCH_NON_FP8ONLY
// (fp32, fp16, bf16). The previous check required A==B==D, which incorrectly
// rejected the common bf16/bf16/fp32 case (training with fp32 gradient
// accumulation), forcing a fallback to the per-expert hipblaslt loop.
// Relaxed to require A==B in fp16/bf16 and D in {fp32, fp16, bf16}.
Contributor


I think this explanation may be better suited for the PR description rather than an inline code comment.

Contributor

@aris134 aris134 left a comment


Agreed that the CK dispatch logic supports the bf16/fp32 combination. I would remove the detailed history comment about the previous fallback behavior, which is better suited to the PR itself.

@lizamd
Author

lizamd commented May 6, 2026 via email

Follow-ups (out of scope for this PR):

- Add more CK tile configs (e.g. TileCfg_64x256x64, TileCfg_128x256x64)
  and shape-aware tile selection by aggregate M per call (a rough sketch
  follows this list). Currently
  throughput is unchanged on this workload because the existing hipblaslt
  fallback is well-tuned and the 3 hardcoded CK tile configs
  (TileCfg_256x256x64, TileCfg_256x128x64, TileCfg_256x128x64_padding)
  don't fit MoE shapes (highly variable per-expert M) optimally. Real
  CK-grouped-GEMM perf wins will materialize once tile selection adapts
  to M.
- Investigate the ~3% of GEMMs that hit Kernel::IsSupportedArgument
  rejection (likely small per-expert M values that fail tile-size
  constraints in the current TileCfg_256x* instantiations).
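
A rough sketch of what shape-aware tile selection by aggregate M could look like (config names and thresholds are placeholders, not proposed values):

#include <cstdint>
#include <numeric>
#include <vector>

// Placeholder configs mirroring the TileCfg_* naming used by the CK dispatcher.
enum class TileCfg { k256x256x64, k256x128x64, k64x256x64 };

// Pick a config from the total M summed over all experts in one grouped call;
// the thresholds here are purely illustrative and would come from benchmarking.
TileCfg select_tile_cfg(const std::vector<int64_t>& per_expert_m) {
  const int64_t total_m =
      std::accumulate(per_expert_m.begin(), per_expert_m.end(), int64_t{0});
  if (total_m >= 8192) return TileCfg::k256x256x64;
  if (total_m >= 1024) return TileCfg::k256x128x64;
  return TileCfg::k64x256x64;
}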
@lizamd lizamd force-pushed the fix/ck-grouped-gemm-bf16-fp32-output branch from ff19241 to d416572 on May 7, 2026 at 17:45
@lizamd
Author

lizamd commented May 7, 2026

@wangye805 @aris134 could you check the new commit?

)
@pytest.mark.parametrize("input_dtype", [torch.bfloat16, torch.float16])
@pytest.mark.parametrize("layout", ["TN", "NT"])
def test_grouped_gemm_fp32_output(input_dtype, layout):
Collaborator


Can it be done by adding configs/parameters to test_grouped_gemm?
