Add SP benchmarks for GPT-OSS-120B MoE model (GEMM+RS, RS+RMSNorm, E2E) #513
Open
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
30878cf to 0dae5f3
mawad-amd approved these changes on Apr 21, 2026
Summary
Add three Sequence Parallelism (SP) benchmarks targeting the GPT-OSS-120B MoE model. These cover the SP segment between attention output and MoE input:
GEMM shapes: K = 64 × 64 = 4096 (attention heads × head dimension), N = 2880 (hidden size). M values: 32 (decode), 896 (hybrid), 2048 (prefill).
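For reviewers who want to sanity-check the reported numbers, here is a minimal sketch of the per-rank arithmetic these shapes imply. It assumes a row-parallel GEMM with K split evenly across ranks, 2-byte elements (bf16/fp16), and a send-side byte count for reduce-scatter. The helper names are hypothetical and the accounting used by the actual benchmarks may differ.

```python
# Shapes taken from the PR summary.
K = 64 * 64  # 4096
N = 2880
M_VALUES = {"decode": 32, "hybrid": 896, "prefill": 2048}
TP_DEGREES = (2, 4, 8)

# Assumption: 2-byte elements (bf16/fp16); the PR does not state the dtype.
BYTES_PER_ELEMENT = 2


def gemm_flops_per_rank(m, n, k, tp):
    """FLOPs for one rank of a row-parallel GEMM: [m, k/tp] @ [k/tp, n]."""
    return 2 * m * n * (k // tp)


def reduce_scatter_bytes_per_rank(m, n, tp, elem_bytes=BYTES_PER_ELEMENT):
    """Bytes one rank sends in a reduce-scatter along M (assumed accounting).

    Each rank holds a full [m, n] partial result, keeps its own [m/tp, n]
    shard, and sends the other (tp - 1) shards to their owners.
    """
    return (tp - 1) * (m // tp) * n * elem_bytes


for name, m in M_VALUES.items():
    for tp in TP_DEGREES:
        flops = gemm_flops_per_rank(m, N, K, tp)
        comm = reduce_scatter_bytes_per_rank(m, N, tp)
        print(f"{name:8s} M={m:5d} TP={tp}: {flops:>14,d} FLOPs/rank, {comm:>12,d} B sent/rank")
```

Note that the per-rank GEMM work shrinks as TP grows while the bytes sent per rank grow with (tp - 1) / tp, which is why the sweep over TP degrees is informative.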
New files
- `benchmark/ops/bench_matmul_reduce_scatter_stages.py`: Fused GEMM + Reduce-Scatter stage profiler. Compares unfused `torch.mm` + `dist.reduce_scatter_tensor` against iris `matmul_reduce_scatter` with a tile-config sweep (bm × bn). Models the O_proj row-parallel GEMM followed by RS.
- `benchmark/ops/bench_rs_rmsnorm.py`: RS + RMSNorm stage profiler. Compares unfused NCCL `reduce_scatter_tensor` + aiter Triton RMSNorm against an iris shmem "fused-ready" variant (same ops, but with buffers allocated in iris shared memory, compatible with the aiter fused kernel calling convention).
- `benchmark/ops/bench_sp_layer_e2e.py`: End-to-end SP segment benchmark combining all three stages (GEMM + RS + RMSNorm). Compares unfused `torch.mm` + `dist.reduce_scatter_tensor` + RMSNorm against fused iris `matmul_reduce_scatter` + RMSNorm, with a tile-config sweep.

All three benchmarks use the `iris.bench` framework, sweep across TP degrees (2, 4, 8 ranks), and report FLOPs and communication bytes.

Test plan
- `benchmark/ops/bench_matmul_reduce_scatter_stages.py` with 2, 4, 8 ranks
- `benchmark/ops/bench_rs_rmsnorm.py` with 2, 4, 8 ranks
- `benchmark/ops/bench_sp_layer_e2e.py` with 2, 4, 8 ranks
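
As a reading aid, the three stages being benchmarked (row-parallel GEMM, reduce-scatter, RMSNorm) can be sketched as a single-process, pure-Python reference. This is not code from the PR: the function names are hypothetical, the ranks are simulated in a loop, and it assumes the reduce-scatter splits along M and that RMSNorm is the usual `x / sqrt(mean(x^2) + eps) * gamma`. It is only meant to show why the sharded pipeline matches the unsharded computation.

```python
import math
import random


def matmul(a, b):
    """Multiply two row-major matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rmsnorm(rows, gamma, eps=1e-6):
    """Apply RMSNorm over the last dimension of each row."""
    out = []
    for row in rows:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * g for v, g in zip(row, gamma)])
    return out


def sp_segment(x, w, gamma, tp):
    """Simulate row-parallel GEMM -> reduce-scatter along M -> RMSNorm.

    Returns one normalized [m/tp, n] shard per simulated rank.
    """
    m, k, n = len(x), len(w), len(w[0])
    ks, ms = k // tp, m // tp
    # Rank r holds columns [r*ks, (r+1)*ks) of x and the matching rows of w,
    # so each rank produces a full-size [m, n] partial sum.
    partials = [
        matmul([row[r * ks:(r + 1) * ks] for row in x], w[r * ks:(r + 1) * ks])
        for r in range(tp)
    ]
    shards = []
    for r in range(tp):
        # Reduce-scatter: rank r ends up with the summed rows [r*ms, (r+1)*ms).
        reduced = [
            [sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(r * ms, (r + 1) * ms)
        ]
        shards.append(rmsnorm(reduced, gamma))
    return shards


# Small illustrative shapes; the benchmark shapes are far larger.
random.seed(0)
m, k, n, tp = 8, 16, 6, 4
x = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
w = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
gamma = [random.uniform(0.5, 1.5) for _ in range(n)]

shards = sp_segment(x, w, gamma, tp)
sharded = [row for shard in shards for row in shard]
reference = rmsnorm(matmul(x, w), gamma)
max_err = max(abs(a - b) for ra, rb in zip(sharded, reference) for a, b in zip(ra, rb))
print("max abs difference vs. unsharded reference:", max_err)
```

Because RMSNorm acts on each row independently, normalizing after the reduce-scatter touches only the rows a rank owns, which is what makes the RS + RMSNorm pair a natural fusion candidate.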