Add SP benchmarks for GPT-OSS-120B MoE model (GEMM+RS, RS+RMSNorm, E2E) by aamarnat · Pull Request #513 · ROCm/iris

aamarnat · 2026-04-21T20:11:56Z

Summary

Add three Sequence Parallelism (SP) benchmarks targeting the GPT-OSS-120B MoE model. These cover the SP segment between attention output and MoE input:

O_proj (row-parallel): [M, K_local] x [K_local, N] → partial [M, N]
  → Reduce-Scatter → [M/tp, N]
  → RMSNorm on [M/tp, N]
  → (MoE receives [M/tp, N] directly — handles its own AG at exit)

K = 64 heads × 64 head dim = 4096 (attention output width), N = 2880 (hidden size). M values: 32 (decode), 896 (hybrid), 2048 (prefill).
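The shape flow above can be checked with a small single-process reference. The sketch below is not code from this PR: the helper names (`matmul`, `rmsnorm`, `sp_segment`) are hypothetical, the RMSNorm omits the learned weight, and `eps` is an assumed value. It simulates `tp` ranks in plain Python to illustrate that a row-parallel GEMM followed by reduce-scatter and a per-shard RMSNorm should match an unsharded GEMM followed by RMSNorm.

```python
import math


def matmul(a, b):
    """Plain-Python product of a [m, k] and b [k, n] (lists of lists)."""
    n = len(b[0])
    return [[sum(x * b[i][j] for i, x in enumerate(row)) for j in range(n)]
            for row in a]


def rmsnorm(rows, eps=1e-6):
    """Row-wise RMSNorm without a learned weight (assumption)."""
    out = []
    for row in rows:
        scale = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * scale for v in row])
    return out


def sp_segment(x, w, tp):
    """Simulate O_proj (row-parallel) -> reduce-scatter -> RMSNorm.

    x is [M, K], w is [K, N]. Returns one [M/tp, N] output per simulated rank.
    """
    m, k, n = len(x), len(x[0]), len(w[0])
    assert m % tp == 0 and k % tp == 0, "M and K must divide evenly by tp"
    m_local, k_local = m // tp, k // tp

    # Each rank multiplies its K-slice: [M, K_local] x [K_local, N] -> partial [M, N].
    partials = []
    for r in range(tp):
        x_r = [row[r * k_local:(r + 1) * k_local] for row in x]
        w_r = w[r * k_local:(r + 1) * k_local]
        partials.append(matmul(x_r, w_r))

    # Reduce-scatter: rank r keeps the summed rows [r*m_local, (r+1)*m_local).
    outputs = []
    for r in range(tp):
        reduced = [[sum(p[i][j] for p in partials) for j in range(n)]
                   for i in range(r * m_local, (r + 1) * m_local)]
        outputs.append(rmsnorm(reduced))  # RMSNorm on [M/tp, N]
    return outputs
```

Concatenating the per-rank outputs along the row dimension gives the same result as `rmsnorm(matmul(x, w))` up to floating-point summation order. Use tiny dimensions here; the plain-Python loops are far too slow for the real sizes.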

New files

benchmark/ops/bench_matmul_reduce_scatter_stages.py — Fused GEMM + Reduce-Scatter stage profiler. Compares unfused torch.mm + dist.reduce_scatter_tensor against iris matmul_reduce_scatter with tile-config sweep (bm × bn). Models the O_proj row-parallel GEMM followed by RS.
benchmark/ops/bench_rs_rmsnorm.py — RS + RMSNorm stage profiler. Compares unfused NCCL reduce_scatter_tensor + aiter Triton RMSNorm against an iris shmem "fused-ready" variant (same ops but buffers allocated in iris shared memory, compatible with aiter fused kernel calling convention).
benchmark/ops/bench_sp_layer_e2e.py — End-to-end SP segment benchmark combining all three stages (GEMM + RS + RMSNorm). Compares unfused torch.mm + dist.reduce_scatter_tensor + RMSNorm against fused iris matmul_reduce_scatter + RMSNorm, with tile-config sweep.
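To give a feel for what a bm × bn tile-config sweep covers, here is a minimal sketch that enumerates candidate tile shapes and counts the output tiles each one produces for a given [M, N]. The candidate tile sizes and the function names are hypothetical and are not taken from the benchmark scripts; the actual sweep in this PR may use different values and additional parameters.

```python
from itertools import product
from math import ceil


def tile_count(m, n, bm, bn):
    """Number of bm x bn output tiles needed to cover an [m, n] matrix."""
    return ceil(m / bm) * ceil(n / bn)


def sweep(m_values, n, bm_values, bn_values):
    """Enumerate every (M, bm, bn) combination with its tile count."""
    return [
        {"M": m, "bm": bm, "bn": bn, "tiles": tile_count(m, n, bm, bn)}
        for m, bm, bn in product(m_values, bm_values, bn_values)
    ]


if __name__ == "__main__":
    # M values and N come from the PR summary; tile sizes are illustrative only.
    for cfg in sweep((32, 896, 2048), 2880, (64, 128), (128, 256)):
        print(cfg)
```

Small M (decode) yields few tiles along the row dimension, so the choice of bn tends to dominate how much parallelism is available. That is one reason a sweep is more informative than a single fixed configuration.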

All three benchmarks use the iris.bench framework, sweep across TP degrees (2, 4, 8 ranks), and report FLOPs and communication bytes.
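As a rough guide to the reported quantities, the sketch below estimates per-rank GEMM FLOPs and reduce-scatter traffic for the shapes in the summary. It rests on several assumptions that the PR does not confirm: K_local = K / tp, 2-byte elements (for example bf16), and a ring-style accounting in which each rank sends tp - 1 chunks of shape [M/tp, N]. The function names are hypothetical and the benchmarks' own accounting may differ.

```python
def gemm_flops(m, k_local, n):
    """FLOPs for one [m, k_local] x [k_local, n] GEMM (multiply + add = 2)."""
    return 2 * m * k_local * n


def reduce_scatter_bytes(m, n, tp, elem_bytes=2):
    """Bytes sent per rank, assuming tp - 1 chunks of [m / tp, n] each."""
    return (tp - 1) * (m // tp) * n * elem_bytes


K, N = 4096, 2880  # from the PR summary

if __name__ == "__main__":
    for tp in (2, 4, 8):
        for m in (32, 896, 2048):
            flops = gemm_flops(m, K // tp, N)
            comm = reduce_scatter_bytes(m, N, tp)
            print(f"tp={tp} M={m} flops/rank={flops} rs_bytes/rank={comm}")
```

Under these assumptions, per-rank FLOPs shrink as tp grows while per-rank reduce-scatter bytes approach the size of the full [M, N] partial, so the communication share of the segment rises with the TP degree.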

Test plan

Run benchmark/ops/bench_matmul_reduce_scatter_stages.py with 2, 4, 8 ranks
Run benchmark/ops/bench_rs_rmsnorm.py with 2, 4, 8 ranks
Run benchmark/ops/bench_sp_layer_e2e.py with 2, 4, 8 ranks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aamarnat requested review from BKP, mawad-amd and neoblizz as code owners April 21, 2026 20:11

github-actions bot added the in-progress ("We are working on it") and iris ("Iris project issue") labels Apr 21, 2026

aamarnat and others added 2 commits April 21, 2026 20:27

SP benchmarks for GPT-OSS-120B MoE model, GEMM + RS + RMSNorm

63f4b80

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Apply Ruff auto-fixes

0dae5f3

aamarnat force-pushed the aamarnat/sp_benchmarks branch from 30878cf to 0dae5f3 Compare April 21, 2026 20:27

mawad-amd approved these changes Apr 21, 2026

aamarnat wants to merge 2 commits into main from aamarnat/sp_benchmarks
