Microbenchmarking, Torch+CSV-based #478
Conversation
Performance Regression Report

MI325 | PR commit: ddd17d4 | Base:
benchmark_attention (median 1.000x, min 0.650x, max 1.390x)
benchmark_casting (median 0.998x, min 0.920x, max 1.034x)
benchmark_gemm (median 1.001x, min 0.322x, max 2.361x)
benchmark_gemm_fp8 (median 0.986x, min 0.285x, max 1.610x)
benchmark_grouped_gemm (median 1.000x, min 0.439x, max 2.438x)
benchmark_normalization (median 1.006x, min 0.633x, max 2.066x)
MI355 | PR commit: ddd17d4 | Base:
benchmark_attention (median 1.000x, min 0.969x, max 1.014x)
benchmark_casting (median 0.998x, min 0.898x, max 1.138x)
benchmark_gemm (median 1.001x, min 0.935x, max 1.094x)
benchmark_gemm_fp8 (median 0.988x, min 0.401x, max 1.038x)
benchmark_grouped_gemm (median 0.999x, min 0.912x, max 1.134x)
benchmark_normalization (median 0.993x, min 0.399x, max 1.490x)
Micky774 left a comment:

A few general comments in addition to the inline ones:

- Regarding copyright, some spots are 2026 only while others are 2025-2026 -- is there a specific reason, or can we be consistent and only set 2026?
- It seems that `dtype=torch.bfloat16` is hard-coded -- can we generalize to allow for e.g. fp16 benchmarks?
- Can we document the `bench_fn` contract so that it's easier for new developers to contribute additional benchmarks?
- Can we have a more general `RECIPES` dict similar to NV's (TransformerEngine/benchmarks/linear/benchmark_grouped_linear.py, lines 60 to 65 in a0b88f4)? A sketch of the idea follows this list.
- Can we add a README.md to document the benchmarks?
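For the `RECIPES` point, a hedged sketch of the kind of dict meant; the entries are illustrative and not a copy of NV's benchmark_grouped_linear.py, with the recipe classes assumed to come from `transformer_engine.common.recipe`:

```python
from transformer_engine.common import recipe

# Benchmark label -> autocast recipe; None means the plain high-precision path.
# Illustrative entries only, not a reproduction of NV's file.
RECIPES = {
    "bf16": None,
    "fp8_delayed_scaling": recipe.DelayedScaling(),
    "fp8_current_scaling": recipe.Float8CurrentScaling(),
    "mxfp8": recipe.MXFP8BlockScaling(),
}
```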
```python
def _generate_gemm_test_cases():
    test_cases = []
    for M in M_SIZE_LIST:
        for case_name, (N, K) in ACTIVE_SHAPES.items():
            test_cases.append({
                "Case": case_name,
                "M": M,
                "N": N,
                "K": K,
                "dtype": torch.bfloat16,
            })
    return test_cases
```
Can we abstract this to `utils.py`, since it is shared with `benchmark_gemm_fp8`? One possible shape for the shared helper is sketched below.
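A minimal sketch of what a shared helper in `utils.py` could look like; parameterizing the dtype would also address the hard-coded `torch.bfloat16` from the general comments. The function name and signature are assumptions, not the PR's final API:

```python
# utils.py -- hypothetical shared helper for benchmark_gemm and benchmark_gemm_fp8.
import torch

def generate_gemm_test_cases(m_size_list, active_shapes, dtype=torch.bfloat16):
    """Cross product of M sizes and named (N, K) shapes as test-case dicts."""
    return [
        {"Case": case_name, "M": M, "N": N, "K": K, "dtype": dtype}
        for M in m_size_list
        for case_name, (N, K) in active_shapes.items()
    ]
```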
```python
    group_lens[-1] += error
    return group_lens


M_SIZE_LIST = [512, 1024, 2048, 4096]
```
Can we either use the same `M_SIZE_LIST` from `utils.py` or rename it to `GROUPED_GEMM_M_SIZE_LIST` and document why it diverges?
```python
bv = merged.loc[idx, bc]
pv = merged.loc[idx, pc]

if pd.isna(bv) or pd.isna(pv) or bv < 0.5:
```
What is `bv < 0.5` protecting against?
That threshold is there to avoid reporting noisy speedups when the baseline metric is near zero: with a near-zero denominator, the ratio is dominated by measurement noise. I made it more explicit and configurable in 284adda.
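For illustration, a minimal sketch of that kind of guard, assuming a millisecond-scale timing metric; the constant name and default value are assumptions, not necessarily what 284adda implements:

```python
import pandas as pd

MIN_BASELINE_MS = 0.5  # assumed noise floor; below this, ratios are mostly noise

def speedup_ratio(bv, pv, min_baseline=MIN_BASELINE_MS):
    """Base/PR timing ratio, or None when the baseline is too small to trust."""
    if pd.isna(bv) or pd.isna(pv) or bv < min_baseline:
        return None
    return bv / pv
```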
Do we really need this to be a separate benchmark entirely, or can we combine it with benchmark_gemm.py and include it as e.g. a parameterization option?
I kept it separate for now because it has FP8-specific recipe/autocast setup and writes a separate result output.
```python
if method == "blocked":
    return timer.blocked_autorange().mean * 1e3
return timer.adaptive_autorange().mean * 1e3
```
Perhaps we should set `min_run_time` explicitly, like NV does?
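Both autorange variants on `torch.utils.benchmark.Timer` accept `min_run_time`, so it could look roughly like this; the 1.0 s budget is an assumed value, not quoted from NV:

```python
import torch.utils.benchmark as benchmark

def time_ms(fn, method="blocked", min_run_time=1.0):
    """Mean wall time of fn() in milliseconds, with an explicit measurement budget."""
    timer = benchmark.Timer(stmt="fn()", globals={"fn": fn})
    if method == "blocked":
        return timer.blocked_autorange(min_run_time=min_run_time).mean * 1e3
    return timer.adaptive_autorange(min_run_time=min_run_time).mean * 1e3
```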
```
    Column names to pull from each test case into the output row.
metric_columns : list[str]
    Column names to pull from the bench_fn return value.
default_csv : str or None
```
Can we automatically derive this from the file name of the caller? That way we don't have to remember to manually set `default_csv` to what is essentially just the file name.
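One possible sketch, assuming the runner is called directly from the benchmark script; the frame depth is the fragile part and would need to match the real call structure:

```python
import inspect
import os

def _default_csv_from_caller(depth=1):
    """Derive e.g. 'benchmark_gemm.csv' from the calling script's file name."""
    caller_file = inspect.stack()[depth].filename  # depth is an assumption
    stem = os.path.splitext(os.path.basename(caller_file))[0]
    return stem + ".csv"
```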
```python
x = torch.randn((B * M, K), dtype=dtype, device=device, requires_grad=True)
w = torch.randn((B, N, K), dtype=dtype, device=device, requires_grad=True)
group_lens = generate_grouped_gemm_group_lens(B, M, balance=True).to(device)
```
We currently hard-code `balance=True`, but I think you've got infrastructure for general `balance={True, False}`, so do we want to properly parameterize?
I think we should not enable `balance=False` support in this PR yet. In its current form it is not really viable; we would need more data on what that imbalance should look like.
```python
base_df[col] = pd.to_numeric(base_df[col], errors="coerce")
pr_df[col] = pd.to_numeric(pr_df[col], errors="coerce")

merged = base_df.merge(pr_df, on=key_cols, suffixes=("_base", "_pr"), how="inner")
```
how="inner" silently drops PR-only rows (new shapes added mid-run). Perhaps we should use how="outer" with explicit new/deleted-row handling?
Changed to 2026 in 284adda.
Done in 284adda.
Added documentation in utils.py and README.
Added in 284adda.
Added in ca1f442.
Description
See also #487.
PyTorch benchmark timing: https://docs.pytorch.org/tutorials/recipes/recipes/benchmark.html
Open questions:
- Do we need to rebuild the PR branch after perf testing is done?

Partly addresses https://github.com/ROCm/frameworks-internal/issues/15863
Microbenchmarking (not just) for CI.
TODOs:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: