Microbenchmarking, Torch+CSV-based #478
Conversation
Performance Regression Report

MI325 | PR commit: ddd17d4 | Base:
benchmark_attention (median 1.000x, min 0.650x, max 1.390x)
benchmark_casting (median 0.998x, min 0.920x, max 1.034x)
benchmark_gemm (median 1.001x, min 0.322x, max 2.361x)
benchmark_gemm_fp8 (median 0.986x, min 0.285x, max 1.610x)
benchmark_grouped_gemm (median 1.000x, min 0.439x, max 2.438x)
benchmark_normalization (median 1.006x, min 0.633x, max 2.066x)
MI355 | PR commit: ddd17d4 | Base:
benchmark_attention (median 1.000x, min 0.969x, max 1.014x)
benchmark_casting (median 0.998x, min 0.898x, max 1.138x)
benchmark_gemm (median 1.001x, min 0.935x, max 1.094x)
benchmark_gemm_fp8 (median 0.988x, min 0.401x, max 1.038x)
benchmark_grouped_gemm (median 0.999x, min 0.912x, max 1.134x)
benchmark_normalization (median 0.993x, min 0.399x, max 1.490x)
Micky774 left a comment:

A few general comments in addition to the inline ones:

- Regarding copyright, some spots are 2026 only while others are 2025-2026 -- is there a specific reason, or can we be consistent and only set 2026?
- It seems that `dtype=torch.bfloat16` is hard-coded -- can we generalize to allow for e.g. fp16 benchmarks?
- Can we document the `bench_fn` contract so that it's easier for new developers to contribute additional benchmarks?
- Can we have a more general `RECIPES` dict similar to NV's (TransformerEngine/benchmarks/linear/benchmark_grouped_linear.py, lines 60 to 65 in a0b88f4)? A sketch of the idea follows this list.
- Can we add a README.md to document the benchmarks?
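For the `RECIPES` point, a hedged sketch of the kind of dict meant; the entries are illustrative and not a copy of NV's benchmark_grouped_linear.py, with the recipe classes assumed to come from `transformer_engine.common.recipe`:

```python
from transformer_engine.common import recipe

# Benchmark label -> autocast recipe; None means the plain high-precision path.
# Illustrative entries only, not a reproduction of NV's file.
RECIPES = {
    "bf16": None,
    "fp8_delayed_scaling": recipe.DelayedScaling(),
    "fp8_current_scaling": recipe.Float8CurrentScaling(),
    "mxfp8": recipe.MXFP8BlockScaling(),
}
```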
```python
def _generate_gemm_test_cases():
    test_cases = []
    for M in M_SIZE_LIST:
        for case_name, (N, K) in ACTIVE_SHAPES.items():
            test_cases.append({
                "Case": case_name,
                "M": M,
                "N": N,
                "K": K,
                "dtype": torch.bfloat16,
            })
    return test_cases
```
Can we abstract this to `utils.py`, since it is shared with `benchmark_gemm_fp8`? One possible shape for the shared helper is sketched below.
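A minimal sketch of what a shared helper in `utils.py` could look like; parameterizing the dtype would also address the hard-coded `torch.bfloat16` from the general comments. The function name and signature are assumptions, not the PR's final API:

```python
# utils.py -- hypothetical shared helper for benchmark_gemm and benchmark_gemm_fp8.
import torch

def generate_gemm_test_cases(m_size_list, active_shapes, dtype=torch.bfloat16):
    """Cross product of M sizes and named (N, K) shapes as test-case dicts."""
    return [
        {"Case": case_name, "M": M, "N": N, "K": K, "dtype": dtype}
        for M in m_size_list
        for case_name, (N, K) in active_shapes.items()
    ]
```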
```python
    group_lens[-1] += error
    return group_lens


M_SIZE_LIST = [512, 1024, 2048, 4096]
```
Can we either use the same `M_SIZE_LIST` from `utils.py` or rename it to `GROUPED_GEMM_M_SIZE_LIST` and document why it diverges?
```python
bv = merged.loc[idx, bc]
pv = merged.loc[idx, pc]

if pd.isna(bv) or pd.isna(pv) or bv < 0.5:
```
What is `bv < 0.5` protecting against?
That threshold is there to avoid reporting noisy speedups when the baseline metric is near zero: with a near-zero denominator, the ratio is dominated by measurement noise. I made it more explicit and configurable in 284adda.
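For illustration, a minimal sketch of that kind of guard, assuming a millisecond-scale timing metric; the constant name and default value are assumptions, not necessarily what 284adda implements:

```python
import pandas as pd

MIN_BASELINE_MS = 0.5  # assumed noise floor; below this, ratios are mostly noise

def speedup_ratio(bv, pv, min_baseline=MIN_BASELINE_MS):
    """Base/PR timing ratio, or None when the baseline is too small to trust."""
    if pd.isna(bv) or pd.isna(pv) or bv < min_baseline:
        return None
    return bv / pv
```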
Do we really need this to be a separate benchmark entirely, or can we combine it with benchmark_gemm.py and include it as e.g. a parameterization option?
I kept it separate for now because it has FP8-specific recipe/autocast setup and writes a separate result output.
```python
if method == "blocked":
    return timer.blocked_autorange().mean * 1e3
return timer.adaptive_autorange().mean * 1e3
```
Perhaps we should set `min_run_time` explicitly, like NV does?
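Both autorange variants on `torch.utils.benchmark.Timer` accept `min_run_time`, so it could look roughly like this; the 1.0 s budget is an assumed value, not quoted from NV:

```python
import torch.utils.benchmark as benchmark

def time_ms(fn, method="blocked", min_run_time=1.0):
    """Mean wall time of fn() in milliseconds, with an explicit measurement budget."""
    timer = benchmark.Timer(stmt="fn()", globals={"fn": fn})
    if method == "blocked":
        return timer.blocked_autorange(min_run_time=min_run_time).mean * 1e3
    return timer.adaptive_autorange(min_run_time=min_run_time).mean * 1e3
```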
```
    Column names to pull from each test case into the output row.
metric_columns : list[str]
    Column names to pull from the bench_fn return value.
default_csv : str or None
```
Can we automatically derive this from the file name of the caller? That way we don't have to remember to manually set `default_csv` to what is essentially just the file name.
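One possible sketch, assuming the runner is called directly from the benchmark script; the frame depth is the fragile part and would need to match the real call structure:

```python
import inspect
import os

def _default_csv_from_caller(depth=1):
    """Derive e.g. 'benchmark_gemm.csv' from the calling script's file name."""
    caller_file = inspect.stack()[depth].filename  # depth is an assumption
    stem = os.path.splitext(os.path.basename(caller_file))[0]
    return stem + ".csv"
```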
```python
x = torch.randn((B * M, K), dtype=dtype, device=device, requires_grad=True)
w = torch.randn((B, N, K), dtype=dtype, device=device, requires_grad=True)
group_lens = generate_grouped_gemm_group_lens(B, M, balance=True).to(device)
```
We currently hard-code `balance=True`, but I think you've got infrastructure for general `balance={True, False}`, so do we want to properly parameterize?
I think we should not enable `balance=False` support in this PR yet. In its current form it is not really viable; we would need more data on what that imbalance should look like.
```python
base_df[col] = pd.to_numeric(base_df[col], errors="coerce")
pr_df[col] = pd.to_numeric(pr_df[col], errors="coerce")

merged = base_df.merge(pr_df, on=key_cols, suffixes=("_base", "_pr"), how="inner")
```
how="inner" silently drops PR-only rows (new shapes added mid-run). Perhaps we should use how="outer" with explicit new/deleted-row handling?
Changed to 2026 in 284adda.
Done in 284adda.
Added documentation in utils.py and README.
Added in 284adda.
Added in ca1f442.
Description
See also #487.
PyTorch benchmark timing: https://docs.pytorch.org/tutorials/recipes/recipes/benchmark.html
Open questions:
- Do we need to rebuild the PR branch after perf testing is done?

Partly addresses https://github.com/ROCm/frameworks-internal/issues/15863
Microbenchmarking (not just) for CI.
TODOs:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: