perf: fuse adjacent OffloadedStmt range_for tasks at JIT time by npoulad1 · Pull Request #37 · ROCm/quadrants

npoulad1 · 2026-05-12T03:09:14Z

When a Quadrants kernel has multiple top-level range_for loops, each becomes its own GPU dispatch. On AMDGPU the command queue then becomes the throughput bottleneck. This PR adds a JIT pass that fuses adjacent range_for tasks with matching launch bounds into a single dispatch.

Safety. A pair fuses iff one of the following rules holds:

1. Disjoint resources: A and B touch entirely different ndarrays / SNodes / gtmp slots.
2. Same-thread RAW: every access to a shared resource fingerprints to a single per-thread address that is a provably injective function of the loop index (only LoopIndexStmt, add/sub-by-const, mul-by-non-zero-const, shl/xor-by-const, neg, bit_not, bitcast). Non-injective patterns like arr[i // 2], arr[i % 2], arr[min(i, K)], arr[0] are explicitly rejected; cross-iteration patterns like arr[i] = …; … = arr[i-1] are rejected by the single-fingerprint gate.

Both gates conservatively reject anything they don't recognise. The pass runs after FixCrossOffloadReferences so dynamic gtmp-resolved bounds are populated.
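The two gates can be sketched as a small decision function. This is a minimal Python model, not the C++ pass itself: `TaskAccesses`, `can_fuse`, the string fingerprints, and the use of `None` for "not provably injective" are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAccesses:
    """Hypothetical summary of one range_for task's memory accesses.

    Maps a resource name (ndarray / SNode / gtmp slot) to the set of address
    fingerprints seen for it. A fingerprint of None stands for an access that
    is not provably injective in the loop index.
    """
    by_resource: dict = field(default_factory=dict)


def can_fuse(a, b, allow_raw=True):
    """Return (decision, reason) for fusing adjacent tasks a and b."""
    shared = a.by_resource.keys() & b.by_resource.keys()
    if not shared:
        return True, "disjoint resources"          # rule 1
    if not allow_raw:
        return False, "shared resources and RAW rule disabled"
    for res in sorted(shared):
        fps = a.by_resource[res] | b.by_resource[res]
        if None in fps:
            return False, f"{res}: non-injective access"
        if len(fps) != 1:
            # e.g. arr[i] in one body and arr[i-1] in the other
            return False, f"{res}: multiple fingerprints"
    return True, "same-thread RAW"                 # rule 2


write_x = TaskAccesses({"x": {"i"}})
read_x_shifted = TaskAccesses({"x": {"i-1"}})
print(can_fuse(write_x, read_x_shifted))
```

Anything the model does not recognise would be encoded as `None`, which mirrors the conservative-reject behaviour described above.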

cuda_graph carve-out. @qd.kernel(cuda_graph=True) requires each top-level for-loop to keep its own OffloadedStmt so the graph manager can capture it as a node. The annotation is now mirrored onto Kernel during materialize() and the fusion pass bails out when it's set.
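As a rough illustration of the carve-out, the sketch below models the pass as a greedy merge over a task list that returns the input unchanged when the kernel opted into cuda_graph. `Task`, `fuse_adjacent`, and the `safe` callback are hypothetical; the real pass works on OffloadedStmt IR and also matches dynamic bounds, which this model ignores.

```python
from dataclasses import dataclass


@dataclass
class Task:
    begin: int
    end: int
    bodies: list  # loop bodies that run under this single dispatch


def fuse_adjacent(tasks, cuda_graph=False, safe=lambda a, b: True):
    """Greedily merge neighbouring tasks with identical launch bounds."""
    # cuda_graph kernels need one task per top-level loop, so bail out early.
    if cuda_graph or len(tasks) < 2:
        return list(tasks)
    fused = [Task(tasks[0].begin, tasks[0].end, list(tasks[0].bodies))]
    for t in tasks[1:]:
        last = fused[-1]
        if (last.begin, last.end) == (t.begin, t.end) and safe(last, t):
            last.bodies.extend(t.bodies)   # same dispatch, bodies in order
        else:
            fused.append(Task(t.begin, t.end, list(t.bodies)))
    return fused


tasks = [Task(0, 8, ["zero_init"]), Task(0, 8, ["integrate"]), Task(0, 4, ["contact"])]
print(len(fuse_adjacent(tasks)))                   # dispatches after fusion
print(len(fuse_adjacent(tasks, cuda_graph=True)))  # untouched task list
```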

Knobs.

QD_AGGRESSIVE_KERNEL_FUSION=1 — enable the pass.
QD_AGGRESSIVE_KERNEL_FUSION_RAW=0 — keep fusion on, disable rule 2 only.
QD_AGGRESSIVE_KERNEL_FUSION_DIAG=1 — print per-kernel fusion decisions and reject reasons to stderr.

The Genesis hot path emits many small back-to-back range_for offloads (zero-init, integration, contact, ...). Each is its own GPU dispatch and saturates the AMDGPU command queue even when individual kernels are cheap. Add an opt-in JIT pass that merges adjacent range_for OffloadedStmts with identical launch bounds and no inter-task races.

Safety:
* Disjoint resources: A and B touch unrelated ndarrays / snodes / gtmp slots.
* Same-thread RAW: every shared-resource access in either body has the same per-thread, non-indirect address fingerprint, so thread T touches the same byte in both bodies and no other thread races against it.

Runs after FixCrossOffloadReferences so OffloadedStmt's gtmp offsets are final when the dynamic-bound matcher runs.

Gated by QD_FUSE_TASKS (off by default); QD_FUSE_TASKS_RAW=0 is a kill switch for the RAW relaxation. QD_FUSE_TASKS_DIAG=1 emits per-kernel pass diagnostics.

Co-authored-by: Cursor <cursoragent@cursor.com>
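The same-thread argument in this commit message (thread T touches the same byte in both bodies) can be checked on a toy model. The sketch below is an assumption-laden illustration: it replaces GPU threads with a sequential Python loop, and `run_unfused` / `run_fused` are hypothetical helpers, not part of Quadrants. Task A writes `arr[addr(i)] = i` and task B reads `out[i] = arr[addr(i)]`.

```python
def run_unfused(n, addr):
    """Two dispatches: all of task A finishes before task B starts."""
    arr = [0] * n
    out = [0] * n
    for i in range(n):            # task A
        arr[addr(i)] = i
    for i in range(n):            # task B
        out[i] = arr[addr(i)]
    return out


def run_fused(n, addr):
    """One dispatch: each 'thread' i runs body A then body B."""
    arr = [0] * n
    out = [0] * n
    for i in range(n):
        arr[addr(i)] = i
        out[i] = arr[addr(i)]
    return out


n = 4
# Injective address: every thread owns its own slot, so fusion is invisible.
print(run_unfused(n, lambda i: i) == run_fused(n, lambda i: i))
# Non-injective address: two threads share a slot, so results diverge.
print(run_unfused(n, lambda i: i // 2), run_fused(n, lambda i: i // 2))
```

With `addr = i // 2` the unfused version reads whatever the last writer left behind, while the fused version reads each thread's own write, which is the kind of semantic change the safety gates are meant to rule out.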

The cuda_graph dispatch path relies on each top-level for-loop producing its own OffloadedStmt so the graph manager can capture them as graph nodes. The fusion pass added in 48b4eed was collapsing those tasks and breaking test_cuda_graph_* expectations.

Plumb the @qd.kernel(cuda_graph=True) annotation onto the C++ Kernel during materialize() so compile-time passes can see it, then have fuse_offloaded_tasks return early when the owning kernel opted into cuda_graph.

Also relax an incidental _num_offloaded_tasks() >= 2 assertion in test_no_cuda_graph_annotation: the test's docstring is about the graph dispatch path; the post-JIT task count for two disjoint loops is no longer part of its contract.

Co-authored-by: Cursor <cursoragent@cursor.com>

Flip the QD_FUSE_TASKS default from off to on so users get the fusion + same-thread RAW lift out of the box. The env var becomes a kill switch instead of an opt-in: set QD_FUSE_TASKS=0 (or off/false/no) to bypass the pass entirely. QD_FUSE_TASKS_RAW=0 remains available to keep fusion on while disabling the same-thread RAW relaxation only. Comment updates only on the gating; no behavior change for callers that were already setting QD_FUSE_TASKS=1. Co-authored-by: Cursor <cursoragent@cursor.com>

Apply pre-commit clang-format (mirrors-clang-format v19.1.7) to fix CI lint failure. Whitespace-only. Co-authored-by: Cursor <cursoragent@cursor.com>

npoulad1 · 2026-05-12T17:38:42Z

/run-ci

The same-thread RAW relaxation only admitted fusion when every access to a shared resource "depended on the loop index" (AccessFp::has_loop_idx). That check was too weak: non-injective expressions like arr[i // 2], arr[i % 2], arr[min(i, K)], or arr[i & 1] syntactically contain a LoopIndexStmt but collapse multiple iterations onto the same byte. Pre-fusion the kernel boundary serialised the resulting cross-thread race; post-fusion same-thread RAW substituted each thread's local write for the cross-thread shared read, changing observable semantics.

fingerprint_value now tracks injectivity per-node using local bools rather than OR-cumulating across the whole subtree, so (i + 1) / 2 is correctly classified as non-injective even though (i + 1) is. The per-operator whitelist admits LoopIndexStmt, add/sub/mul/bit_shl/bit_xor with compile-time constants (and non-zero for mul), neg, bit_not, and cast_bits; everything else (div, mod, max, min, bit_and, bit_or, bit_shr/sar, cast_value, cmp_*, math ops) is rejected. AccessFp::has_loop_idx is renamed to is_injective_in_loop_idx to reflect the stronger guarantee.

Benchmarks at n_envs=8192 / 500 steps on Genesis show no regression (1.90M vs 1.79M baseline env-steps/s).

Co-authored-by: Cursor <cursoragent@cursor.com>
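The difference between the old "contains the loop index" check and the per-node injectivity check can be sketched on a tiny expression tree. This is a hypothetical Python model of the whitelist described above; the tuple encoding and the names `is_injective` and `contains_idx` are assumptions, and the real fingerprint_value operates on Quadrants IR statements.

```python
IDX = ("idx",)                      # stands in for LoopIndexStmt


def const(c):
    return ("const", c)


UNARY_OK = {"neg", "bit_not", "cast_bits"}
BINARY_OK = {"add", "sub", "mul", "bit_shl", "bit_xor"}


def is_injective(e):
    """Per-node check: True only if e is provably injective in the loop index."""
    op = e[0]
    if op == "idx":
        return True
    if op in UNARY_OK:
        return is_injective(e[1])
    if op in BINARY_OK:
        lhs, rhs = e[1], e[2]
        if rhs[0] == "const" and is_injective(lhs):
            return op != "mul" or rhs[1] != 0
        # const on the left is accepted too, except for shifts (c << i)
        if lhs[0] == "const" and is_injective(rhs) and op != "bit_shl":
            return op != "mul" or lhs[1] != 0
    return False                    # constants, div, mod, min, bit_and, ...


def contains_idx(e):
    """The weaker, OR-cumulating check: does the subtree mention the index?"""
    return e[0] == "idx" or any(contains_idx(x) for x in e[1:] if isinstance(x, tuple))


i_plus_1 = ("add", IDX, const(1))
halved = ("div", i_plus_1, const(2))
print(contains_idx(halved), is_injective(i_plus_1), is_injective(halved))
```

The model is deliberately conservative: an expression such as `i + i` is rejected because neither operand is a constant, even though it happens to be injective.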

Match the pre-commit hook output: collapse the parenthesised assert message strings onto a single line. Co-authored-by: Cursor <cursoragent@cursor.com>

…f by default)

The pass is now opt-in: QD_FUSE_TASKS / QD_FUSE_TASKS_RAW / QD_FUSE_TASKS_DIAG are renamed to QD_AGGRESSIVE_KERNEL_FUSION{,_RAW, _DIAG}, and the master knob's default flips from on to off. With the master knob unset the offload pass returns immediately, restoring the historical no-fusion pipeline; workloads that have validated fusion under their own correctness and performance criteria can opt in by exporting QD_AGGRESSIVE_KERNEL_FUSION=1 (truthy values: 1, on, true, yes; case-insensitive).

The pass currently has no register-pressure / LDS / I-cache cost model and the same-thread RAW relaxation is a syntactic safety check rather than a real loop-dependence analysis, so making it opt-in keeps the default code path unchanged for general workloads while still letting Genesis (currently the only validated user) benefit from it. A real cost heuristic and an opt-out kernel attribute can be added on top of this gate without further surgery to the default pipeline.

The new test file enables the env var at module load (before any kernel compiles in the same pytest process) so the regression coverage for the injectivity check still exercises the fusion code path.

Co-authored-by: Cursor <cursoragent@cursor.com>
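The opt-in gate described in this commit can be approximated in a few lines of Python. This is only a sketch of the stated rule (truthy values 1, on, true, yes, case-insensitive, off when unset); `env_flag` is a hypothetical helper and the actual parsing lives in the C++ pass.

```python
import os

_TRUTHY = {"1", "on", "true", "yes"}


def env_flag(name, environ=None):
    """Return True only if the variable is set to a recognised truthy value."""
    env = os.environ if environ is None else environ
    return env.get(name, "").lower() in _TRUTHY


# Unset means the pass is skipped; any casing of a truthy value opts in.
print(env_flag("QD_AGGRESSIVE_KERNEL_FUSION", {}))
print(env_flag("QD_AGGRESSIVE_KERNEL_FUSION", {"QD_AGGRESSIVE_KERNEL_FUSION": "YES"}))
```

A test module that wants the fusion path exercised would need to set the variable in `os.environ` at import time, before any kernel compiles in the same process, as the commit message notes.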

The QD_AGGRESSIVE_KERNEL_FUSION* env-var names pushed three comment blocks over 80 columns; clang-format rewraps them. No code changes. Co-authored-by: Cursor <cursoragent@cursor.com>

npoulad1 and others added 4 commits May 12, 2026 01:04

style: clang-format fuse_offloaded_tasks.cpp

76343d1


npoulad1 and others added 4 commits May 13, 2026 19:50

style: black-format test_fuse_offloaded_tasks.py

da36796


style: clang-format reflow comments around renamed env var

87efa83


npoulad1 wants to merge 8 commits into amd-integration from perf/npoulad/fuse-offloaded-tasks
