perf: fuse adjacent OffloadedStmt range_for tasks at JIT time #37
Open
npoulad1 wants to merge 8 commits into
The Genesis hot path emits many small back-to-back range_for offloads (zero-init, integration, contact, ...). Each is its own GPU dispatch and saturates the AMDGPU command queue even when individual kernels are cheap.
Add an opt-in JIT pass that merges adjacent range_for OffloadedStmts with identical launch bounds and no inter-task races.
Safety:
* Disjoint resources: A and B touch unrelated ndarrays / snodes / gtmp slots.
* Same-thread RAW: every shared-resource access in either body has the same per-thread, non-indirect address fingerprint, so thread T touches the same byte in both bodies and no other thread races against it.
Runs after FixCrossOffloadReferences so OffloadedStmt's gtmp offsets are final when the dynamic-bound matcher runs.
Gated by QD_FUSE_TASKS (off by default); QD_FUSE_TASKS_RAW=0 is a kill switch for the RAW relaxation. QD_FUSE_TASKS_DIAG=1 emits per-kernel pass diagnostics.
Co-authored-by: Cursor <cursoragent@cursor.com>
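As a rough illustration of the disjoint-resources rule described above, here is a toy Python model of the merge step. The `Task` class, `can_fuse`, and `fuse_adjacent` are hypothetical names invented for this sketch; the real pass operates on OffloadedStmt IR in C++ and its actual structure is not shown in this PR text. The same-thread RAW rule is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for one range_for OffloadedStmt."""
    name: str
    begin: int
    end: int
    resources: frozenset  # names of ndarrays / snodes / gtmp slots touched
    body: list = field(default_factory=list)


def can_fuse(a, b):
    # Rule sketched here: identical launch bounds AND no shared resources.
    same_bounds = (a.begin, a.end) == (b.begin, b.end)
    return same_bounds and a.resources.isdisjoint(b.resources)


def fuse_adjacent(tasks):
    """Greedily merge each task into its predecessor when that is allowed."""
    fused = []
    for t in tasks:
        if fused and can_fuse(fused[-1], t):
            prev = fused.pop()
            t = Task(prev.name + "+" + t.name, prev.begin, prev.end,
                     prev.resources | t.resources, prev.body + t.body)
        fused.append(t)
    return fused


tasks = [
    Task("zero_init", 0, 1024, frozenset({"force"}), ["force[i] = 0"]),
    Task("integrate", 0, 1024, frozenset({"pos", "vel"}), ["pos[i] += vel[i]"]),
    Task("contact", 0, 1024, frozenset({"force", "pos"}), ["force[i] += pos[i]"]),
    Task("reduce", 0, 64, frozenset({"energy"}), ["energy[i] = 0"]),
]
names = [t.name for t in fuse_adjacent(tasks)]
print(names)
```

In this toy input, `zero_init` and `integrate` merge because they share bounds and touch unrelated arrays; `contact` stays separate because it shares `force` and `pos` with the merged task, and `reduce` stays separate because its bounds differ.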
The cuda_graph dispatch path relies on each top-level for-loop producing its own OffloadedStmt so the graph manager can capture them as graph nodes. The fusion pass added in 48b4eed was collapsing those tasks and breaking test_cuda_graph_* expectations.
Plumb the @qd.kernel(cuda_graph=True) annotation onto the C++ Kernel during materialize() so compile-time passes can see it, then have fuse_offloaded_tasks return early when the owning kernel opted into cuda_graph.
Also relax an incidental _num_offloaded_tasks() >= 2 assertion in test_no_cuda_graph_annotation: the test's docstring is about the graph dispatch path; the post-JIT task count for two disjoint loops is no longer part of its contract.
Co-authored-by: Cursor <cursoragent@cursor.com>
Flip the QD_FUSE_TASKS default from off to on so users get the fusion + same-thread RAW lift out of the box. The env var becomes a kill switch instead of an opt-in: set QD_FUSE_TASKS=0 (or off/false/no) to bypass the pass entirely. QD_FUSE_TASKS_RAW=0 remains available to keep fusion on while disabling the same-thread RAW relaxation only. Comment updates only on the gating; no behavior change for callers that were already setting QD_FUSE_TASKS=1. Co-authored-by: Cursor <cursoragent@cursor.com>
Apply pre-commit clang-format (mirrors-clang-format v19.1.7) to fix CI lint failure. Whitespace-only. Co-authored-by: Cursor <cursoragent@cursor.com>
Author
/run-ci
The same-thread RAW relaxation only admitted fusion when every access to a shared resource "depended on the loop index" (AccessFp::has_loop_idx). That check was too weak: non-injective expressions like arr[i // 2], arr[i % 2], arr[min(i, K)], or arr[i & 1] syntactically contain a LoopIndexStmt but collapse multiple iterations onto the same byte. Pre-fusion the kernel boundary serialised the resulting cross-thread race; post-fusion same-thread RAW substituted each thread's local write for the cross-thread shared read, changing observable semantics.
fingerprint_value now tracks injectivity per-node using local bools rather than OR-cumulating across the whole subtree, so (i + 1) / 2 is correctly classified as non-injective even though (i + 1) is. The per-operator whitelist admits LoopIndexStmt, add/sub/mul/bit_shl/bit_xor with compile-time constants (and non-zero for mul), neg, bit_not, and cast_bits; everything else (div, mod, max, min, bit_and, bit_or, bit_shr/sar, cast_value, cmp_*, math ops) is rejected. AccessFp::has_loop_idx is renamed to is_injective_in_loop_idx to reflect the stronger guarantee.
Benchmarks at n_envs=8192 / 500 steps on Genesis show no regression (1.90M vs 1.79M baseline env-steps/s).
Co-authored-by: Cursor <cursoragent@cursor.com>
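The per-node injectivity rule in the commit message above can be sketched on a toy expression tree. Everything here is an assumption made for illustration: the node classes, the operator spellings, and the choice to require the constant on the right-hand side of bit_shl. The real fingerprint_value is C++ and is not reproduced in this PR text.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LoopIndex:
    """The loop variable i (stand-in for LoopIndexStmt)."""


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str
    lhs: "Expr"
    rhs: "Expr"


@dataclass(frozen=True)
class UnOp:
    op: str
    operand: "Expr"


Expr = Union[LoopIndex, Const, BinOp, UnOp]

# Whitelist taken from the commit message; anything else is rejected.
INJECTIVE_BINOPS = {"add", "sub", "mul", "bit_shl", "bit_xor"}
INJECTIVE_UNOPS = {"neg", "bit_not", "cast_bits"}


def is_injective_in_loop_idx(e):
    """Decide injectivity per node instead of OR-ing over the subtree."""
    if isinstance(e, LoopIndex):
        return True
    if isinstance(e, UnOp):
        return e.op in INJECTIVE_UNOPS and is_injective_in_loop_idx(e.operand)
    if isinstance(e, BinOp):
        if e.op not in INJECTIVE_BINOPS:
            return False  # div, mod, min, max, bit_and, ... collapse iterations
        if isinstance(e.rhs, Const):
            var, const = e.lhs, e.rhs
        elif isinstance(e.lhs, Const) and e.op != "bit_shl":
            # Assumption of this sketch: only the shift amount may be constant.
            var, const = e.rhs, e.lhs
        else:
            return False  # no compile-time constant operand
        if e.op == "mul" and const.value == 0:
            return False  # i * 0 maps every iteration to the same address
        return is_injective_in_loop_idx(var)
    return False  # constants such as arr[0] and unknown nodes


i = LoopIndex()
i_plus_1 = BinOp("add", i, Const(1))
print(is_injective_in_loop_idx(i_plus_1))                       # True
print(is_injective_in_loop_idx(BinOp("div", i_plus_1, Const(2))))  # False
```

The second call shows the case the commit message singles out: `(i + 1)` is injective, but wrapping it in a division makes the whole expression non-injective, which a subtree-wide "contains the loop index" check would miss.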
Match the pre-commit hook output: collapse the parenthesised assert message strings onto a single line. Co-authored-by: Cursor <cursoragent@cursor.com>
…f by default)
The pass is now opt-in: QD_FUSE_TASKS / QD_FUSE_TASKS_RAW /
QD_FUSE_TASKS_DIAG are renamed to QD_AGGRESSIVE_KERNEL_FUSION{,_RAW,
_DIAG}, and the master knob's default flips from on to off. With the
master knob unset the offload pass returns immediately, restoring the
historical no-fusion pipeline; workloads that have validated fusion
under their own correctness and performance criteria can opt in by
exporting QD_AGGRESSIVE_KERNEL_FUSION=1 (truthy values: 1, on, true,
yes; case-insensitive).
The pass currently has no register-pressure / LDS / I-cache cost model
and the same-thread RAW relaxation is a syntactic safety check rather
than a real loop-dependence analysis, so making it opt-in keeps the
default code path unchanged for general workloads while still letting
Genesis (currently the only validated user) benefit from it. A real
cost heuristic and an opt-out kernel attribute can be added on top of
this gate without further surgery to the default pipeline.
The new test file enables the env var at module load (before any
kernel compiles in the same pytest process) so the regression coverage
for the injectivity check still exercises the fusion code path.
Co-authored-by: Cursor <cursoragent@cursor.com>
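The truthy-value rule described in the commit message above (1, on, true, yes; case-insensitive) could be parsed along these lines. This is only a sketch: `env_flag` is a hypothetical helper, and both the whitespace stripping and the treatment of every other value as false are assumptions, since the actual parsing lives in the C++ pass and is not shown here.

```python
import os

_TRUTHY = {"1", "on", "true", "yes"}


def env_flag(name, default=False, environ=None):
    """Return True when the variable holds a truthy value (case-insensitive)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default  # unset: fall back to the knob's default
    return raw.strip().lower() in _TRUTHY


# The master knob defaults to off; the RAW knob defaults to on.
env = {"QD_AGGRESSIVE_KERNEL_FUSION": "Yes",
       "QD_AGGRESSIVE_KERNEL_FUSION_RAW": "0"}
fusion_on = env_flag("QD_AGGRESSIVE_KERNEL_FUSION", environ=env)
raw_on = env_flag("QD_AGGRESSIVE_KERNEL_FUSION_RAW", default=True, environ=env)
print(fusion_on, raw_on)
```

A test module that wants the pass enabled would set `os.environ["QD_AGGRESSIVE_KERNEL_FUSION"] = "1"` at import time, before any kernel is compiled in the same process, as the commit message describes.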
The QD_AGGRESSIVE_KERNEL_FUSION* env-var names pushed three comment blocks over 80 columns; clang-format rewraps them. No code changes. Co-authored-by: Cursor <cursoragent@cursor.com>
When a Quadrants kernel has multiple top-level range_for loops, each becomes its own GPU dispatch. On AMDGPU, the command queue is the throughput bottleneck. This PR adds a JIT pass that fuses adjacent range_for tasks with matching launch bounds into a single dispatch.
Safety. A pair fuses iff one of:
1. Disjoint resources: the two tasks touch unrelated ndarrays / snodes / gtmp slots.
2. Same-thread RAW: every access to a shared resource has the same per-thread, non-indirect address fingerprint, built only from operations that are injective in the loop index (LoopIndexStmt, add/sub-by-const, mul-by-non-zero-const, shl/xor-by-const, neg, bit_not, bitcast). Non-injective patterns like arr[i // 2], arr[i % 2], arr[min(i, K)], arr[0] are explicitly rejected; cross-iteration patterns like arr[i] = …; … = arr[i-1] are rejected by the single-fingerprint gate.
Both gates conservatively reject anything they don't recognise. The pass runs after FixCrossOffloadReferences so dynamic gtmp-resolved bounds are populated.
cuda_graph carve-out. @qd.kernel(cuda_graph=True) requires each top-level for-loop to keep its own OffloadedStmt so the graph manager can capture it as a node. The annotation is now mirrored onto Kernel during materialize() and the fusion pass bails out when it's set.
Knobs.
* QD_AGGRESSIVE_KERNEL_FUSION=1: enable the pass.
* QD_AGGRESSIVE_KERNEL_FUSION_RAW=0: keep fusion on, disable rule 2 only.
* QD_AGGRESSIVE_KERNEL_FUSION_DIAG=1: print per-kernel fusion decisions and reject reasons to stderr.
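To see why a non-injective index such as arr[i // 2] has to be rejected, the following Python simulation compares an unfused pair of loops with a fused version in which each thread reads back its own write. It is a simplified model, not the pass itself: sequential Python loops stand in for GPU threads, and the function names are invented for this sketch.

```python
def run_unfused(n, index):
    """Two separate dispatches: all writes finish before any read starts."""
    arr = [0] * n
    out = [0] * n
    for i in range(n):          # kernel A: every thread writes
        arr[index(i)] = i + 100
    for i in range(n):          # kernel B: starts after kernel A completes
        out[i] = arr[index(i)]
    return out


def run_fused(n, index):
    """One dispatch: thread i sees only its own write (same-thread RAW)."""
    arr = [0] * n
    out = [0] * n
    for i in range(n):
        value = i + 100
        arr[index(i)] = value
        out[i] = value          # read satisfied by the thread's own store
    return out


def identity(i):
    return i


def halved(i):
    return i // 2


n = 4
print(run_unfused(n, identity), run_fused(n, identity))
print(run_unfused(n, halved), run_fused(n, halved))
```

With the injective index both versions agree. With `i // 2`, two threads share each slot, so the unfused version lets both readers observe whichever write landed last, while the fused version gives every thread its own value; the results differ, which is the semantic change the injectivity gate is meant to rule out.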