[BREAKING][GPU] New QIPC ops for subgroups#676
Open
hughperkins wants to merge 112 commits into
Aligns the subgroup scope with `block.sync()` and the planned
`block.mem_fence()` / `grid.mem_fence()` naming. The old names remain
as thin aliases that forward to the new ones and emit a
DeprecationWarning on first use (per-alias one-shot guard, plus the
existing `warnings.filterwarnings("once", DeprecationWarning, ...)`
in `quadrants.lang.misc`).
Updates `docs/source/user_guide/subgroup.md` to describe the renames
as done (with deprecation aliases) rather than planned.
Brings the four previously partial / TODO data-movement ops up to full
CUDA + AMDGPU + SPIR-V coverage:
* shuffle_up: add CUDA + AMDGPU lowerings.
- CUDA: new `cuda_shuffle_up_{i32,f32,i64,f64}` runtime helpers in
runtime_module/runtime.cpp (mirroring `cuda_shuffle_down_*`), built
on the already-patched `cuda_shfl_up_sync_{i32,f32}` NVVM intrinsics.
Codegen branch + `emit_cuda_shuffle_up` in codegen/cuda/codegen_cuda.cpp.
- AMDGPU: new `amdgpu_shuffle_up_{i32,f32,i64,f64}` runtime helpers
using the existing `ds_bpermute` path (same FIXME re: DPP fast-path
as `shuffle_down`). Codegen branch + `emit_amdgpu_shuffle_up`.
* shuffle_xor and broadcast_first: replace TODO `pass` stubs with
portable `@qd.func` wrappers that inline into the calling kernel:
- `shuffle_xor(value, mask)` ≡ `shuffle(value, u32(lane) ^ mask)`
- `broadcast_first(value)` ≡ `broadcast(value, u32(0))`
No backend codegen / runtime changes required: every backend that
lowers `shuffle` / `broadcast` now lowers these too.
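The wrapper identities are easy to sanity-check in plain Python. A minimal lane-array model (hypothetical helper names, not the qd API, which operates on per-lane values inside a kernel):

```python
# Model a subgroup as a list of per-lane values; each op maps the list
# to a new list, one entry per lane.

def shuffle(values, src_lanes):
    """Each lane i reads the value held by lane src_lanes[i]."""
    return [values[src_lanes[i]] for i in range(len(values))]

def shuffle_xor(values, mask):
    """shuffle_xor(value, mask) == shuffle(value, lane ^ mask)."""
    return shuffle(values, [i ^ mask for i in range(len(values))])

def broadcast_first(values):
    """broadcast_first(value) == broadcast(value, 0): every lane reads lane 0."""
    return [values[0]] * len(values)

lanes = list(range(8))
print(shuffle_xor(lanes, 1))   # adjacent lanes swap: [1, 0, 3, 2, 5, 4, 7, 6]
print(broadcast_first(lanes))  # [0, 0, 0, 0, 0, 0, 0, 0]
```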
Tests:
* test_subgroup_shuffle_up (mirror of test_subgroup_shuffle_down)
* test_subgroup_shuffle_xor (uses the new wrapper directly; the
existing `_pattern` test continues to verify the manual emulation)
* test_subgroup_broadcast_first
Doc: refresh `docs/source/user_guide/subgroup.md` data-movement
support matrix + per-op semantics + performance notes to reflect
universal coverage. Drop the now-stale "fail to link on CUDA / AMDGPU"
paragraph from the `shuffle_up` section.
Adds the missing test coverage for the rename half of this PR:
* test_subgroup_sync (vulkan): smoke test that subgroup.sync() — the renamed subgroup.barrier() — traces and runs.
* test_subgroup_mem_fence (vulkan): same for subgroup.mem_fence().
* test_subgroup_barrier_deprecation_warn_once: pure-Python unit test asserting that subgroup.barrier() emits exactly one DeprecationWarning across multiple calls and forwards to sync(); monkeypatches sync to a no-op so no kernel context is required and the test runs on every arch.
* test_subgroup_memory_barrier_deprecation_warn_once: mirror for subgroup.memory_barrier() / subgroup.mem_fence().
… + SPIR-V

The data-movement ops in qd.simt.subgroup require uniform control flow with all lanes active (already documented in subgroup.md). Under that contract subgroups (warps / waves) execute in lockstep on CUDA and AMDGPU, so an intra-subgroup control barrier or memory fence is a no-op on those backends. The SPIR-V backend keeps the real OpControlBarrier / OpMemoryBarrier emission because Vulkan / Metal subgroups can diverge.

Lower subgroupBarrier / subgroupMemoryBarrier to a placeholder i32 0 (matching the SPIR-V codegen's return convention) in the CUDA and AMDGPU codegen, so calling subgroup.sync() / subgroup.mem_fence() from a kernel succeeds on every GPU backend.

The smoke tests for sync() / mem_fence() are now arch=qd.gpu rather than arch=qd.vulkan and confirm tracing + running on each backend.

Doc: matrix updated to yes/yes/yes (with a footnote explaining the no-op-on-CUDA/AMDGPU semantics), and the per-op section rewritten to describe the universal lowering.
…+ AMDGPU + SPIR-V"

This reverts commit 233b08c.

The "no-op on CUDA / AMDGPU" lowering conflated control-flow lockstep with memory ordering. The two are not equivalent:

* `sync()` (control barrier) under our uniform-CF + all-lanes-active contract really is a no-op on CUDA / AMDGPU, because warps / waves are already at the same program point. That part was defensible.
* `mem_fence()` (memory fence) is NOT a no-op. Lockstep execution does not order memory operations: the compiler may reorder loads / stores across the call, and the SM may buffer writes. A correct CUDA lowering would need at minimum an LLVM `fence` intrinsic with the appropriate scope (or `__threadfence_block()` as an over-strict fallback). That was not done.

Rather than ship a half-correct lowering, restore the previous status: both ops remain SPIR-V only, the doc keeps its original "warps are lockstep, these are typically unnecessary; use __syncwarp under divergent control flow" guidance, and the smoke tests stay on arch=qd.vulkan. Implementing real CUDA / AMDGPU lowerings can be a separate, properly-thought-through change.
…GPU + SPIR-V
Replaces the earlier (reverted) attempt that lowered these to no-ops on CUDA / AMDGPU
"because warps are lockstep", which was wrong about what the user contract guarantees:
sync() must reconverge lanes that have been split by independent thread scheduling
(Volta+) and mem_fence() must actually order memory. This change wires real backend
primitives into the lowering and fixes a long-standing SPIR-V mem_fence() bug.
Per-backend lowerings
---------------------
sync() (subgroupBarrier):
* SPIR-V : already correct - OpControlBarrier(Subgroup, Subgroup, 0).
* CUDA : warp_barrier(0xFFFFFFFF), reusing the existing runtime helper that is
patched to llvm.nvvm.bar.warp.sync (i.e. __syncwarp). This is the
precise warp-scope reconvergence primitive Volta+ needs and is a no-op
under uniform CF on Pascal.
* AMDGPU : llvm.amdgcn.wave.barrier - LLVM's wave-scope sync primitive. Acts as a
compiler reordering barrier on GCN (lockstep) and emits a real wave
barrier on RDNA where waves can span multiple SIMDs.
mem_fence() (subgroupMemoryBarrier):
* SPIR-V : was emitting OpMemoryBarrier(Subgroup, 0). The Memory Semantics operand
must have an ordering bit AND at least one storage class, so 0 is
invalid; drivers that accept it treat the instruction as a no-op. Now
emits AcquireRelease | UniformMemory | WorkgroupMemory, matching what
workgroupMemoryBarrier does (just at Subgroup scope).
* CUDA : block_memfence(), patched to llvm.nvvm.membar.cta (__threadfence_block).
Workgroup-scope, hence over-strict for the subgroup-scope ask but
correct - a CTA-scope fence orders memory across the whole CTA, of
which the subgroup is a strict subset.
* AMDGPU : LLVM 'fence syncscope("workgroup") seq_cst' - lowers to the appropriate
s_waitcnt / cache-flush sequence. Same workgroup-scope over-strictness
note.
Tests
-----
test_subgroup_sync and test_subgroup_mem_fence flip from arch=qd.vulkan to
arch=qd.gpu and now run on every GPU backend. They are smoke tests: they verify
the kernel traces, codegens, and runs without error. We do not attempt to
construct a producer/consumer race that only the fence makes legal - that kind of
test is hard to write portably and easy to make flaky.
Doc updates
-----------
The Identification-and-control table now shows yes for sync() / mem_fence() across
all backends, with a footnote on mem_fence() pointing out the workgroup-scope
over-strictness on CUDA / AMDGPU. The semantics section spells out the per-backend
lowering and the uniform-CF caller contract.
…s CUDA + AMDGPU + SPIR-V
Closes the last two `no` cells in the Identification-and-control matrix in subgroup.md.
Both ops now lower correctly on every GPU backend.
group_size()
------------
* CUDA: returns the static constant 32 (warp size on every supported NVIDIA arch).
* AMDGPU: emits llvm.amdgcn.wavefrontsize; the AMDGPU backend folds it to 32 or 64
based on the function's +wavefrontsize32/+wavefrontsize64 target feature.
* SPIR-V: unchanged - was already querying OpSubgroupSize.
elect()
-------
Reimplemented as a @qd.func wrapper:
@func
def elect():
return i32(invocation_id() == 0)
Inlines at trace time into compare + zext on every backend. Replaces the SPIR-V-only
OpGroupNonUniformElect path with a portable definition.
Semantic change worth flagging
------------------------------
OpGroupNonUniformElect is allowed to elect any *active* lane and may pick a different
lane on different invocations. The new wrapper deterministically elects lane 0.
Under qd.simt.subgroup's documented uniform-CF + all-lanes-active contract this is
strictly compatible (lane 0 is always active and is a legal SPIR-V choice), and it
makes the behaviour identical across backends. Grepped the codebase before changing -
no internal caller depends on the broader OpGroupNonUniformElect semantics.
Tests
-----
* test_subgroup_group_size: every lane writes group_size() into a buffer; the result
must be uniform across lanes and in {32, 64}.
* test_subgroup_elect: writes elect(), invocation_id(), and group_size() into per-lane
slots, then asserts (a) elect() is in {0, 1}, (b) elected lanes are exactly the
invocation_id == 0 lanes, and (c) the elected count equals N / group_size.
Both parametrized over arch=qd.gpu so they run on every available GPU backend.
Doc
---
subgroup.md matrix flips both rows to yes-on-all. Semantics sections describe each
backend lowering and call out the elect() lane-0-pinning narrowing of SPIR-V.
… + AMDGPU + SPIR-V

Replaces the SPIR-V-only `subgroup.inclusive_add(v)` with a portable sized variant implemented as a `@qd.func` Hillis-Steele scan over `shuffle_up`. This is the first slice of the planned migration of the inclusive_* / exclusive_* ops to a universal sized API; the other 6 inclusive_* ops still take `(value)` and lower via OpGroupNonUniformInclusiveScan on SPIR-V only.

Implementation
--------------

    @func
    def inclusive_add(value, log2_size: template()):
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        for i in static(range(log2_size)):
            offset = static(1 << i)
            partner = shuffle_up(value, u32(offset))
            if lane_in_group >= offset:
                value = value + partner
        return value

* `shuffle_up` is in uniform CF (every lane participates) - matches its documented contract on every backend.
* The `if lane_in_group >= offset` is per-lane arithmetic - no subgroup op inside the conditional.
* Cross-group `shuffle_up` partners are masked off by the lane_in_group guard, so groups smaller than the full subgroup compose correctly when log2_size < log2(group_size).

Backend cleanup
---------------
* Dropped `subgroupInclusiveAdd` from the SPIR-V codegen `inclusive_scan_ops` set in `quadrants/codegen/spirv/spirv_codegen.cpp` - that path is now unreachable for `inclusive_add`. The other 6 inclusive ops still go through that branch.
* Dropped `PER_INTERNAL_OP(subgroupInclusiveAdd)` from internal_ops.inc.h and `POLY_OP(subgroupInclusiveAdd, ...)` from type_system.cpp. No SPIR-V fast path left to keep alive.

Internal caller fix
-------------------
`quadrants.algorithms.PrefixSumExecutor` was passing `subgroup.inclusive_add` as a template-callable to `scan_add_inclusive`, which invokes it as `inclusive_add(val)` with one argument. After the API change this would raise a TypeError. Added a single-arg adapter `subgroup_inclusive_add_warp_i32` next to `warp_shfl_up_i32` in `_kernels.py` that calls `subgroup.inclusive_add(val, 5)` (log2_size=5 -> 32-lane warp/wave scan, matching WARP_SZ in the kernel), and routed the Vulkan branch to the adapter. The CUDA branch still uses `warp_shfl_up_i32` for now.

Tests
-----
`test_subgroup_inclusive_add` (arch=qd.gpu, parametrized over `log2_size in 1..5` and `dtype in {i32, i64, u64, f32, f64}`): runs the scan and verifies each lane's result against a Python running sum.

Doc
---
* Matrix flips the `inclusive_add` row to yes-on-all (with the same `*` AMDGPU perf-asterisk as `reduce_add`).
* Top-of-section text and "Performance notes" updated to reflect that `inclusive_add` now has a portable sized form, while the other inclusive_* ops are still mid-migration.
* The "Inclusive scan on SPIR-V" example now uses `inclusive_add(v, 5)` and works on every GPU backend.
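As a sanity check, the guarded Hillis-Steele recurrence can be modelled in plain Python, with a lane list standing in for the subgroup (hypothetical names; the real op runs per-lane inside a kernel):

```python
def shuffle_up(values, offset):
    # Lane i reads lane i - offset. Below lane 0 the result is
    # implementation-defined; modelled as the lane's own value here
    # (the guard ensures those lanes never use the partner anyway).
    return [values[i - offset] if i - offset >= 0 else values[i]
            for i in range(len(values))]

def inclusive_add(values, log2_size):
    """Per-group inclusive prefix sum over 2**log2_size-lane windows."""
    vals = list(values)
    n, group_mask = len(vals), (1 << log2_size) - 1
    for i in range(log2_size):
        offset = 1 << i
        partner = shuffle_up(vals, offset)
        # the lane_in_group >= offset guard masks off cross-group partners
        vals = [vals[l] + partner[l] if (l & group_mask) >= offset else vals[l]
                for l in range(n)]
    return vals

src = list(range(1, 33))          # one 32-lane subgroup
for log2_size in range(1, 6):
    size = 1 << log2_size
    got = inclusive_add(src, log2_size)
    for lane, v in enumerate(got):
        base = lane - (lane % size)               # first lane of this group
        assert v == sum(src[base:lane + 1])       # Python running sum
```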
… AMDGPU + SPIR-V

Slice 2 of the inclusive_* / exclusive_* migration: extends the same portable @qd.func Hillis-Steele pattern from `inclusive_add` (slice 1) to the other six inclusive ops, sharing a single `_inclusive_scan` helper.

Implementation
--------------

    @func
    def _inclusive_scan(value, op: template(), log2_size: template()):
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        for i in static(range(log2_size)):
            offset = static(1 << i)
            partner = shuffle_up(value, u32(offset))
            if lane_in_group >= offset:
                value = op(value, partner)
        return value

    @func
    def inclusive_add(v, log2_size):
        return _inclusive_scan(v, _bin_add, log2_size)

    @func
    def inclusive_mul(v, log2_size):
        return _inclusive_scan(v, _bin_mul, log2_size)

    ... (min / max / and / or / xor follow the same one-line pattern)

The seven `_bin_*` are tiny @func wrappers around `+`, `*`, `min(a,b)`, `max(a,b)`, `a & b`, `a | b`, `a ^ b`. Each is passed as a template-callable to `_inclusive_scan` and gets inlined at trace time, so the public API has the same cost as the slice 1 inline scan: log2_size shuffle+op pairs, no runtime indirection.

This also refactors the existing `inclusive_add` (which lived inline in slice 1) onto the shared helper, so all seven scans live in one place. The externally-observable behaviour of `inclusive_add` is unchanged.

Backend cleanup
---------------
* Removed the entire `inclusive_scan_ops` / `OpGroupNonUniformInclusiveScan` branch from `quadrants/codegen/spirv/spirv_codegen.cpp` - all seven ops now go through the portable Python path on every backend, including SPIR-V.
* Removed the six remaining `subgroupInclusive{Mul,Min,Max,And,Or,Xor}` entries from `internal_ops.inc.h` and `type_system.cpp`.

Tests
-----
* Added `test_subgroup_inclusive_{mul,min,max,and,or,xor}` (arch=qd.gpu), each parametrized over `log2_size in 1..5` and a per-op dtype list:
  - `_mul`: i32, f32, f64 (inputs clamped to [1, 4] so the 32-way product fits i32).
  - `_min` / `_max`: i32, f32, f64 (varied non-monotonic inputs).
  - `_and` / `_or` / `_xor`: i32, i64, u64 (bit-varied inputs).
* Refactored the existing `test_subgroup_inclusive_add` to share a small `_check_inclusive_scan` helper with the new tests; the dtype matrix is unchanged (i32, i64, u64, f32, f64).

Doc
---
* Matrix flips all six remaining `inclusive_*` rows to yes-on-all (with `*` for AMDGPU - same ds_bpermute perf note as `inclusive_add`).
* Section header collapses the seven ops into a single block: same shape, only the operator differs.
* Performance notes call out that `OpGroupNonUniformInclusiveScan` is no longer used on SPIR-V even though it was supported - the trade-off is uniform cost across backends.

The `exclusive_*` ops are still TODO stubs - that's slice 3.
…s i32

The previous `(i % 4) + 1` pattern produced cycles of 1*2*3*4 = 24 per group of 4; over 28 lanes that's 24^7 ≈ 4.6e9, which overflows i32 (and was the only failure in the cuda-side slice 2 run). Replace with `2 if i % 4 == 0 else 1`: at most 8 twos in 32 lanes → product ≤ 2**8 == 256, well within i32 and exact in f32.
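The overflow arithmetic is quick to verify in plain Python:

```python
I32_MAX = 2**31 - 1

# old pattern: (i % 4) + 1 cycles through 1, 2, 3, 4; each group of 4 multiplies in 24
old = [(i % 4) + 1 for i in range(32)]
prod = 1
for v in old[:28]:            # inclusive product at lane 27, i.e. over 28 lanes
    prod *= v
assert prod == 24**7 == 4586471424
assert prod > I32_MAX         # overflows i32

# new pattern: a 2 at every 4th lane, 1 elsewhere -> at most 8 twos in 32 lanes
new = [2 if i % 4 == 0 else 1 for i in range(32)]
full = 1
for v in new:
    full *= v
assert full == 2**8 == 256    # exact in i32 and in f32
```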
Slice 3 (final) of the inclusive_* / exclusive_* migration: replaces the seven TODO-stub `exclusive_*` functions with portable @qd.func implementations layered on top of the inclusive scans from slice 2.

Implementation
--------------

    @func
    def _exclusive_scan(value, op: template(), identity, log2_size: template()):
        inc = _inclusive_scan(value, op, log2_size)
        shifted = shuffle_up(inc, u32(1))
        lane_in_group = invocation_id() & ((1 << log2_size) - 1)
        result = shifted
        if lane_in_group == 0:
            result = identity
        return result

The lane-0 substitution is required: `shuffle_up` with offset 1 is implementation-defined at lane 0 (and `OpGroupNonUniformShuffleUp` calls it undefined outright), so we cannot rely on whatever the hardware happens to produce there.

The identity for each op is supplied as a runtime expression in `value`'s dtype, derived from `value` itself so the wrapper does not need to inspect the dtype:

    add:  value - value        (zero)
    mul:  value - value + 1    (one - the literal +1 takes value's dtype)
    or:   value ^ value        (zero)
    xor:  value ^ value        (zero)
    and:  ~(value ^ value)     (all bits set)

For `min` and `max` there is no portable type-extreme that can be derived from `value` alone, so those two ops take an explicit `identity` argument:

    exclusive_min(v, log2_size, identity)  # pass +inf or dtype max
    exclusive_max(v, log2_size, identity)  # pass -inf or dtype min

Cost per call: one inclusive scan (`log2_size` shuffle+op pairs) plus one extra `shuffle_up` and a per-lane select.

Tests
-----
* Added `test_subgroup_exclusive_{add,mul,min,max,and,or,xor}` (arch=qd.gpu), each parametrized over `log2_size in 1..5` and a per-op dtype list:
  - `_add`: i32, i64, u64, f32, f64
  - `_mul`: i32, f32, f64 (inputs bounded so the 32-way product fits i32)
  - `_min` / `_max`: i32, f32, f64 (caller passes an explicit identity)
  - `_and` / `_or` / `_xor`: i32, i64, u64
* A shared `_check_exclusive_scan` helper drives the kernel launch, dtype skip, and per-lane verification: lane 0 must equal the supplied identity, lane k>0 must equal the op-reduce of `src[0..k)`.

Doc
---
* Matrix gains all seven `exclusive_*` rows, all yes-on-all (with `*` for AMDGPU, same as inclusive_*).
* New section describes the shared shuffle_up + select pattern, the per-op identity expressions, and why min/max take explicit identities.
* The old "exclusive_*, all_true, any_true, all_equal" TODO-stub section is trimmed down to just the three remaining stubs.
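A plain-Python model of the shift-plus-identity construction (hypothetical helper names, not the qd API) shows both the lane-0 patch and the derived identities:

```python
def shuffle_up(values, offset):
    # below lane 0 the result is implementation-defined (modelled as the
    # lane's own value), which is exactly why lane 0 must be patched
    return [values[i - offset] if i - offset >= 0 else values[i]
            for i in range(len(values))]

def inclusive_scan(values, op, log2_size):
    vals = list(values)
    n, m = len(vals), (1 << log2_size) - 1
    for i in range(log2_size):
        off = 1 << i
        part = shuffle_up(vals, off)
        vals = [op(vals[l], part[l]) if (l & m) >= off else vals[l]
                for l in range(n)]
    return vals

def exclusive_scan(values, op, identity, log2_size):
    inc = inclusive_scan(values, op, log2_size)
    shifted = shuffle_up(inc, 1)
    m = (1 << log2_size) - 1
    # lane 0 of each group must be patched: shuffle_up(_, 1) is undefined there
    return [identity if (l & m) == 0 else shifted[l]
            for l in range(len(values))]

src = [3, 1, 4, 1, 5, 9, 2, 6]
out = exclusive_scan(src, lambda a, b: a + b, 0, 3)
assert out == [0, 3, 4, 8, 9, 14, 23, 25]   # lane k holds sum(src[0..k))

# the value-derived identities from the commit message
v = 7
assert v - v == 0                              # add
assert v - v + 1 == 1                          # mul
assert v ^ v == 0                              # or / xor
assert ~(v ^ v) & 0xFFFFFFFF == 0xFFFFFFFF     # and (all bits set, as u32)
```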
… scans
Both `_check_inclusive_scan` and `_check_exclusive_scan` previously verified only
the first group's worth of lanes (lanes 0..group_size-1). Two coverage gaps:
1. For log2_size < 5, multiple independent groups of 2**log2_size lanes share
a single 32-lane subgroup. The `lane_in_group >= offset` mask is what
isolates them from each other - and that mask was completely untested.
A bug there would have silently passed.
2. The 64-lane launch produces two independent 32-lane subgroups (lanes 0-31
and 32-63) running the same scan side by side. Cross-subgroup leakage
in the underlying shuffle_up (e.g. an AMDGPU ds_bpermute with the wrong
mask) would not have been caught.
Both helpers now iterate over every (group, in-group-lane) pair across the full
64-lane launch and verify the expected per-lane value, recomputing the running
op-reduce from `src[group_base..]` at each group boundary.
Coverage delta: with log2_size=1 the old test verified 2 of 64 lanes; the new
test verifies all 64. At log2_size=3, 8 of 64 -> 64 of 64. At log2_size=5,
32 of 64 -> 64 of 64 (still the same group_size, but the second subgroup is
now exercised).
Validated on the cluster: all 230 scan tests (115 inclusive + 115 exclusive)
pass with the extended verification on CUDA and on Vulkan; the slice 1/2/3
implementations were already correct, this just closes the test gap.
…al fix)

`exclusive_*` scans all fail on the Metal backend (via MoltenVK), with the `got` value at lane 1 of each group being whatever the inclusive scan would produce *if the lane-0 conditional update had been applied unconditionally* (e.g. `inc[0] = src[0] op src[0]` instead of `inc[0] = src[0]`). For non-idempotent ops this is visibly wrong; for `and` / `or` it accidentally matches at group 0 because `x op x = x`. Inclusive scans pass because nothing downstream re-reads `inc[0]` across lanes.

Root cause is reconvergence in MoltenVK's SPIR-V → MSL lowering of the pattern `if lane_in_group >= offset: value = op(value, partner)` followed by another subgroup op (the next loop iteration's `shuffle_up`, or the `shuffle_up(inc, 1)` inside `_exclusive_scan`): lanes that took the false branch end up reading stale register state from the subsequent shuffle.

Fix: replace both conditional updates (`if`-then-assignment) with `qd.select`, which lowers to `OpSelect` and keeps every lane in straight-line code. `op(value, partner)` is pure, so unconditional evaluation is safe. Adds a comment explaining the choice.

Validated:
- CUDA simt scans: 280/280 pass
- Vulkan simt scans: 280/280 pass
- CUDA scan+sort: 65/65 pass
- Vulkan scan+sort: 65/65 pass
Replaces the long-standing TODO stubs with portable @qd.func implementations plus a CUDA fast path at full-warp size.

API:
- `subgroup.all_true(predicate, log2_size)` -- AND-reduce of `predicate != 0` across each `2**log2_size` group; returns `i32(0|1)` broadcast to every lane.
- `subgroup.any_true(predicate, log2_size)` -- OR-reduce, same shape.
- `subgroup.all_equal(value, log2_size)` -- broadcasts group-lane-0's value, then AND-reduces the per-lane equality bit. Equality is the backend's native `==` (NaN != NaN, +0.0 == -0.0), matching SPIR-V `OpGroupNonUniformAllEqual`.

CUDA shortcut: at trace time, `qd.static()` on `current_cfg().arch` plus the compile-time `log2_size` selects `cuda_all_sync_i32` / `cuda_any_sync_i32` when `log2_size == 5`, so full-warp uses lower to a single `vote.all` / `vote.any` instruction with no branch in the IR. `all_equal` inherits the shortcut transitively via `all_true`. We deliberately do not wire `__match_all_sync`, because it requires sm_70+ and uses bit-equality on floats, contradicting the documented `OpGroupNonUniformAllEqual` semantics.

Every other backend (Vulkan, Metal, AMDGPU), and CUDA at `log2_size < 5`, falls back to a portable `shuffle_xor` butterfly: `log2_size` shuffles plus `log2_size` ANDs / ORs, fully unrolled into the calling kernel's IR (same shape as `reduce_all_add`). No C++ codegen changes.

Tests cover all-true / all-false / one-odd-lane-in-one-group / sparse-pattern scenarios for `all_true` and `any_true`, and all-same / all-distinct / same-per-group / one-outlier-per-group for `all_equal`. Each scenario verifies every group across the full 64-lane launch (so the launch spans two CUDA / Metal / RDNA subgroups, exercising both partial-subgroup multi-group and cross-subgroup behaviour).
Validated:
- CUDA simt: 369/370 (+ 1 expected skip)
- Vulkan simt: 350/370 (+ 20 expected MoltenVK skips)
- CUDA scan+sort: 65/65
- Vulkan scan+sort: 65/65

Doc: `docs/source/user_guide/subgroup.md` updated -- support matrix, dedicated section per op, and CUDA-shortcut rationale.
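The portable fallback path can be modelled in plain Python; a lane-list sketch of the shuffle_xor butterfly (hypothetical names, not the qd API):

```python
def shuffle_xor(values, mask):
    # lane i reads the value held by lane i ^ mask
    return [values[i ^ mask] for i in range(len(values))]

def all_true(preds, log2_size):
    """AND-reduce (pred != 0) over each 2**log2_size window; every lane
    ends with the window result (the xor partners stay inside the window)."""
    bits = [1 if p != 0 else 0 for p in preds]
    for i in range(log2_size):
        partner = shuffle_xor(bits, 1 << i)
        bits = [b & q for b, q in zip(bits, partner)]
    return bits

def any_true(preds, log2_size):
    """OR-reduce, same butterfly shape."""
    bits = [1 if p != 0 else 0 for p in preds]
    for i in range(log2_size):
        partner = shuffle_xor(bits, 1 << i)
        bits = [b | q for b, q in zip(bits, partner)]
    return bits

# one false lane in the second group of 4 kills only that group's all_true
assert all_true([1, 1, 1, 1, 1, 0, 1, 1], 2) == [1, 1, 1, 1, 0, 0, 0, 0]
# a nonbinary truthy value (-3) counts as true for any_true
assert any_true([0, 0, 0, 0, 0, -3, 0, 0], 2) == [0, 0, 0, 0, 1, 1, 1, 1]
```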
The previous commit replaced `if` with `qd.select` in the scan helpers, but `OpSelect` on MoltenVK / Metal silently returns the false-branch value when an operand is an f32 produced by a shuffle intrinsic.

Revert `_inclusive_scan` back to `if`, which works correctly on its own. For `_exclusive_scan`, restructure to shift the input before the inclusive scan (shuffle_up → fill lane 0 with identity → inclusive scan) instead of running the inclusive scan and then shuffling the result. The old pattern triggered a separate Metal SPIR-V misoptimization where the register holding the inclusive result was clobbered when only consumed by a shuffle intrinsic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two coverage gaps surfaced during a post-merge audit:

* `all_true` / `any_true` were only ever exercised with predicate values 0 or 1, so the `i32(predicate != 0)` cast was untested. Adds a `nonbinary-mixed` scenario (`[((i*17) % 13) - 6 for i in range(N)]` -- mixes 0, positives, and negatives) to both tests.
* `all_equal` on floats was documented as "NaN != NaN, +0.0 == -0.0" (matching `OpGroupNonUniformAllEqual`) but no test pinned the contract down. Adds `test_subgroup_all_equal_float_contract` (f32 + f64 x log2_size 1..5) covering: ±0 mixed in every group -> 1; NaN at every group start -> 0; NaN at a single lane -> only that group is 0; all NaN -> every group 0.

These also lock the door against a future refactor swapping in `__match_all_sync` on CUDA (which would silently regress to bit-equality on floats).

Validated: 45/45 voting tests on CUDA and Vulkan (was 35/35, + 10 new from the float-contract scenarios).
* black auto-reformats in `subgroup.py` and `test_simt.py` (line-length=120 per `.pre-commit-config.yaml`).
* clang-format auto-reformats in `codegen_amdgpu.cpp` and `spirv_codegen.cpp`.
* Drop the unused `from quadrants.lang.simt import subgroup` from `_algorithms.py` (left over after the switch to `subgroup_inclusive_add_warp_i32`); ruff re-sorts the remaining import block.
* Extend the file-level pyright comment in `subgroup.py` from `reportInvalidTypeForm=false` to also disable `reportOperatorIssue`, so that `p & shuffle_xor(...)` / `p | shuffle_xor(...)` in the new voting ops don't trip pyright on `Expr` operator overloads — the same false-positive class the existing suppression already covers.

Pre-commit (black, clang-format, ruff, pylint, trailing-whitespace, end-of-file) is clean. Pyright is down to 6 pre-existing errors in files this branch does not touch (`_tensor_wrapper.py`, `_func_base.py`, `_metal_interop.py`, all from PR #618 / streams work) — net 0 new errors attributable to this branch.
The voting / scan / data-movement work landed with prose wrapped at the AI-default ~80-95c instead of the project's 120c (per the `pre-commit` black config `-l 120`). Reflow the affected runs in:

* `python/quadrants/lang/simt/subgroup.py` — module-level voting / inclusive / exclusive backend-strategy comments, plus the `elect`, `all_true`, `any_true`, `all_equal`, `broadcast_first`, `_inclusive_scan`, all `inclusive_*` / `exclusive_*` op docstrings, and `_exclusive_scan` / `shuffle_xor`.
* `tests/python/test_simt.py` — voting / scan section comments, scan verification rationale, voting predicate-truthy / float-contract notes, and the `test_subgroup_sync` / `_mem_fence` / `_group_size` / `_elect` / `_barrier_deprecation_warn_once` / `_memory_barrier_deprecation_warn_once` docstrings.
* `python/quadrants/_kernels.py` — the `subgroup_inclusive_add_warp_i32` adapter docstring.
* `python/quadrants/algorithms/_algorithms.py` — the comment explaining the warp-i32 adapter usage in `PrefixSumExecutor`.

No semantic changes; black / pre-commit / pyright are still clean. Audited via `find_underwrapped --diff origin/main`: the remaining flagged runs are all at ~110-120c (only minor packing imbalance, max ≤ 123c) — no AI-default 80c under-wrapping remains in this branch's diff.
The CI wrap-checker flagged three C++ comment blocks in PR #665 still wrapped near ~80c (`runtime.cpp:1033`, `runtime.cpp:1136`, `codegen_amdgpu.cpp:507`). While in there I audited the rest of the new C++ subgroup commentary and the per-op intrinsic notes, and reflowed them to the project's 120c target. Also tightened a couple of Python lines that crept past 120c (one f-string docstring, one explanatory comment in test_simt.py). No semantic changes.
CI wrap-checker on PR #665 flagged three more docstring blocks wrapping at 83-87c instead of 120c (`exclusive_add`, `test_subgroup_sync`, `test_subgroup_mem_fence`). Reflow those. No semantic changes.
Stale carry-over from the days when several ops were one-backend stubs; no longer applies now that everything in the doc is universal.
Stacked on hp/cross-gpu-subgroup; same shape as the existing `reduce_add` / `reduce_all_add` pair:

* `reduce_min(v, log2_size)` / `reduce_max(v, log2_size)` — `shuffle_down` tree; the result is valid in lane 0 of each `2**log2_size` group.
* `reduce_all_min(v, log2_size)` / `reduce_all_max(v, log2_size)` — `shuffle_xor` butterfly; the result is broadcast to every lane.

Both forms unroll into exactly `log2_size` shuffle+min (or +max) pairs in the calling kernel's IR — no kernel-launch overhead, no separate runtime symbol. Lowers to backend-specific min/max intrinsics (`fminnm` / `fmaxnm` on PTX, `llvm.minnum` / `llvm.maxnum` on AMDGPU, `OpFMin` / `OpFMax` on SPIR-V); float-NaN handling is documented as implementation-defined.

Tests: parametrized as `qd.gpu` over `i32` / `i64` / `u64` / `f32` / `f64` and `log2_size` in `[1..5]`, verifying every group across the full 64-lane launch.

Doc: new rows in the `subgroup.md` Reductions/scans table; new per-op sections; the "removed" note is updated to drop `reduce_min` / `reduce_max` (now portable).
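A plain-Python sketch of the two reduction shapes (hypothetical names; the real ops run per-lane in a kernel):

```python
def reduce_min(values, log2_size):
    """shuffle_down tree: after log2_size rounds the window minimum is valid
    in lane 0 of each 2**log2_size window (other lanes hold garbage)."""
    vals = list(values)
    n = len(vals)
    for i in reversed(range(log2_size)):
        off = 1 << i
        # shuffle_down: lane l reads lane l + off; past the top lane is
        # implementation-defined (modelled as the lane's own value)
        part = [vals[l + off] if l + off < n else vals[l] for l in range(n)]
        vals = [min(v, p) for v, p in zip(vals, part)]
    return vals

def reduce_all_min(values, log2_size):
    """shuffle_xor butterfly: the window minimum ends up in every lane."""
    vals = list(values)
    for i in range(log2_size):
        part = [vals[l ^ (1 << i)] for l in range(len(vals))]
        vals = [min(v, p) for v, p in zip(vals, part)]
    return vals

src = [9, 4, 7, 2, 8, 1, 6, 3]
assert reduce_min(src, 3)[0] == 1            # whole-subgroup min, lane 0 only
r = reduce_min(src, 2)                       # two windows of 4
assert r[0] == 2 and r[4] == 1               # lane 0 of each window
assert reduce_all_min(src, 3) == [1] * 8     # broadcast form
```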
Implement a portable ballot operation that returns a u32 bitmask where bit i is set if lane i's predicate is non-zero. Works across CUDA (`__ballot_sync`), AMDGPU (`amdgcn_ballot.i32`), and SPIR-V/Vulkan (`OpGroupNonUniformBallot`).

Follows the same cross-backend pattern as `subgroup.shuffle`: a single Python API (`subgroup.ballot`) dispatches to the appropriate backend intrinsic at codegen time. On AMDGPU CDNA with 64-wide wavefronts only the low 32 bits are returned, consistent with the u32 return type.
The macOS build was failing because spirv_codegen.cpp accessed IRBuilder::t_v4_uint_ directly, which is a private member. Add a public v4_u32_type() accessor following the existing pattern (u32_type(), bool_type(), etc.) and use it from the ballot lowering.
Per-lane inclusive sum scoped to 2**log2_size lanes, where every lane with head_flag != 0 resets the running sum. Useful for stream compaction and sparse / variable-length records.

Implementation: one subgroup.ballot(head_flag != 0) to materialise a u32 of head positions, then a Hillis-Steele inclusive sum bounded by `distance >= offset` (distance = lane - segment_head, with segment_head = 31 - clz(effective_mask & ((1 << (lane + 1)) - 1)) and a virtual head OR-injected at group_base so `lower` is always non-zero).

Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size adds, fully unrolled. Same shape as inclusive_add with a single-instruction setup.

Tests: parametrized over the standard dtypes (i32 / i64 / u64 / f32 / f64) and log2_size in [0..5], plus three contract tests (no head flags -> equivalent to inclusive_add; every lane is a head -> output equals input; truthy non-binary head_flag values).

Doc: new row in the Reductions/scans table; new per-op section after reduce_all_min / reduce_all_max.
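The head-distance construction can be modelled in plain Python (hypothetical names; a lane list stands in for the subgroup):

```python
def segmented_inclusive_add(src, heads, log2_size):
    n = len(src)                    # model assumes n <= 32 (one u32 ballot)
    size = 1 << log2_size

    # ballot: bit i set iff lane i's head_flag is truthy
    ballot = sum(1 << i for i, h in enumerate(heads) if h != 0)

    # per-lane distance to the nearest segment head at or below the lane;
    # a virtual head is OR-injected at each group base so `lower` is non-zero
    dist = []
    for lane in range(n):
        group_base = lane & ~(size - 1)
        lower = (ballot | (1 << group_base)) & ((1 << (lane + 1)) - 1)
        seg_head = 31 - (32 - lower.bit_length())   # 31 - clz(lower)
        dist.append(lane - seg_head)

    # Hillis-Steele sum bounded by `distance >= offset`
    vals = list(src)
    for i in range(log2_size):
        off = 1 << i
        part = [vals[l - off] if l >= off else vals[l] for l in range(n)]  # shuffle_up
        vals = [vals[l] + part[l] if dist[l] >= off else vals[l] for l in range(n)]
    return vals

ones = [1] * 8
assert segmented_inclusive_add(ones, [0] * 8, 3) == [1, 2, 3, 4, 5, 6, 7, 8]
assert segmented_inclusive_add(ones, [1] * 8, 3) == ones   # every lane a head
assert segmented_inclusive_add(ones, [0, 0, 0, 1, 0, 0, 0, 0], 3) == [1, 2, 3, 1, 2, 3, 4, 5]
```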
`qd.clz(u32_value)` was emitting QD_NOT_IMPLEMENTED on CUDA and produced undefined results on SPIR-V (GLSL.std.450 FindSMsb is undefined for the all-bits-set case). The new `subgroup.segmented_reduce_add` is the first user of `clz` in the codebase and exposed both bugs.

* CUDA: route u32 / u64 inputs through the same `__nv_clz` / `__nv_clzll` intrinsics used for i32 / i64 — the underlying bit pattern is what matters; the C declaration on signed types is a header-level convention.
* SPIR-V: dispatch to FindUMsb (#75) for unsigned inputs and FindSMsb (#74) for signed. The two GLSL.std.450 instructions return a value of the same type as their operand, so add an explicit OpBitcast back to i32 before the `32 - msb - 1` subtraction (otherwise SPIR-V's strict-type `sub` asserts on mixed i32 / u32).
* Python: in `segmented_reduce_add`, wrap `clz`'s result in `i32(...)` so the subsequent arithmetic is uniformly signed-32-bit (the trace-time tracer would otherwise propagate u32 from the input through to the subtraction, hitting SPIR-V's same-type assertion).

Tests: `subgroup.segmented_reduce_add` tests now pass on CUDA + Vulkan across i32 / i64 / u64 / f32 / f64 and `log2_size` in [0..5], including the all-heads, no-heads, and truthy-predicate edge cases.
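The `32 - msb - 1` recovery of clz from an msb-style query can be checked in plain Python (`int.bit_length` standing in for FindUMsb):

```python
def clz32_unsigned(x):
    """Count leading zeros of a u32 via the highest-set-bit index:
    clz(x) = 32 - msb - 1, with msb = -1 for x == 0 (FindUMsb convention)."""
    msb = x.bit_length() - 1       # index of the highest set bit, -1 for 0
    return 32 - msb - 1

assert clz32_unsigned(1) == 31
assert clz32_unsigned(0x80000000) == 0
# the all-bits-set case: fine through an unsigned msb query, but undefined
# through FindSMsb, which sees 0xFFFFFFFF as signed -1
assert clz32_unsigned(0xFFFFFFFF) == 0
assert clz32_unsigned(0) == 32
```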
New naming convention for ``qd.simt.subgroup``:

- **Full-subgroup ops have no suffix.** ``reduce_add(v)`` / ``all_true(p)`` / ``inclusive_add(v)`` / etc. take no ``log2_size`` argument and operate over every lane in the active subgroup.
- **Tiled (multi-window) ops have a ``_tiled`` suffix.** ``reduce_add_tiled(v, log2_size)`` / ``all_true_tiled(p, log2_size)`` / etc. split the subgroup into independent ``2**log2_size``-aligned windows.

This flips the previous convention, where the no-suffix forms took ``log2_size`` and ``_full``-suffixed forms covered the whole subgroup. The renames are mechanical: ``<op>(v, log2_size)`` -> ``<op>_tiled(v, log2_size)`` and ``<op>_full(v)`` -> ``<op>(v)``, applied to 26 ops across the reduce / scan / vote families plus the three private helpers (``_inclusive_scan`` / ``_exclusive_scan`` / ``_segment_head_distance``). ``ballot_full`` also drops its ``_full`` (full-subgroup -> no suffix); ``ballot_full_subgroup`` keeps its deprecation alias forwarding to the new ``ballot``.

This is a breaking change for any caller that hard-coded ``subgroup.reduce_add(v, 5)`` (or similar): either drop the argument (full-subgroup form, which picks ``log2_size`` from ``log2_group_size()`` per backend) or add the ``_tiled`` suffix to keep the old behaviour on wave64.

Updated callers in this repo: ``_kernels.subgroup_inclusive_add_warp_i32`` now wraps ``inclusive_add_tiled(val, 5)`` (preserving the original warp-scoped semantics); the prefix-sum executor in ``algorithms/_algorithms.py`` renames its scan-primitive local.

Doc + test renames track the API rename throughout: the support-matrix tables in ``subgroup.md`` flip column order (no-suffix form primary, ``_tiled`` form secondary), the "Full-subgroup (_full) variants" section is rewritten as "Full-subgroup no-suffix wrappers", and every test function in ``test_simt.py`` that exercises a renamed op picks up the matching ``_tiled`` / no-suffix name.
…framing

Restructures subgroup.md so the full-subgroup form is the natural primary form throughout, instead of constantly being called out as the "no-suffix" wrapper alongside the ``_tiled`` form.

Concrete changes in ``subgroup.md``:

- Drop the dedicated "Full-subgroup no-suffix wrappers" section. The mapping table it contained was just enumerating the suffix rule.
- Drop the standalone "How ``log2_size`` windowing works" section near the top. The doc no longer needs to explain windowing upfront -- windowing only matters when the reader specifically reaches for a ``_tiled`` variant.
- Add a single "Tiled variants" section that consolidates the windowing table, the result-placement notes (broadcast-to-all vs. window-local lane 0), and the lowering / cost details. This is the only place in the doc that has to explain how ``log2_size`` carves up a subgroup.
- Flip the per-op semantic sections (``### reduce_add(value)`` etc.) so the heading and prose lead with the full-subgroup form and the ``_tiled`` variant is mentioned in a single sentence with a link to the new section. Same applies to ``reduce_all_*``, ``reduce_min/max``, ``segmented_reduce_*``, ``inclusive_*``, ``exclusive_*``, ``all_true``, ``any_true``, ``all_equal``.
- Drop the ``Tiled form`` column from both support matrices (voting / reductions); it duplicated info the suffix rule already conveys. Tables now have a single op column with the full-subgroup form, plus a one-liner under the table pointing at the Tiled variants section.
- Rewrite the naming-note bullets to spell out the convention once (full subgroup => no suffix; tile => ``_tiled``) and drop repetitive "no-suffix" phrasing.

Code-side polish (no API changes): drop "no-suffix" phrasing from the ``ballot_full_subgroup`` deprecation message + docstring, the full-subgroup wrappers section comment in ``subgroup.py``, the ``group_size()`` docstring, ``program.h``'s comment, and the ``_check_full_matches_tiled`` test-helper section comment.
…ants section

Add a complete table of every ``_tiled`` op in the Tiled variants section, mapping each to its full-subgroup form and tagging its result placement (broadcast-to-all vs. window-local lane 0). Replaces the prose bullets that previously listed the families without enumerating individual ops, so callers can scan the whole tiled API surface in one place.

Also folds the ``log2_size`` template-parameter contract (compile-time ``int``, range per op family, no runtime check on overshoot) into the section so the reader doesn't have to chase it across the per-op sections.
Remove the bullet pointing at the CUDA-only ``qd.simt.warp.*`` namespace from the See also list -- it's no longer a useful redirect now that the portable ``subgroup.*`` API covers the same ground.
In the Performance notes section, refer to ``reduce_add`` / ``reduce_all_add`` / ``inclusive_*`` directly instead of their ``_tiled`` counterparts. The cost figures are identical between the two forms (both compile down to ``log2_group_size()`` shuffles), so the full-subgroup names are the cleaner reference -- and consistent with the "full subgroup is the default" framing the rest of the doc now uses. The cost contract for the ``_tiled`` form is captured in a single clarifying tail clause on the ``reduce_*`` bullet.
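The "both compile down to `log2_group_size()` shuffles" cost claim can be sanity-checked with a toy lane model of the `shuffle_xor` butterfly (illustrative only; a real subgroup exchanges registers between lanes, not list elements):

```python
def reduce_all_add_butterfly(lanes):
    """Butterfly reduction: after log2(width) xor-shuffle rounds every
    lane holds the full sum."""
    width = len(lanes)                 # must be a power of two
    out = list(lanes)
    offset, steps = 1, 0
    while offset < width:
        # each lane adds the value held by its xor-partner lane
        out = [out[i] + out[i ^ offset] for i in range(width)]
        offset <<= 1
        steps += 1
    return out, steps

vals, steps = reduce_all_add_butterfly(list(range(32)))
print(vals[0], steps)   # 496 5 -- sum(range(32)) in log2(32) rounds
```

A `_tiled` form with `log2_size == k` simply stops the loop after `k` rounds, which is why the cost figures for the two forms coincide at full width.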
The identity for an exclusive scan is uniquely determined by the operator and the value's dtype, so requiring callers to pass it for min / max only (while the other five exclusive ops derive it implicitly via dtype-preserving arithmetic on `value`) was an inconsistency rather than a feature. Other library APIs that take an explicit `init` (Thrust, CUB, std::exclusive_scan) accept it for every op, not just min / max; HW group intrinsics (SPIR-V OpGroupNonUniform*, CUDA cg::exclusive_scan) take none at all. Our split-the-difference shape made min / max gratuitously different.

Auto-derive the identity from the value's dtype at compile time:

- For real dtypes: +inf for min, -inf for max.
- For integer dtypes: np.iinfo(dtype).max for min, np.iinfo(dtype).min for max (0 for unsigned max).

Implementation: exclusive_min_tiled and exclusive_max_tiled drop their @func decorator and become plain Python wrappers that introspect value.ptr.get_rvalue_type(), emit a typed-constant identity Expr via make_constant_expr, and call the shared @func _exclusive_scan_tiled. The identity is a compile-time IR constant, so the generated code is identical to the prior hand-supplied form -- no runtime cost. The other five exclusive ops keep their existing @func + value-arithmetic identity (value - value, etc.).

Doc / table updates collapse exclusive_{add,mul,min,max,and,or,xor} into a single row in the Reductions and scans table, and the per-op section now documents the dtype-derived identity per operator.

Breaking change: callers passing identity to exclusive_min / exclusive_max (and their _tiled forms) must drop the argument.
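The derivation rule can be sketched in pure Python (mirroring what `np.iinfo` / +-inf give, without depending on the real dtype objects; `bits` / `signed` stand in for a dtype descriptor):

```python
import math

def exclusive_scan_identity(op, bits=None, signed=True):
    """Identity for exclusive min / max, derived from operator + dtype.

    bits is None for real (floating-point) dtypes.
    """
    if bits is None:                               # real dtypes: +-inf
        return math.inf if op == "min" else -math.inf
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1                # unsigned range
    return hi if op == "min" else lo               # iinfo.max / iinfo.min

print(exclusive_scan_identity("min", bits=32, signed=False))  # 4294967295
print(exclusive_scan_identity("max", bits=32, signed=False))  # 0
print(exclusive_scan_identity("min", bits=32))                # 2147483647
print(exclusive_scan_identity("max"))                         # -inf
```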
Apply the same {brace,list} notation already used in the Reductions and scans
support matrix to two other tables:
1. Voting and predicate ops: collapse the 5 lanemask rows into a single
`subgroup.lanemask_{lt,le,eq,gt,ge}(lane_id)` row, and the all_true /
any_true rows into `subgroup.{all,any}_true(predicate)` (with the same
brace pattern threaded through the CUDA fast-path cell).
2. Supported _tiled ops table: collapse from 26 rows to 10 by family-grouping
reduce_{min,max}, reduce_all_{min,max}, segmented_reduce_{min,max},
inclusive_{add,mul,min,max,and,or,xor}, exclusive_{add,mul,min,max,and,or,xor}.
No semantic change -- only table layout.
Reorder the Reductions and scans support matrix (and the Supported _tiled ops
table, for consistency) so all three _add forms appear contiguously --
reduce_add, reduce_all_add, segmented_reduce_add -- followed by the three
_{min,max} forms. Previously these were interleaved by family
(reduce / reduce_all / segmented_reduce) which made it harder to scan for
"is _add supported here?" vs "are _{min,max} supported?".
The full AMDGPU shuffle lowering (ds_bpermute / permlane64 / cycle counts) is already covered in the Performance notes section at the end of the page. Shrink the asterisk-footnote under the Data movement table to a one-line pointer to that section so we don't tell the same story twice.
The Voting and predicate ops table now describes only full-subgroup ops, but the paragraph that follows the table mentioned the all_true_tiled / any_true_tiled fallback at log2_size == 5. That tiled detail is already covered in the per-op semantic section (`all_true(predicate)` / `any_true(predicate)`). Drop it from the table's caption so the text matches the table's full-subgroup scope.
…ps table

Every row in the table just stripped the _tiled suffix and the log2_size argument to recover the full-subgroup form -- a pure mechanical mapping, with no information not already implied by the op name. Drop the column.
…-all -> broadcast-to-tile)

The _tiled suffix on the public API names the tile, but the prose was still calling them "windows" in many places, and "broadcast-to-all" was ambiguous about whether the broadcast was within a tile or across the whole subgroup. Standardize on:

- "tile" instead of "window" everywhere (prose, table column headers, bullets, per-op semantic sections, the log2_size table, the _tiled section heading copy in subgroup.py and test_simt.py).
- "broadcast-to-tile" instead of "broadcast-to-all" in the Result placement column (and the bullet that explains it).
- "tile-local lane 0" instead of "window-local lane 0".

No code or test behaviour changes; doc and comment vocabulary only.
The combined section interleaved each op's lowering bullets, which made it awkward to scan for "what does sync() lower to on AMDGPU?" without first sifting past the mem_fence bullets. Split into two sections, one per op, with each op's deprecation alias note attached to its own section.
Replace slash-separated section headers like
### lanemask_lt(lane_id) / lanemask_le(lane_id) / ...
with the same brace-expansion notation already used in the support matrices:
### lanemask_{lt,le,eq,gt,ge}(lane_id)
Applies to lanemask, all_true/any_true, reduce_{min,max},
reduce_all_{min,max}, segmented_reduce_{add,min,max}, inclusive_*,
exclusive_*. Cuts visual noise for the multi-op families and keeps the prose
style consistent with the tables.
The recent change to drop the explicit identity argument from
exclusive_min / exclusive_max means the wrapper now emits a dtype-typed
identity constant in IR (np.iinfo(dtype).max / .min, or +-inf for floats).
Pre-existing coverage only exercised i32 + floats, leaving three gaps:
- u32: np.iinfo(np.uint32).max = 4294967295 is above i32 max -- different
code path in make_constant_expr.
- u64: np.iinfo(np.uint64).max = 18446744073709551615 routes through
_clamp_unsigned_to_range's two's-complement branch (val > int64 max ->
val - (1<<64) = -1 in int64 transport). Highest-risk path.
- i64: wider int, exercises make_const_expr_int with int64 dtype.
Add _SCENARIOS_EXCLUSIVE_MINMAX = _SCENARIOS_I32_AND_FLOATS +
[(i64,5), (u32,5), (u64,5)] and use it for both _tiled tests. Replace the
inline int(np.iinfo(np.int32).max/min) literal with two small helpers
(_exclusive_min_lane0_identity / _exclusive_max_lane0_identity) that pick
the right np.iinfo width per dtype. Also extend _INT_DTYPES to include
qd.u32 so the existing got==expected comparison path covers it.
The full-subgroup test_subgroup_exclusive_{min,max} tests stay i32-only;
they just sanity-check that the bare wrapper matches the _tiled form at
log2_group_size(), which is dtype-agnostic.
Mirror the existing pure-Python deprecation tests for subgroup.barrier() and subgroup.memory_barrier() -- monkeypatch the forwarding target (here: ballot), call the deprecated name three times, and assert exactly one DeprecationWarning is emitted with the expected message text and that the predicate forwards through. Catches accidental message-text regressions and the warn-once flag wiring.
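The pattern under test can be reproduced with a minimal stand-in alias (the names and the 0b1011 result are illustrative, not the real quadrants.lang API):

```python
import warnings

def ballot(predicate):
    return 0b1011                        # stand-in forwarding target

_warned = False                          # per-alias one-shot guard

def ballot_full_subgroup(predicate):     # deprecated thin alias
    global _warned
    if not _warned:
        warnings.warn(
            "ballot_full_subgroup is deprecated; use ballot",
            DeprecationWarning, stacklevel=2)
        _warned = True
    return ballot(predicate)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")      # record every warning emitted
    results = [ballot_full_subgroup(True) for _ in range(3)]

assert results == [0b1011] * 3           # the predicate forwards through
assert len(caught) == 1                  # warned exactly once across 3 calls
assert "deprecated" in str(caught[0].message)
print("one DeprecationWarning across three calls")
```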
CI flagged line 66 as wrapped at 77 chars while sibling comment lines run 115-119; pull the next token onto line 66 so the paragraph uses the full width consistently.
# Conflicts:
#	docs/source/user_guide/atomics.md
71 occurrences across the file (` - ` style throughout) to keep the source ASCII-clean.
The empty inline asm ``__asm__ volatile("" : "+v"(byte))`` in
``amdgpu_cross_half_shuffle_i32`` pins ``byte`` to a VGPR so LLVM does
not fold the ``ds_bpermute`` call into a wave-uniform
``v_readlane_b32`` when ``target_lane`` is a compile-time constant.
clang validates the constraint against its current target before
emitting bitcode; x86 accepts ``v`` (SSE register) and amdgcn accepts
it (VGPR), but AArch64 rejects it outright with "invalid output
constraint", breaking the manylinux ARM build.
Restrict the gate to ``ARCH_amdgpu && (x86 || amdgcn)`` so ARM hosts
still compile ``runtime_amdgpu.bc``. The trade-off: ARM-built wheels'
AMDGPU bitcode loses the constant-``target_lane`` VGPR hint, but the
per-lane case (the common one) still emits a real ``ds_bpermute_b32``
because uniformity analysis sees per-lane inputs.
In ``_typed_min_identity`` / ``_typed_max_identity`` the ``npty`` variable is inferred by pyright as a union over every numpy scalar class that ``to_numpy_type`` can return (incl. ``np.float32`` / ``np.float64``). The earlier ``is_real(dtype)`` and ``npty is np.bool_`` guards narrow the runtime case to the integer types, but pyright cannot follow that through. Add an explicit ``assert issubclass(npty, np.integer)`` before the ``np.iinfo(npty)`` call so pyright sees a valid overload match.
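The assert-narrowing trick generalises beyond numpy. A stdlib-only sketch (`to_scalar_type` stands in for `to_numpy_type`, and `int` for the `np.integer` family):

```python
def to_scalar_type(name: str) -> type:
    # A static checker infers the return type as a union over every
    # value in this table, including the float entries.
    return {"i32": int, "u32": int, "f32": float, "f64": float}[name]

def int_type_max(name: str, bits: int = 32) -> int:
    npty = to_scalar_type(name)
    # Runtime guard that doubles as narrowing: after the assert, the
    # checker knows npty is an int subclass, so int-only overloads match
    # (the real fix asserts issubclass(npty, np.integer) before np.iinfo).
    assert issubclass(npty, int), f"expected an integer type, got {npty}"
    return (1 << (bits - 1)) - 1

print(int_type_max("i32"))  # 2147483647
```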
Summary
Note: this is BREAKING because we rename the tiled reduce operations to carry a `_tiled` suffix, and reintroduce the full-subgroup reductions (no `log2_size` parameter) without any suffix.
Adds a single, consistent set of new SIMT-subgroup query / inter-processor-communication ops to `qd.simt.subgroup`, all working portably across CUDA, AMDGPU, and SPIR-V (Vulkan / Metal):

- `reduce_min`, `reduce_max` (lane 0 of each `2**log2_size` group), `reduce_all_min`, `reduce_all_max` (broadcast to every lane). Lowered as `shuffle_down` tree / `shuffle_xor` butterfly patterns; same shape as `reduce_add` / `reduce_all_add`.
- `subgroup.ballot(predicate)` returns a `u32` bitmask (bit `i` set iff lane `i`'s predicate is non-zero). Lowered to `__ballot_sync` (CUDA), `v_ballot_b32` (AMDGPU), `OpGroupNonUniformBallot` (SPIR-V).
- `subgroup.segmented_reduce_add` / `segmented_reduce_min` / `segmented_reduce_max` `(value, head_flag, log2_size)` run a per-lane inclusive scan that resets at every non-zero `head_flag`, scoped to `2**log2_size` consecutive lanes. One `ballot` to materialise the head bitmask, one `clz` to find each lane's segment head, then a Hillis-Steele inclusive scan bounded by `distance >= offset`. Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size ops. As with `exclusive_min` / `exclusive_max`, the per-lane `distance >= offset` guard ensures the scan never crosses a segment boundary, so a partner from another segment is never combined with the local value.
- `subgroup.lanemask_lt(lane_id)` / `_le` / `_eq` / `_gt` / `_ge`: closed-form `u32` masks parametrised by a lane id, mirroring CUDA's `__lanemask_{lt,le,eq,gt,ge}` but generalised to take any `lane_id` (pass `invocation_id()` for the CUDA built-in form). Implemented as `@qd.func` arithmetic — no backend intrinsic, no shuffle, no ballot — so a per-lane-varying `lane_id` works the same as a uniform one. `lane_id` must be in `[0, 31]`; on AMDGPU CDNA wave64 the mask covers only the low 32 lanes (build a 64-bit mask from two `u32` ballots if needed).

Drive-by fixes (required by
segmented_reduce_*, but useful in their own right)

`qd.clz` is the first user of `clz` in the codebase, and exposed bugs on every backend:

- `__nv_clz` / `__nv_clzll` are declared on signed types but operate on the underlying bit pattern; route u32 / u64 through them so `qd.clz(u32(...))` no longer hits `QD_NOT_IMPLEMENTED`.
- `emit_extra_unary` had no `clz` case; map to LLVM's `Intrinsic::ctlz` with `is_zero_undef=0`.
- `FindSMsb` (#74, signed) and `FindUMsb` (#75, unsigned). The unsigned form is required for u32 / u64 inputs whose top bit may be set; `FindSMsb` is undefined for those (treats them as negative; "most-significant 0-bit" doesn't exist for `0xFFFFFFFF`). Cast the result back to `i32` before the `32 - msb - 1` subtraction so SPIR-V's strict-type `sub` is happy.

Stacking
This PR is stacked on top of #665 (`hp/cross-gpu-subgroup`).

It supersedes #600 (`hp/cross-gpu-ballot`), whose three commits are cherry-picked here unchanged. #600 can be closed once this lands (or once #665 lands, whichever is convenient).

Test plan

- `pyright` (project config) clean for new code; pre-existing errors in untouched files unchanged.
- `test_simt.py` (586 passed, 1 skipped) green.
- `test_simt.py` (567 passed, 20 skipped) green.
- `test_subgroup_*` (566 passed, 21 deselected) green.
- `find_underwrapped.py`.

Made with Cursor