[Atomics] New QIPC ops for atomics#690
Conversation
Pins the doc claim that vector / matrix arguments to qd.atomic_add fan out to one scalar atomic per component on every GPU backend. The existing test_atomic_float_ops only covers scalar ndarray destinations.
Pins the doc claim that qd.atomic_and / atomic_or / atomic_xor reject f32 / f64 dtypes at trace time. Parametrized over (op, dtype) on qd.gpu; f64 is gated through skip_if_f64_unsupported.
Pins the per-dtype doc table for atomic_add (i32 / u32 / i64 / u64) and atomic_and / atomic_or / atomic_xor (u32 / i64 / u64) on qd.gpu. Existing coverage was effectively i32-only for these paths.
Pins the doc claim that atomic_mul (CAS loop on every backend) works correctly under multi-thread contention and supports f64. Previous mul tests were all single-thread or used integer shared memory only.
Pins the doc claim that floating-point atomic_min / atomic_max behave as minNum / maxNum: when exactly one operand is NaN, the non-NaN value wins. Both arg orders covered for f32 and f64. CPU is left out of scope per the doc note about its order-dependent CAS-loop path.
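The minNum / maxNum rule being pinned can be modeled in plain Python. The helper below is a hypothetical illustration of the semantics only, not part of the qd API:

```python
import math

def min_num(a: float, b: float) -> float:
    # minNum semantics: if exactly one operand is NaN, the non-NaN operand wins.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

nan = float("nan")
assert min_num(nan, 2.0) == 2.0       # NaN in first position: non-NaN wins
assert min_num(2.0, nan) == 2.0       # NaN in second position: non-NaN wins
assert min_num(1.0, 2.0) == 1.0       # no NaN: ordinary min
assert math.isnan(min_num(nan, nan))  # both NaN: result is NaN
```

Covering both argument orders in the tests matters precisely because an order-dependent CAS-loop implementation (like the CPU path noted above) would pass one order and fail the other.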
Removes the arch=[qd.cpu, qd.cuda, qd.amdgpu] allowlist and the "# TODO(): Vulkan support" markers on the f16 atomic tests. The SPIR-V f16-atomic-cap gate is preserved via _skip_if_no_f16_atomic_on_spirv, matching the pattern in test_atomic.py::test_atomic_float_ops. CPU coverage of f16 atomics is retained by test_atomic_float_ops.
When run on CUDA today, qd.atomic_and(f32_field, ...) does not raise at trace time as the doc claims; it slides past AtomicOpExpression::type_check (which only checks PrimitiveType, not is_integral for the bit_* ops) and SIGABRTs inside launch_kernel. Skip the test with a TODO so the pytest worker survives, but keep the test body in place so it can be re-enabled the moment the frontend trace-time check is added.
Adds an is_integral check in AtomicOpExpression::type_check for
bit_and / bit_or / bit_xor so that qd.atomic_and / atomic_or /
atomic_xor on f32 / f64 raise QuadrantsTypeError at trace time
instead of sliding through codegen and SIGABRT'ing in launch_kernel
on CUDA. Matches the doc claim on docs/source/user_guide/atomics.md
("Integer dtypes only -- passing f32 / f64 raises a type error at
trace time").
Unskips test_atomic_bitwise_on_float_field_raises (6 cases).
Metal's RHI (quadrants/rhi/metal/metal_device.mm) sets `spirv_has_atomic_int64` on Apple7+ / Mac2 via the (misnamed) `feature_floating_point_atomics` gate, but MSL only ever exposes 64-bit atomics as `atomic_fetch_min/max` on `uint64`, starting at Apple GPU family 9 (M3 / A17). `atomic_add` / `sub` / `mul` and the bitwise family don't exist for 64-bit on any Apple GPU, so pipeline creation fails with `RhiResult=-1` and tears the worker down.

Skip the affected dtype rows for `test_atomic_add_int_contention` and `test_atomic_bitwise_int_widths` on Metal via a small helper, and add a footnote to the atomics doc table calling out the Metal limitation. Tightening / removing the over-permissive cap in `metal_device.mm` is deliberately out of scope here -- the cap is also consumed by adstack fallbacks in `runtime/gfx`, so it needs its own audit.

Verified on AMDGPU (MI300X): all 4 dtypes for atomic_add and all 3 dtypes x 3 ops for bitwise still PASS (no test gets skipped on non-Metal GPUs).
…labels
- Reflow seven multi-line `# Pins the doc claim ...` blocks in
test_atomic.py from ~70-82c (AI default) to ~110-115c (project 120c
target). No content changes; pure rewrap.
- Restore inline `# Parallel sum` / `# Serial sum` (and max / min)
labels in test_atomic_{add,max,min}_f16. They got dropped in
4fd39f8 when retargeting to qd.gpu but still describe the test
structure (parallel atomic accumulation vs serial reference loop).
…/ Metal

Wires a new `qd.atomic_exchange(dest, val)` primitive end-to-end: unconditionally writes `val` into `dest`, returns the old value of `dest`. Lowers to one native instruction on every backend (CUDA `atomicExch`, AMDGPU `*_atomic_swap`, SPIR-V `OpAtomicExchange`, x86 `xchg`).

Layers touched:
- `AtomicOpType::xchg` enum + name table; carve-out in `atomic_to_binary_op_type` (xchg has no binary equivalent).
- pybind binding `expr_atomic_xchg` and Python wrapper `qd.atomic_exchange` with docstring + example.
- `demote_atomics`: special-case xchg to load + store(val) instead of going through the load + binary + store path that the other ops use.
- LLVM codegen (covers CPU / CUDA / AMDGPU): one new entry in the AtomicRMW BinOp map for integers; explicit AtomicRMW Xchg case for f32 / f64 in `real_type_atomic`.
- SPIR-V codegen (Vulkan / Metal): `OpAtomicExchange` branch in the integer path, plus a uint-bitcast path for f32 / f64 global-memory xchg that avoids any `spirv_has_atomic_float_*` cap dependency.

Coverage gaps deliberately left for follow-up (called out in docs/source/user_guide/atomics.md):
- `f16` xchg (would need an f16-specific bitcast in real_type_atomic).
- Workgroup-shared float xchg in SPIR-V (would need uint-backing analogous to the existing add CAS path).
- i64 / u64 xchg on Metal (same hardware limitation as the rest of the 64-bit integer atomic family; the existing `_skip_if_no_int64_atomic_rmw` helper covers it).

Tests (parametrized on `qd.gpu`, all green on CUDA, Vulkan, and AMDGPU -- Metal verification deferred to the user's MacBook):
- `test_atomic_exchange_returns_old_value` -- single-thread sanity over i32 / u32 / i64 / u64 / f32 / f64.
- `test_atomic_exchange_swap_under_contention` -- N threads each xchg a unique value; assert no value is lost or duplicated (multiset invariant).
- `test_atomic_exchange_vector_field_fanout` -- vector arg fans out per component, last-writer-wins.
Doc: new row in the support table with footnotes for the gaps above, new `qd.atomic_exchange` semantics section with a take-ownership example, and an updated note relating xchg to the still-not-exposed `atomic_cas`.
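The multiset invariant checked by the contention test can be sketched in plain Python, with a lock standing in for the hardware atomic; all names below are illustrative, not qd APIs:

```python
import threading

class Slot:
    """A memory cell with atomic exchange; the lock stands in for the hardware atomic."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def exchange(self, new_value):
        # Unconditionally store new_value and return the previous contents.
        with self._lock:
            old, self._value = self._value, new_value
            return old

slot = Slot(-1)
observed = []
obs_lock = threading.Lock()

def worker(tid):
    old = slot.exchange(tid)  # each thread swaps in its unique id
    with obs_lock:
        observed.append(old)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Multiset invariant: the initial value plus every id written is seen exactly
# once across the observed old values and the final slot contents.
final = slot.exchange(None)
assert sorted(observed + [final]) == [-1] + list(range(8))
```

Because each exchange atomically hands the previous value to exactly one thread, no interleaving can lose or duplicate a value -- which is exactly the property the GPU test asserts.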
test_api[qd] in tests/python/test_api.py is a sorted-equality check of `dir(quadrants)` against a hand-maintained allow-list. Forgot to add `atomic_exchange` to that list when wiring the new op, so the manylinux wheel-test job now fails with `+ 'atomic_exchange'`.
…test

Two small follow-ups to the atomic_exchange PR:
- demote_atomics.cpp: the "no binary equivalent" comment got wrapped at ~80c by the original edit. Reflow to one ~115c line so the line-wrap CI bot stops flagging it.
- test_atomic_exchange_matrix_field_fanout: pins the doc claim that "Vector / matrix arguments fan out per component" for the matrix case too. Mirrors test_atomic_add_matrix_field_fanout exactly; the xchg vector test (test_atomic_exchange_vector_field_fanout) is already green on CUDA / Vulkan / AMDGPU, and the matrix version walks the same OpAtomicExchange-per-component lowering with a 2x3 slot.
The xchg branch in DemoteAtomics::visit(AtomicOpStmt*) (introduced in commit 82d0572) is structurally different from every other atomic op: it lowers to a bare load + store(val) instead of load + binop + store, because the new value is independent of the old. Coverage bot flagged it as untested; the existing field-target xchg tests don't exercise it because field locations are never demoted. test_atomic_exchange_demoted mirrors test_atomic_add_demoted exactly (thread-local destination forces the demote pass to fire), and pins the demoted semantics: first xchg returns the initial value, second xchg returns the value the first xchg wrote, proving the demoted load + store round-trips correctly.
Wrap-check bot flagged three blocks that were wrapped at ~102-105c when they could pack to ~115c without exceeding the 120c project target:
- ops.py atomic_exchange docstring (was 102/103/X -> now 117/111/66)
- test_atomic.py xchg vector-fanout comment (was 105/X/X -> 118/106/93)
- codegen_llvm.cpp xchg-FP comment (was 105/106/X -> 117/115/X)

No semantic changes.
Comment inside the take-ownership atomic_exchange example was wrapped at ~79 chars (AI default), tripping the wrap-check bot. Rejoin onto a ~95-char line; whole comment now fits on two natural-sentence lines.
* expression_printer.h: add the missing `atomic_xchg` entry to `names_table` so AtomicOpType::xchg (= 8) no longer reads past the end of the array, and flip the bounds check from `>` to `>=` to make it correct for any future op (e.g. cas). Also fix a pre-existing latent bug where the `min` and `max` slots were swapped, so IR dumps now print the right op name.
* stmt_op_types.cpp: document that `atomic_to_binary_op_type` intentionally has no entry for `mul` or `xchg` (they have no native binary form and callers MUST special-case them before dispatching here -- see demote_atomics.cpp).
* spirv_codegen.cpp: promote the unsupported xchg fallbacks (shared float arrays, global f16) from a generic QD_NOT_IMPLEMENTED hard abort to clearer QD_ASSERT_INFO diagnostics that name the missing path.
# Conflicts:
#	docs/source/user_guide/atomics.md
#	tests/python/test_atomic.py
Wires a new `qd.atomic_cas(dest, expected, desired)` primitive across
all backends. Returns the value originally at `dest`; the user
recovers success with `(returned == expected)`. Matches CUDA
`atomicCAS` and SPIR-V `OpAtomicCompareExchange` shape.
CAS is the basic primitive on top of which arbitrary atomic
read-modify-write operations can be built with a retry loop -- the
xchg PR exposed unconditional swap, this one exposes the conditional
form.
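The conditional-swap contract described above -- return the value originally at `dest`, recover success via `returned == expected` -- can be modeled in plain Python. This is an illustrative sketch with a lock modeling hardware atomicity, not the qd implementation:

```python
import threading

class Cell:
    """A memory cell with compare-and-swap; the lock models hardware atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, desired):
        # Store desired only if the cell currently holds expected;
        # always return the value that was originally in the cell.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old

cell = Cell(0)
old = cell.compare_and_swap(0, 42)
success = (old == 0)               # the documented success-recovery pattern
assert success

old = cell.compare_and_swap(0, 7)  # stale expected: the swap does not happen
assert old == 42                   # returns the current value; cell unchanged
```

Returning the prior value (rather than a boolean) matches the CUDA `atomicCAS` / SPIR-V `OpAtomicCompareExchange` shape and is strictly more informative: a loser learns the current contents for free, which is what makes retry loops cheap.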
IR plumbing (the only layer where CAS is structurally novel: it's
the first atomic op with three operands):
- AtomicOpType::cas enum + name table; intentionally NOT registered
in atomic_to_binary_op_type (no binary equivalent, like xchg).
- AtomicOpStmt grows an optional `Stmt *expected` field (default
nullptr, ignored by every other op_type) plus a 4-arg constructor.
Existing 3-arg constructor is unchanged so all callers keep
compiling. expected is added to QD_STMT_DEF_FIELDS so cloning /
serialization round-trips it.
- AtomicOpExpression similarly grows an `expected` Expr slot + 4-arg
constructor. type_check requires integer dtypes for all three
operands (first-pass scope; f32/f64 CAS via uint-bitcast deferred,
same gap xchg has). flatten threads the third operand into the
4-arg AtomicOpStmt constructor for the cas case.
Bindings + wrapper:
- pybind: m.def("expr_atomic_cas", ...) -> AtomicOpExpression(cas, ...).
- ops.py: new qd.atomic_cas(x, expected, desired) wrapper. Doesn't
fit @writeback_binary (3 args), so the lvalue check + wrap_if_not_expr
are inlined. Added to __all__.
- test_api.py: atomic_cas in the public-API allow-list.
Codegen:
- LLVM (CPU / CUDA / AMDGPU): one new branch in integral_type_atomic
that emits CreateAtomicCmpXchg and projects field 0 (loaded prior
value) -- single instruction on every target.
- SPIR-V (Vulkan / Metal): new CAS branch in the integer path emits
OpAtomicCompareExchange directly (operands: scope, sem_eq, sem_neq,
value, comparator -- comparator order is from the SPIR-V spec, not
intuitive). The existing internal uses of CAS (atomic_op_using_cas
and the shared-array float CAS path) confirm both backends already
produce the right machine code; the only new work was the public path.
demote_atomics:
- thread-local CAS demotes to load + cmp_eq + select(cmp, val, load) +
store. Materializing select-then-unconditional-store avoids needing
to rewrite into an IfStmt mid-pass. Returned value is the prior
load, matching the contended-path semantics.
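The demoted shape described above can be written out as a plain-Python model (a dict stands in for the thread-local destination; names are illustrative):

```python
def demoted_cas(memory, key, expected, desired):
    """Single-thread lowering of CAS: load + cmp_eq + select + unconditional store.

    Mirrors the demote_atomics shape described above (illustrative Python model,
    not the actual pass).
    """
    old = memory[key]                       # load
    cond = (old == expected)                # cmp_eq
    memory[key] = desired if cond else old  # select, then unconditional store
    return old                              # prior load, matching contended-path semantics

mem = {"x": 5}
assert demoted_cas(mem, "x", 5, 9) == 5 and mem["x"] == 9  # success leg: swap fires
assert demoted_cas(mem, "x", 5, 1) == 9 and mem["x"] == 9  # failure leg: store is a no-op
```

The unconditional store of `select(cond, val, old)` is semantically a no-op on the failure leg, which is what lets the pass avoid synthesizing an IfStmt.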
Coverage gaps deliberately left for follow-up (will be called out in
docs/source/user_guide/atomics.md in the next commit):
- f32 / f64 CAS (would need the same uint-bitcast path xchg uses; the
type_check carve-out raises a clean QuadrantsTypeError today).
- f16 CAS (same f16-deferral as f16 xchg).
- Workgroup-shared float CAS (same uint-backing requirement as shared
float xchg/add).
- i64 / u64 CAS on Metal (same hardware limitation as the rest of the
64-bit integer atomic family; the existing _skip_if_no_int64_atomic_rmw
helper will cover it once the test lands).
Tests + docs land in follow-up commits on this branch.
Five new tests parametrized on qd.gpu (i64 / u64 covered by the existing _skip_if_no_int64_atomic_rmw helper for Metal):
- test_atomic_cas_returns_old_value -- single-thread sanity, success path AND failure path, all integer dtypes (i32 / u32 / i64 / u64).
- test_atomic_cas_contention_single_winner -- N parallel threads each try to flip a slot from INIT to their own unique id; assert exactly one thread sees prior == INIT (the winner), every loser sees prior == winner_id, and the slot ends with winner_id (no torn writes). The canonical "single CAS winner" invariant.
- test_atomic_cas_loop_increment_matches_atomic_add -- a user-built CAS retry loop produces the same final counter as atomic_add. The outer for is serialized (qd.loop_config(serialize=True)) so the bounded retry budget is robust; the point is to validate the user-facing CAS-loop pattern, not to stress-test convergence.
- test_atomic_cas_demoted -- thread-local destination forces the demote pass to fire (parallel to test_atomic_exchange_demoted). Pins both legs: the success-path swap fires, the failure path is a no-op, and both return the prior load.
- test_atomic_cas_on_float_field_raises -- pins the doc claim that f32 / f64 CAS raises QuadrantsCompilationError at trace time. Will flip to passing when the float CAS bitcast path lands.

Doc:
- New row in the support table for atomic_cas with the f32 / f64 carve-out (footnote §) and the same i64 / u64 Metal caveat (†) as the rest of the family.
- New § footnote explaining that float CAS is rejected at trace time but trivially enableable via the same uint-bitcast trick xchg uses.
- The "There is no atomic_cas" paragraph is dropped (CAS is now exposed) and replaced with a Semantics section showing the load + cmp + conditional-store transaction, the success-flag recovery pattern (`success = (old == expected)`), and a CAS-loop example for building atomic ops the framework doesn't expose natively.
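The CAS-loop-vs-atomic_add equivalence that the retry-loop test pins can be demonstrated in plain Python (a lock stands in for the hardware CAS; all names are illustrative):

```python
import threading

_lock = threading.Lock()
counter = [0]

def cas(cell, expected, desired):
    # The lock stands in for a hardware CAS; returns the prior value.
    with _lock:
        old = cell[0]
        if old == expected:
            cell[0] = desired
        return old

def cas_loop_increment(cell):
    # The user-facing retry loop under test: reread and retry until the swap lands.
    while True:
        old = cell[0]
        if cas(cell, old, old + 1) == old:
            return

def worker():
    for _ in range(1000):
        cas_loop_increment(counter)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter[0] == 4000  # same final count an atomic_add would produce
```

The stale read of `cell[0]` before the CAS is harmless: a lost race just means one more trip around the loop, never a lost increment.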
Follow-on to the printer/carve-out cleanup on the parent xchg PR:
* Adds the `atomic_cas` entry to `names_table` (slot 9) so dumping a CAS expression prints the right op name (and stays in-bounds w.r.t. the new `>=` guard).
* Extends the visit method to also emit `expected=...` for cas, since cas is the only AtomicOpExpression with a 3rd operand.
* Updates the `atomic_to_binary_op_type` carve-out comment to mention `cas` alongside `mul` and `xchg` -- cas takes (dest, expected, desired), so it has no plain binary form either.
Per review: don't promise/preview the wiring for f32 / f64 atomic_cas in the user guide; just document the current trace-time rejection and keep the rest of the integer-CAS support note unchanged.
"fully supported, and the i64 / u64 Metal caveat ..." was self-contradictory: if it's fully supported, there's no caveat. Spell out that support follows the per-backend table above, with the same Metal i64 / u64 caveat as the rest of the 64-bit integer family.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c2ac851e3
```cpp
auto expected_dtype = expected.get_rvalue_type();
if (expected_dtype->is<TensorType>()) {
  expected_dtype = expected_dtype.get_element_type();
}
```
Reject or scalarize tensor CAS operands
When atomic_cas is called on a vector/matrix lvalue, this accepts tensor operands by peeling expected down to its element type, but the existing tensor scalarization paths only split dest and val and recreate AtomicOpStmt(stmt->op_type, ..., val_i) without carrying expected. The resulting scalar CAS statements have expected == nullptr, so the new CAS codegen/demotion assertions fire for inputs like qd.atomic_cas(vec_field[None], qd.Vector([0, 0]), qd.Vector([1, 1])). Until the scalarizers are updated to split and pass expected_i, tensor CAS should be rejected here.
```cpp
if (!is_integral(expected_dtype)) {
  ErrorEmitter(
      QuadrantsTypeError(), this,
      fmt::format("'atomic_cas' requires integer 'expected' operand, got '{}'", expected->ret_type->to_string()));
```
Cast CAS expected to the destination type
This only checks that expected is integral, but it never casts it to the pointee type. irpass::type_check will cast val through type_check_store, while expected is passed directly to LLVM CreateAtomicCmpXchg / SPIR-V OpAtomicCompareExchange, both of which require the comparator to have the same type as the value in memory. Common documented calls such as qd.atomic_cas(i64_field[None], 0, 1) leave expected as the default i32, causing invalid IR/codegen instead of behaving like the other atomic RHS operands.
alanray-tech left a comment

Review
The two issues flagged by Codex are both real and must be fixed before merge:
1. Tensor CAS scalarization will crash (vector/matrix atomic_cas)
type_check in frontend_ir.cpp:982-984 accepts tensor-typed expected (peeling it to element type), but neither scalarize.cpp:531 nor lower_matrix_ptr.cpp:419 propagates expected when splitting atomic ops into per-component statements -- they both use the 3-arg AtomicOpStmt constructor which sets expected = nullptr. Any call like qd.atomic_cas(vec_field[None], Vector([0, 0]), Vector([1, 1])) will hit QD_ASSERT(stmt->expected != nullptr) in codegen/demotion and crash.
Fix: Either reject tensor destinations for CAS in type_check (smallest contact area), or update both scalarizers to split and thread expected_i into the 4-arg constructor. The former is safer for a first landing.
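A toy fanout in plain Python (hypothetical names; not the actual scalarizer) shows why the 3-arg path silently breaks CAS and how a 4-arg path would thread `expected` through per component:

```python
def scalarize_atomic(op, dest_vec, val_vec, expected_vec=None):
    # Fan a vector atomic out to one scalar statement per component. The bug
    # described above corresponds to always emitting the 3-arg form, which
    # leaves the per-component expected slot as None for CAS.
    stmts = []
    for i, (dest_i, val_i) in enumerate(zip(dest_vec, val_vec)):
        expected_i = expected_vec[i] if expected_vec is not None else None
        stmts.append((op, dest_i, val_i, expected_i))
    return stmts

# Non-CAS ops legitimately carry no expected operand ...
assert scalarize_atomic("add", ["d0"], [3]) == [("add", "d0", 3, None)]
# ... but CAS must keep its per-component comparator.
assert scalarize_atomic("cas", ["d0", "d1"], [1, 1], [0, 9]) == [
    ("cas", "d0", 1, 0),
    ("cas", "d1", 1, 9),
]
```

Rejecting tensor CAS at type_check (the first fix suggested above) sidesteps the problem until both real scalarizers grow the equivalent of the `expected_vec` parameter.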
2. CAS expected operand is never type-cast to the pointee type
type_check validates that expected is integral but never inserts a cast to match the destination type. Meanwhile val gets implicit widening via type_check_store in irpass::type_check (type_check.cpp:74). A common call like qd.atomic_cas(i64_field[None], 0, 1) leaves expected as i32 while memory holds i64, producing invalid IR for both LLVM CreateAtomicCmpXchg and SPIR-V OpAtomicCompareExchange (both require all operands to have the same type as the value in memory).
Fix: Add a type_check_store-equivalent for stmt->expected in the AtomicOpStmt visitor in type_check.cpp, or cast expected alongside val in AtomicOpExpression::type_check.
A few additional suggestions are left as inline comments.
```cpp
}

AtomicOpExpression(AtomicOpType op_type, const Expr &dest, const Expr &expected, const Expr &val)
    : op_type(op_type), dest(dest), val(val), expected(expected) {
```
Nit: the member initializer list order dest(dest), val(val), expected(expected) doesn't match the parameter order (dest, expected, val). C++ initializes in declaration order (dest, val, expected), so runtime behavior is correct, but the mismatch between parameter semantics and init order is a readability trap for maintainers. Consider either reordering the member declarations to dest, expected, val or swapping the parameter order to (dest, val, expected) to make the intent obvious.
```cpp
ir_->make_inst(spv::OpMemoryBarrier, ir_->const_i32_one_,
               ir_->uint_immediate_number(ir_->u32_type(), spv::MemorySemanticsAcquireReleaseMask |
                                                               spv::MemorySemanticsUniformMemoryMask));
```
This OpMemoryBarrier is inserted before the CAS but none of the other atomic ops (including xchg, atomic_add, atomic_min, etc.) have it. OpAtomicCompareExchange already provides its own memory semantics via the sem_eq / sem_neq operands (which are const_i32_zero_ here -- Relaxed). If the barrier is genuinely needed for CAS correctness, it should also be needed for the other atomic ops (they all use Relaxed semantics too). If it's not needed, it adds an unnecessary performance cost and an asymmetry that will confuse readers. Please either remove it or add a comment explaining why CAS specifically needs an extra barrier that no other atomic op does.
```python
if old == expected:
    x[()] = desired
return old
```

```python
if not (is_quadrants_expr(x) and x.ptr.is_lvalue()):
```
Minor: @writeback_binary (used by all other atomic ops) has an extra guard if isinstance(a, Field) or isinstance(b, Field): return NotImplemented that lets Python's operator protocol fall through cleanly. atomic_cas inlines the lvalue check but skips this Field guard. Since atomic_cas is a regular function call (not an operator), the NotImplemented path isn't strictly needed, but if someone passes a raw Field instead of field[None], they'll get a confusing AttributeError from x.ptr rather than the clear QuadrantsSyntaxError. Consider adding a similar early check or at least a guard like if isinstance(x, Field): raise QuadrantsSyntaxError(...).
# Conflicts:
#	docs/source/user_guide/atomics.md
Two related P1 fixes from PR #690 review (Codex + alanray-tech):
1. Reject tensor (vector / matrix) destinations explicitly in AtomicOpExpression::type_check. The other atomic ops fan out via scalarize / lower_matrix_ptr, but those passes use the 3-arg AtomicOpStmt constructor and would silently drop `expected`, tripping QD_ASSERT(stmt->expected) in codegen / demotion. Until the scalarizers grow a 4-arg path, refusing tensor CAS at trace time is the safest landing.
2. Cast `expected` to the destination element type in the same type_check carve-out. LLVM CreateAtomicCmpXchg and SPIR-V OpAtomicCompareExchange both require the comparator width to match the in-memory value; without this, a common call like `qd.atomic_cas(i64_field[None], 0, 1)` would leave `expected` at the i32 default and produce invalid IR. Mirrors how `val` gets widened in irpass::type_check::type_check_store, which doesn't see `expected`.

Tests added:
* test_atomic_cas_on_vector_field_raises - pins the tensor reject.
* test_atomic_cas_expected_int_literal_widens_to_i64 - pins that plain Python int literals work as the comparator on i64.
…change Per PR #690 review (alanray-tech): no other atomic op in this file emits an OpMemoryBarrier alongside its OpAtomicI* / OpAtomicExchange, and OpAtomicCompareExchange already carries its own sem_eq / sem_neq operands (which we set to Relaxed, matching the rest of the file). The extra barrier was an asymmetric copy from elsewhere; removing it keeps the surrounding-memory ordering contract consistent with every other atomic - users pair atomics with qd.simt.block.mem_fence() / qd.simt.grid.mem_fence() if they need cross-location ordering.
Per PR #690 review (alanray-tech): the 4-arg constructor takes `(dest, expected, val)` but member declaration was `dest, val, expected`. Behavior was correct (C++ initializes in declaration order regardless), but the mismatch was a maintainer trap and clang-tidy normally flags this kind of init-list-out-of-order. Reorder the members to `dest, expected, val` so both the 4-arg ctor's init list and the parameter order read top-to-bottom in the same direction. The 3-arg ctor still leaves `expected` empty via Expr's default constructor.
Per PR #690 review (alanray-tech): @writeback_binary (used by every other atomic op) returns NotImplemented when called on a raw Field, which lets Python's operator protocol fall through cleanly and gives the user a clean error. atomic_cas takes 3 args so it doesn't use @writeback_binary; previously, passing a Field (rather than `field[None]`) would crash later with a confusing AttributeError on x.ptr. Add an explicit isinstance(x, Field) check that raises a clear QuadrantsSyntaxError pointing the user at the correct call shape. Test added: test_atomic_cas_raw_field_raises_clear_error.
Three new tests broadening atomic_cas coverage beyond the qd.field surface, in response to a follow-up review on PR #690:
* test_atomic_cas_on_ndarray: pin that CAS works on qd.ndarray elements, not just qd.field. Ndarrays go through a distinct physical-pointer subscript path. Uses an N-thread contention kernel so atomicity gets exercised on the ndarray surface too.
* test_atomic_cas_on_vector_field_scalar_component: pin that the documented workaround for the new tensor-CAS rejection -- extract a scalar with `vec_field[None][i]` and CAS on it -- actually works. Without this the docs would lie.
* test_atomic_cas_expected_signed_int_literal_casts_to_unsigned: pin that the new `expected` cast in AtomicOpExpression::type_check handles signed -> unsigned conversion (i.e. -1 -> 0xFFFF...). Complements the i64 widening test, which only covers positive same-sign widening.
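The signed-to-unsigned conversion the last test pins is plain two's-complement reinterpretation, which can be reproduced with bit masking (illustrative of the cast semantics, not the qd code path):

```python
def to_unsigned(value: int, bits: int) -> int:
    # Two's-complement reinterpretation: -1 becomes the all-ones bit pattern.
    return value & ((1 << bits) - 1)

assert to_unsigned(-1, 32) == 0xFFFFFFFF
assert to_unsigned(-1, 64) == 0xFFFFFFFFFFFFFFFF
assert to_unsigned(5, 32) == 5  # non-negative values are unchanged
```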