fix: use DSB SY barrier and system-scope atomic load in fence for RDMA GPU visibility #3141

Open

vskiwi wants to merge 1 commit into ml-explore:main from vskiwi:fix/fence-dsb-sy-rdma

Conversation

@vskiwi (Contributor) commented Feb 18, 2026

Problem

Fence::wait in fence.metal can deadlock when the CPU updates the fence timestamp in a distributed (multi-node) setup using JACCL over Thunderbolt RDMA.

On ARM64, std::memory_order_seq_cst compiles to DMB ISH (Data Memory Barrier, Inner Shareable), which is only visible to CPU cores. The GPU sits outside the inner shareable domain and may never observe the updated fence timestamp — causing fence_wait to spin indefinitely.
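For illustration, a minimal standalone sketch of the two barrier flavors involved (assumptions: clang on AArch64; the function name is only for the example, not MLX code):

#include <arm_acle.h>  // ACLE __dsb()
#include <atomic>

void barrier_scopes() {
  // Lowered by clang on AArch64 to "dmb ish": an ordering barrier scoped to
  // the Inner Shareable domain, i.e. the CPU cores.
  std::atomic_thread_fence(std::memory_order_seq_cst);

  // "dsb sy" (operand 0xF = full system): a completion barrier visible to
  // every observer, including the GPU and DMA engines — what the fix uses.
  __dsb(0xF);
}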

This affects any workload using mlx.distributed with JACCL (Thunderbolt RDMA) on Apple Silicon. Single-node workloads are not affected because CPU–GPU communication goes through unified memory without the cross-domain visibility issue that RDMA DMA introduces.

Symptoms

On a 4× Mac Studio M3 Ultra 512 GB cluster (macOS 26.3, Thunderbolt 5 full-mesh RDMA):

  • Short contexts (< 10K tokens): stochastic deadlock — Fence::wait → condition_variable::wait, 100% CPU spin on all nodes
  • Long contexts (50K+ tokens): guaranteed crash — SIGABRT in mlx::core::gpu::check_error(MTL::CommandBuffer*) → abort()
  • Faulting thread: com.Metal.CompletionQueueDispatch (GPU command buffer error)
  • Main thread at crash: MetalAllocator::clear_cache() → BufferCache::clear() → [AGXG15XFamilyBuffer dealloc]
  • Remaining nodes: 100% CPU deadlock (all-reduce waiting for the crashed peer)

The crash is a race condition between GPU fence synchronization and Metal buffer lifecycle — the GPU reads a stale fence timestamp, proceeds with computation on a buffer that the CPU has already deallocated, and Metal's error checking triggers SIGABRT.

On macOS 26.2, the same race caused silent data corruption (AMCC interrupts visible in dmesg). macOS 26.3 added an assert that converts it to an explicit crash.

Note on the wired collector: @angeloskath mentioned that disabling the wired collector avoids deadlocks. This is consistent with our findings. The wired collector is the mechanism triggered by mx.set_wired_limit(). When wired memory exceeds the limit, MLX reclaims Metal buffers via MetalAllocator::clear_cache(). Our setup uses mx.set_wired_limit(max_recommended_working_set_size) — the standard pattern recommended by mlx-lm for large models. Without a wired limit, models like GLM-5 (767 GB) would cause excessive swap. Our crash stack confirms the connection: the collector deallocates a buffer while the GPU still holds a stale fence timestamp and continues computation on it. With the wired limit at default 0 (disabled), the collector doesn't run proactively and the race doesn't manifest — but this isn't viable for large distributed models. The DSB SY + system-scope atomic fix addresses the root cause.

Root cause

In fence.cpp, Fence::update() writes the fence counter:

f.cpu_value()[0] = count;

This plain store (or even seq_cst store) compiles to STLR + DMB ISH on ARM64. Per the ARM Architecture Reference Manual, DMB ISH ensures ordering only within the Inner Shareable domain — i.e., CPU cores. The GPU and DMA engines are in the Full System domain.

When RDMA DMA writes arrive in CPU-side memory, the fence value update is committed to L1/L2 cache but is not guaranteed to be visible to the GPU, which may be observing memory through a different cache hierarchy or through the System Level Cache (SLC).

In fence.metal, fence_wait spins on a volatile read:

while (1) {
    atomic_thread_fence(mem_device, seq_cst, thread_scope_system);
    if (timestamp[0] >= value) break;
}

If the GPU's cache holds a stale copy of timestamp, this loop never terminates — even though the CPU has already written the updated value.

Fix

Two changes, 24 lines total. Originally authored by @rltakashige — both the root cause analysis and the fix (CPU-side + GPU-side) in exo-explore/exo#1489 / #1515. We independently verified and tested on our cluster.

1. fence.cpp — use DSB SY after fence store

DSB SY (Data Synchronization Barrier, Full System) ensures all preceding memory accesses complete (not just ordered) and are visible to all observers including the GPU and DMA engines. The 0xF operand specifies full system scope — per ARM ACLE §7.4.

Why DSB SY specifically: the original DMB ISH only guarantees ordering within the Inner Shareable domain (CPU cores). It falls short in two ways — (1) DMB is an ordering barrier, not a completion barrier, and (2) ISH scope excludes the GPU. DSB SY fixes both: DSB ensures completion, SY ensures full system scope. Note: DSB ST (stores-only, full system) should theoretically suffice per the ARM spec, but was found insufficient on M3 Ultra during testing — likely due to Apple Silicon's implementation-specific cache-coherence behavior between CPU and GPU.
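A minimal sketch of the resulting CPU-side pattern (assumptions: the fence_update() wrapper and the uint32_t timestamp type are illustrative; the actual change lives inside Fence::update() in fence.cpp):

#include <arm_acle.h>  // __dsb(); AArch64 only
#include <atomic>
#include <cstdint>

void fence_update(std::atomic<uint32_t>* timestamp, uint32_t count) {
  // Release store: publishes the new fence value and orders all prior
  // buffer writes before it — but on arm64 this alone only gives
  // Inner Shareable (CPU-core) visibility guarantees.
  timestamp->store(count, std::memory_order_release);
  // DSB SY (0xF = full system): wait until the store has actually completed
  // and is visible to all observers, including the GPU and DMA engines.
  __dsb(0xF);
}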

2. fence.metal — add system-scope atomic load fallback

The GPU-side fix adds a two-tier strategy:

  1. Fast path (up to 1M iterations): volatile reads + atomic_thread_fence — works when GPU cache is coherent, zero overhead in the common case
  2. Fallback: __metal_atomic_load_explicit with __METAL_MEMORY_SCOPE_SYSTEM__ — forces a load from the System Level Cache, bypassing GPU-local caches. This breaks through stale cache lines that the volatile path can't resolve.

Uses #pragma METAL internals : enable, consistent with how MLX already uses Metal internals in its fence code.

Design note for reviewers: an alternative GPU-side approach would be to use __metal_atomic_load_explicit with system scope unconditionally (no fast path), which is simpler and provably correct. The two-tier strategy is a performance optimization — volatile reads are cheaper when the GPU cache is coherent, which is the common case. We defer to the MLX team's judgment on the right tradeoff here.

Verification

Tested on 4× Mac Studio M3 Ultra 512 GB (macOS 26.3, Thunderbolt 5 full-mesh RDMA).

Model: GLM-5-8bit-MXFP8 (754B parameters, 767 GB, 78 layers, MXFP8 quantization) — tensor parallel across 4 nodes via JACCL RDMA, with mx.set_wired_limit(max_recommended_working_set_size).

Before fix: 4-node tensor parallel was unstable — stochastic deadlocks on short contexts and SIGABRT crashes on long contexts (50K+ tokens). On 4 attempts at 50K context, 3 resulted in SIGABRT in check_error(MTL::CommandBuffer*) within the first minute; the remaining nodes entered 100% CPU deadlock waiting on all-reduce. Short-context inference (< 10K) would intermittently hang in Fence::wait.

After fix (CPU-side DSB SY only, applied to MLX 0.30.6): All 6 inference runs (2× short context + 2× 10K context + 2× 50K context) on 4 nodes completed successfully — 0 crashes, 0 deadlocks.

Note: our cluster testing applied only the CPU-side fix (fence.cpp). The GPU-side fix (fence.metal) was not included because the full fork is based on MLX mainline (newer than 0.30.6) and didn't compile against 0.30.6. The CPU-side fix alone resolved all observed deadlocks and crashes, but the GPU-side fix provides defense-in-depth against stale GPU cache reads.

RDMA integrity stress test: We developed a dedicated 4-node stress test that exercises all_sum and all_gather with varying tensor sizes (1 / 100 / 512 MB) and data patterns (ones / sequential / random), 3 rounds each — 54 collective operations total. This test reliably reproduced the deadlock before the fix. After applying DSB SY, all 54 operations passed with bit-perfect data integrity.
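For reference, a minimal sketch of the kind of integrity check the stress test performs, written against MLX's public C++ distributed ops (illustrative only, not the actual harness — tensor sizes, data patterns, rounds, and the JACCL launch setup are omitted):

#include "mlx/mlx.h"
#include "mlx/distributed/distributed.h"
#include "mlx/distributed/ops.h"
#include <cstdio>

namespace mx = mlx::core;

int main() {
  // The distributed backend (JACCL, MPI, ...) is chosen by the launcher/env.
  auto group = mx::distributed::init();
  const int n = 1 << 20;  // ~4 MB of float32 per collective
  auto x = mx::ones({n});
  // After all_sum over identical inputs, every element must equal world size.
  auto y = mx::distributed::all_sum(x, group);
  mx::eval({y});
  auto expected = mx::array(static_cast<float>(group.size()));
  bool ok = mx::all(mx::equal(y, expected)).item<bool>();
  std::printf("rank %d/%d: all_sum integrity %s\n",
              group.rank(), group.size(), ok ? "ok" : "MISMATCH");
  return 0;
}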

Performance impact: negligible. @rltakashige measured 267 vs 269 tok/s on Llama 3.2 1B (~10K context) — within noise.

Binary verification: otool -tv libmlx.dylib confirms dsb sy instruction immediately after stlr (store-release) in the fence update path on all 4 nodes.

Environment

  • macOS 26.3 (25D125)
  • Apple M3 Ultra (4 machines, 512 GB each)
  • Thunderbolt 5 full-mesh RDMA (JACCL backend)
  • MLX 0.30.6 (with DSB SY patch applied to fence.cpp)
  • Xcode 26.2, Metal Toolchain 17C48

Made with Cursor

Commit: fix: use DSB SY barrier and system-scope atomic load in fence for RDMA GPU visibility

On ARM64, std::memory_order_seq_cst compiles to DMB ISH (inner-shareable
barrier), which only ensures ordering within CPU cores. The GPU sits
outside this shareability domain and may never observe fence timestamp
updates — causing Fence::wait to deadlock or race with Metal buffer
deallocation (SIGABRT) when using JACCL distributed RDMA.

CPU-side (fence.cpp): replace plain store with atomic store + DSB SY
(full system barrier) to push the fence value to the point of coherence
visible to all observers including the GPU and DMA engines.

GPU-side (fence.metal): add a two-tier wait strategy — fast path with
volatile reads (works when GPU cache is coherent), then fallback to
__metal_atomic_load_explicit with system scope to force cache refresh
from the System Level Cache.

Verified on a 4x Mac Studio M3 Ultra 512 GB cluster (macOS 26.3,
Thunderbolt 5 full-mesh RDMA): resolved stochastic Fence::wait
deadlocks and SIGABRT crashes during multi-node tensor parallel
inference.

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Cursor <cursoragent@cursor.com>
@rltakashige (Contributor) commented Feb 18, 2026

As a follow-up to our testing: this does not fix the issue in all scenarios, although the hang is considerably less frequent. We'll continue trying to figure out what causes these GPU locks.

@awni (Member) commented Feb 19, 2026

Can you say more about the conditions that cause the hang? Ideally a repro with mlx_lm would be great, but if not, at least more details on which model(s), prompt lengths, machines, frequency of the hang, and where it hangs (during loading, prefill, generation).

@vskiwi (Contributor, Author) commented Feb 19, 2026

@awni Here are the details you asked for.

Reproduction conditions (via exo)

The hang reproduces reliably through exo, which uses mlx.distributed with JACCL for tensor parallel inference.

  • Model: GLM-5-8bit-MXFP8 (754B, 767 GB, 78 MoE layers)
  • Machines: 4× Mac Studio M3 Ultra 512 GB, macOS 26.3, Thunderbolt 5 full-mesh RDMA (JACCL)
  • Prompt lengths: any — deadlock on first inference request with short prompts (<100 tokens)
  • Where it hangs: during model weight loading (per-layer mx.eval with distributed sharding) — before any prompt processing begins
  • Frequency: deterministic on first request with upstream main + #3144 (no DSB SY). Adding __dsb(0xF) resolves it — 0 hangs across all tests (100 / 10K / 50K / 93K tokens)

What triggers it

exo's tensor parallel loading does per-layer eval_with_timeout — it calls mx.eval(layer.parameters()) on each of 78 layers individually during sharding. Each eval produces a separate command buffer commit with fence synchronization via JACCL RDMA. This creates ~78 rapid fence update/wait cycles during loading alone.

In contrast, mlx_lm.sharded_load() does a single mx.eval(model.parameters()) after sharding, which produces far fewer fence transitions. We tested upstream main + #3144 (without DSB SY) through mlx_lm.sharded_load on the same cluster and model — GLM-5 at 50K tokens completed without hang (141 tok/s prefill, 14.8 tok/s decode). The per-layer eval pattern in exo appears to be what exposes the fence visibility issue.

Summary

Configuration          | exo (per-layer eval)              | mlx-lm (bulk eval)
-----------------------|-----------------------------------|-------------------
main + #3144 only      | Deadlock on first request         | Pass (50K tokens)
main + #3144 + DSB SY  | Pass (all tests up to 93K tokens) | Pass

The DSB SY patch from #3141 is still needed for workloads with frequent per-operation fence cycles over JACCL RDMA. We can investigate further to find the minimal mlx-lm pattern that reproduces the hang if that would be helpful.

@vskiwi (Contributor, Author) commented Feb 19, 2026

Update on our testing. We ran extensive tests on upstream main (with #3144) without the DSB SY patch from this PR, on a clean cluster (4× M3 Ultra 512 GB, macOS 26.3, JACCL RDMA, GLM-5 754B tensor parallel):

  • mlx-lm sharded_load + generate: 50K tokens — pass
  • exo (full stack): 66K tokens — passed twice after a clean reboot

We were unable to reproduce the fence deadlock that we previously observed through exo with #3144 alone. Our earlier deadlock may have been caused by accumulated GPU/Metal state from prior test iterations rather than being reproducible from a clean state.

#3144 appears to be sufficient for our workload on a cleanly-booted cluster. We'll continue monitoring in production and report back if the deadlock recurs.

@rltakashige — are you still seeing GPU locks with upstream main + #3144 on your setup?
