# fix: use DSB SY barrier and system-scope atomic load in fence for RDMA GPU visibility #3141

vskiwi wants to merge 1 commit into `ml-explore:main`
Conversation
On ARM64, `std::memory_order_seq_cst` compiles to `DMB ISH` (inner-shareable barrier), which only ensures ordering within CPU cores. The GPU sits outside this shareability domain and may never observe fence timestamp updates, causing `Fence::wait` to deadlock or race with Metal buffer deallocation (`SIGABRT`) when using JACCL distributed RDMA.

CPU-side (`fence.cpp`): replace the plain store with an atomic store + `DSB SY` (full-system barrier) to push the fence value to the point of coherence visible to all observers, including the GPU and DMA engines.

GPU-side (`fence.metal`): add a two-tier wait strategy — a fast path with volatile reads (works when the GPU cache is coherent), then a fallback to `__metal_atomic_load_explicit` with system scope to force a cache refresh from the System Level Cache.

Verified on a 4× Mac Studio M3 Ultra 512 GB cluster (macOS 26.3, Thunderbolt 5 full-mesh RDMA): resolved stochastic `Fence::wait` deadlocks and `SIGABRT` crashes during multi-node tensor parallel inference.

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Cursor <cursoragent@cursor.com>
As a follow-up to our testing: this does not fix the issue in all scenarios, although the issue is now considerably less frequent. We'll continue trying to figure out what causes these GPU locks.
Can you say more about the conditions that cause the hang? Ideally a repro with mlx_lm would be great, but if not, at least more details: which model(s), prompt lengths, machines, frequency of the hang, and where it hangs (during loading, prefill, or generation).
@awni Here are the details you asked for.

**Reproduction conditions (via exo)**

The hang reproduces reliably through exo, which uses

**What triggers it**

exo's tensor parallel loading does per-layer

In contrast,

**Summary**

The DSB SY patch from #3141 is still needed for workloads with frequent per-operation fence cycles over JACCL RDMA. We can investigate further to find the minimal mlx-lm pattern that reproduces the hang if that would be helpful.
Update on our testing. We ran extensive tests on upstream main (with #3144) without the DSB SY patch from this PR, on a clean cluster (4× M3 Ultra 512 GB, macOS 26.3, JACCL RDMA, GLM-5 754B tensor parallel).

We were unable to reproduce the fence deadlock that we previously observed through exo with #3144 alone. Our earlier deadlock may have been caused by accumulated GPU/Metal state from prior test iterations rather than a clean-state issue. #3144 appears to be sufficient for our workload on a cleanly booted cluster. We'll continue monitoring in production and report back if the deadlock recurs.

@rltakashige — are you still seeing GPU locks with upstream main + #3144 on your setup?
## Problem

`Fence::wait` in `fence.metal` can deadlock when the CPU updates the fence timestamp in a distributed (multi-node) setup using JACCL over Thunderbolt RDMA.

On ARM64, `std::memory_order_seq_cst` compiles to `DMB ISH` (Data Memory Barrier, Inner Shareable), which is only visible to CPU cores. The GPU sits outside the inner-shareable domain and may never observe the updated fence timestamp, causing `fence_wait` to spin indefinitely.

This affects any workload using `mlx.distributed` with JACCL (Thunderbolt RDMA) on Apple Silicon. Single-node workloads are not affected because CPU–GPU communication goes through unified memory, without the cross-domain visibility issue that RDMA DMA introduces.

## Symptoms
On a 4× Mac Studio M3 Ultra 512 GB cluster (macOS 26.3, Thunderbolt 5 full-mesh RDMA):
- `Fence::wait` → `condition_variable::wait`, 100% CPU spin on all nodes
- `SIGABRT` in `mlx::core::gpu::check_error(MTL::CommandBuffer*)` → `abort()` on `com.Metal.CompletionQueueDispatch` (GPU command buffer error)
- `MetalAllocator::clear_cache()` → `BufferCache::clear()` → `[AGXG15XFamilyBuffer dealloc]`

The crash is a race between GPU fence synchronization and the Metal buffer lifecycle: the GPU reads a stale fence timestamp, proceeds with computation on a buffer that the CPU has already deallocated, and Metal's error checking triggers `SIGABRT`.

On macOS 26.2, the same race caused silent data corruption (AMCC interrupts visible in `dmesg`). macOS 26.3 added an assert that converts it to an explicit crash.

Note on the wired collector: @angeloskath mentioned that disabling the wired collector avoids deadlocks. This is consistent with our findings. The wired collector is the mechanism triggered by `mx.set_wired_limit()`. When wired memory exceeds the limit, MLX reclaims Metal buffers via `MetalAllocator::clear_cache()`. Our setup uses `mx.set_wired_limit(max_recommended_working_set_size)`, the standard pattern recommended by mlx-lm for large models; without a wired limit, models like GLM-5 (767 GB) would cause excessive swap. Our crash stack confirms the connection: the collector deallocates a buffer while the GPU still holds a stale fence timestamp and continues computation on it. With the wired limit at its default of 0 (disabled), the collector doesn't run proactively and the race doesn't manifest, but that isn't viable for large distributed models. The DSB SY + system-scope atomic fix addresses the root cause.

## Root cause
In `fence.cpp`, `Fence::update()` writes the fence counter:

```cpp
f.cpu_value()[0] = count;
```

This plain store (or even a `seq_cst` store) compiles to `STLR` + `DMB ISH` on ARM64. Per the ARM Architecture Reference Manual, `DMB ISH` ensures ordering only within the Inner Shareable domain, i.e., the CPU cores. The GPU and DMA engines are in the Full System domain.

When RDMA DMA writes arrive in CPU-side memory, the fence value update is committed to L1/L2 cache but is not guaranteed to be visible to the GPU, which may be observing memory through a different cache hierarchy or through the System Level Cache (SLC).
In `fence.metal`, `fence_wait` spins on a volatile read of the fence timestamp. If the GPU's cache holds a stale copy of `timestamp`, this loop never terminates, even though the CPU has already written the updated value.

## Fix
Two changes, 24 lines total. Originally authored by @rltakashige — both the root cause analysis and the fix (CPU-side + GPU-side) in exo-explore/exo#1489 / #1515. We independently verified and tested on our cluster.
### 1. `fence.cpp` — use DSB SY after fence store

`DSB SY` (Data Synchronization Barrier, Full System) ensures that all preceding memory accesses complete (not merely that they are ordered) and are visible to all observers, including the GPU and DMA engines. The `0xF` operand specifies full-system scope, per ARM ACLE §7.4.

Why `DSB SY` specifically: the original `DMB ISH` only guarantees ordering within the Inner Shareable domain (CPU cores). It is insufficient here for two reasons: (1) DMB is an ordering barrier, not a completion barrier, and (2) ISH scope excludes the GPU. `DSB SY` fixes both: DSB ensures completion, SY ensures full-system scope. Note: `DSB ST` (stores only, full system) should theoretically suffice per the ARM spec, but it was found insufficient on M3 Ultra during testing, likely due to Apple Silicon's implementation-specific cache-coherence behavior between the CPU and GPU.
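A minimal sketch of the CPU-side pattern, assuming a hypothetical `publish_fence` helper (MLX's actual `Fence::update` differs; the non-ARM branch exists only so the sketch builds everywhere):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper illustrating the fix: publish a fence timestamp so
// that observers outside the CPU's inner-shareable domain (GPU, DMA
// engines) can see it. Not MLX's exact code.
inline void publish_fence(std::atomic<uint64_t>& fence, uint64_t count) {
    // Atomic store; on ARM64 this lowers to a store-release (STLR).
    fence.store(count, std::memory_order_seq_cst);
#if defined(__aarch64__)
    // DSB SY: a *completion* barrier with *full-system* scope. Unlike the
    // inner-shareable DMB that seq_cst alone implies, it waits until the
    // store reaches the point of coherence shared with the GPU and DMA
    // engines.
    asm volatile("dsb sy" ::: "memory");
#else
    // Portable stand-in so this sketch compiles on other architectures;
    // it does NOT provide full-system visibility.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

The key design point is that the barrier comes after the store: first commit the new timestamp, then force completion so every observer sees it.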
### 2. `fence.metal` — add system-scope atomic load fallback

The GPU-side fix adds a two-tier strategy:

- `atomic_thread_fence` — works when the GPU cache is coherent; zero overhead in the common case
- `__metal_atomic_load_explicit` with `__METAL_MEMORY_SCOPE_SYSTEM__` — forces a load from the System Level Cache, bypassing GPU-local caches; this breaks through stale cache lines that the volatile path can't resolve
Uses `#pragma METAL internals : enable`, consistent with how MLX already uses Metal internals in its fence code.

Design note for reviewers: an alternative GPU-side approach would be to use `__metal_atomic_load_explicit` with system scope unconditionally (no fast path), which is simpler and provably correct. The two-tier strategy is a performance optimization: volatile reads are cheaper when the GPU cache is coherent, which is the common case. We defer to the MLX team's judgment on the right tradeoff here.
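The two-tier shape of the wait (cheap fast path, then a stronger fallback) can be illustrated with a host-side C++ analog. This is a sketch of the control flow only, not the Metal kernel: `two_tier_wait` and `SPIN_LIMIT` are invented names, and `std::memory_order_seq_cst` stands in for the system-scope Metal load.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Invented cutoff for the fast path; the real kernel's threshold may differ.
constexpr int SPIN_LIMIT = 1024;

// Host-side analog of the two-tier fence_wait strategy from fence.metal.
inline void two_tier_wait(std::atomic<uint64_t>& timestamp, uint64_t value) {
    // Tier 1: cheap relaxed loads -- succeeds when caches are coherent.
    for (int i = 0; i < SPIN_LIMIT; ++i) {
        if (timestamp.load(std::memory_order_relaxed) >= value) return;
    }
    // Tier 2: stronger loads that force synchronization with the writer
    // (the Metal fix uses a system-scope atomic load here instead).
    while (timestamp.load(std::memory_order_seq_cst) < value) {
        std::this_thread::yield();
    }
}
```

In the common coherent case the wait returns from tier 1 without ever paying for the stronger load, which is the performance argument for keeping the fast path.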
## Verification

Tested on 4× Mac Studio M3 Ultra 512 GB (macOS 26.3, Thunderbolt 5 full-mesh RDMA).

Model: GLM-5-8bit-MXFP8 (754B parameters, 767 GB, 78 layers, MXFP8 quantization), tensor parallel across 4 nodes via JACCL RDMA, with `mx.set_wired_limit(max_recommended_working_set_size)`.

**Before fix:** 4-node tensor parallel was unstable: stochastic deadlocks on short contexts and SIGABRT crashes on long contexts (50K+ tokens). On 4 attempts at 50K context, 3 resulted in `SIGABRT` in `check_error(MTL::CommandBuffer*)` within the first minute; the remaining nodes entered a 100% CPU deadlock waiting on all-reduce. Short-context inference (< 10K tokens) would intermittently hang in `Fence::wait`.

**After fix (CPU-side DSB SY only, applied to MLX 0.30.6):** all 6 inference runs (2× short context + 2× 10K context + 2× 50K context) on 4 nodes completed successfully: 0 crashes, 0 deadlocks.
Note: our cluster testing applied only the CPU-side fix (`fence.cpp`). The GPU-side fix (`fence.metal`) was not included because the full fork is based on MLX mainline (newer than 0.30.6) and didn't compile against 0.30.6. The CPU-side fix alone resolved all observed deadlocks and crashes, but the GPU-side fix provides defense in depth against stale GPU cache reads.

**RDMA integrity stress test:** we developed a dedicated 4-node stress test that exercises `all_sum` and `all_gather` with varying tensor sizes (1 / 100 / 512 MB) and data patterns (ones / sequential / random), 3 rounds each — 54 collective operations in total. This test reliably reproduced the deadlock before the fix. After applying DSB SY, all 54 operations passed with bit-perfect data integrity.

**Performance impact:** negligible. @rltakashige measured 267 vs 269 tok/s on Llama 3.2 1B (~10K context) — within noise.
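The structure of the integrity stress test described above can be sketched with an in-process stand-in for the collective. Names like `make_pattern`, `check_round`, and the simulated `all_sum` are ours; the real test runs `all_sum`/`all_gather` over JACCL across 4 nodes.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// The three data patterns exercised by the test.
enum class Pattern { Ones, Sequential, Random };

std::vector<uint32_t> make_pattern(Pattern p, size_t n, uint32_t seed) {
    std::vector<uint32_t> v(n);
    switch (p) {
        case Pattern::Ones:       std::fill(v.begin(), v.end(), 1u); break;
        case Pattern::Sequential: std::iota(v.begin(), v.end(), 0u); break;
        case Pattern::Random: {
            std::mt19937 rng(seed);  // seeded so every "node" agrees
            for (auto& x : v) x = rng();
            break;
        }
    }
    return v;
}

// In-process stand-in for all_sum across ranks: element-wise sum of each
// rank's buffer. Over JACCL this would be a single collective call.
std::vector<uint32_t> all_sum(const std::vector<std::vector<uint32_t>>& bufs) {
    std::vector<uint32_t> out(bufs[0].size(), 0);
    for (const auto& b : bufs)
        for (size_t i = 0; i < b.size(); ++i) out[i] += b[i];
    return out;
}

// Bit-perfect integrity check: with identical inputs on every rank, the
// all_sum result must equal the input scaled by the number of ranks.
bool check_round(Pattern p, size_t n, int nodes) {
    auto input = make_pattern(p, n, /*seed=*/42);
    std::vector<std::vector<uint32_t>> bufs(nodes, input);
    auto result = all_sum(bufs);
    for (size_t i = 0; i < n; ++i)
        if (result[i] != input[i] * static_cast<uint32_t>(nodes)) return false;
    return true;
}
```

Iterating `check_round` over 2 collectives × 3 sizes × 3 patterns × 3 rounds gives the 54 operations counted above.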
**Binary verification:** `otool -tv libmlx.dylib` confirms the `dsb sy` instruction immediately after `stlr` (store-release) in the fence-update path on all 4 nodes.

## References
## Environment
Made with Cursor