feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620
mvaligursky wants to merge 2 commits into `main`
Conversation
- `ComputeRadixSort`: 4-bit shared-memory multi-pass radix sort, WebGPU-only, runs on every device with no subgroup requirement. Selected as the portable fallback after benchmarking wider radixes (6-bit, 8-bit shared, 8-bit subgroup variants) on Apple M1/M2/M4, NVIDIA, and Android Mali/IMG; 4-bit is the single most consistent choice in every target range.
- `ComputeOneSweepRadixSort`: single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder, ported from b0nes164/GPUSorting. NVIDIA Turing+ only (gated by callers) — requires forward-thread-progress guarantees that Apple Silicon and mobile Mali/Adreno do not provide reliably.
- Radix-sort benchmark example (`test/radix-sort-compute`): compares the two backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional validation against a CPU reference implementation.
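The pass-count trade-off between the two radix widths follows directly from the digit size — an LSD radix sort runs one pass per radix-width digit of the key. A quick sketch (the `passCount` helper is illustrative, not engine API):

```javascript
// An LSD radix sort needs one pass per radix-width digit of the key.
// `passCount` is a hypothetical helper, not part of the engine.
const passCount = (keyBits, radixBits) => Math.ceil(keyBits / radixBits);

// Portable 4-bit backend: 32-bit keys take 8 passes, 16-bit keys 4.
console.log(passCount(32, 4)); // 8
console.log(passCount(16, 4)); // 4

// OneSweep's 8-bit radix halves the pass count for the same width.
console.log(passCount(32, 8)); // 4
console.log(passCount(24, 8)); // 3
```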
Leftover from an earlier iteration that had a scan-kernel factory with Blelloch and CSDLDF variants. The `PrefixSumKernel` constructor only takes `device`, so the object was silently ignored.
Pull request overview
Adds an engine-internal WebGPU OneSweep radix sort backend (plus WGSL kernels), improves the existing portable 4-bit compute radix sort’s dispatch/scan sizing, and ships an interactive benchmark/validation example to compare the backends.
Changes:
- Add OneSweep WGSL kernels (global histogram, scan, fused binning/lookback).
- Add `ComputeOneSweepRadixSort` implementation and export it.
- Update `ComputeRadixSort` to decouple dispatch/scan size from buffer capacity and expose `radixBits`; expand the example to benchmark/validate both modes.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-scan.js | New WGSL scan kernel for OneSweep histogram prefixes. |
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-global-hist.js | New WGSL fused global histogram kernel (vec4 loads). |
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-binning.js | New WGSL fused binning + lookback kernel (OneSweep core). |
| src/scene/graphics/compute-radix-sort.js | Dispatch/scan sizing fixed to follow current elementCount; add radixBits; rename GPU profiler pass labels. |
| src/scene/graphics/compute-onesweep-radix-sort.js | New OneSweep compute sort backend, shaders/formats, buffer management, dispatch flow. |
| src/index.js | Export ComputeOneSweepRadixSort. |
| examples/src/examples/test/radix-sort-compute.example.mjs | Add mode selection, overlays, benchmark + validation harnesses. |
| examples/src/examples/test/radix-sort-compute.controls.mjs | Add UI controls for mode/render/validation and benchmark/validate actions. |
```js
// `MAX_SUBGROUPS` is the number of subgroups in a 256-thread
// workgroup at this device's subgroup size. Both the scan and
// binning kernels size their per-subgroup reduction slots by this
// value; hardcoding it to 8 (the sgSize=32 case) is incorrect on
// hardware with smaller subgroups (e.g. Pixel / Mali at sgSize=16
// → 16 subgroups), so we specialize the shader per device here.
const maxSubgroups = Math.max(1, Math.ceil(256 / (device.maxSubgroupSize || 32)));
```
`maxSubgroups` is computed from `device.maxSubgroupSize`, but the runtime `subgroup_size` may be any value in [minSubgroupSize, maxSubgroupSize]. For workgroup-local arrays indexed by `waveIndex = gtid / sgSize`, you need to size for the minimum subgroup size (worst case: the most subgroups); otherwise you under-allocate and index out of bounds when the runtime subgroup size is smaller than the max, corrupting scan/binning results. Compute this from `device.minSubgroupSize` (with a safe fallback when it’s 0) and update the related shader comments accordingly.
Suggested change:

```js
// `MAX_SUBGROUPS` is the worst-case number of subgroups in a
// 256-thread workgroup. The runtime `subgroup_size` may be any
// value in [minSubgroupSize, maxSubgroupSize], and the scan and
// binning kernels size per-subgroup reduction slots indexed by
// `waveIndex = gtid / sgSize`. To avoid under-allocating those
// arrays when the runtime subgroup size is smaller than the max,
// size this from the minimum subgroup size, with a safe fallback
// for devices that do not report it.
const minSubgroupSize = device.minSubgroupSize || device.maxSubgroupSize || 32;
const maxSubgroups = Math.max(1, Math.ceil(256 / minSubgroupSize));
```
```js
 */
constructor(device) {
    Debug.assert(device.supportsCompute, 'ComputeOneSweepRadixSort requires compute shader support (WebGPU)');
    Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
```
The OneSweep binning WGSL uses 32-bit lane masks (e.g. subgroupBallot(...).x and (1u << sgInvId)), which is only valid when the runtime subgroup size is ≤ 32. On devices with 64/128-lane subgroups this becomes incorrect/UB and can silently corrupt output. Add a runtime guard (e.g. assert device.maxSubgroupSize <= 32 and device.minSubgroupSize > 0) or extend the shader to handle full 128-bit subgroup ballots.
Suggested change:

```js
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
Debug.assert(device.minSubgroupSize > 0, 'ComputeOneSweepRadixSort requires a valid runtime subgroup size');
Debug.assert(device.maxSubgroupSize <= 32, 'ComputeOneSweepRadixSort currently requires subgroup sizes <= 32 because the OneSweep binning shader uses 32-bit subgroup masks');
```
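The 32-bit mask hazard flagged above can be reproduced outside WGSL: JavaScript's shift operators also use the shift count modulo 32, mirroring `u32` shift behaviour (the helper name below is made up for illustration):

```javascript
// Mask of all lanes below a subgroup invocation id, as 32-bit
// arithmetic like the shader's `(1u << sgInvId)` would compute it.
// JS `<<` uses the count mod 32, matching WGSL u32 shifts.
const lanesBelowMask = (sgInvId) => ((1 << sgInvId) - 1) >>> 0;

console.log(lanesBelowMask(8));  // 255 — lanes 0..7, correct for sgSize <= 32
console.log(lanesBelowMask(33)); // 1   — wrapped: NOT a mask of lanes 0..32
```

This is why the mask math silently corrupts results on 64/128-lane subgroups rather than failing loudly.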
```js
}

if (rowHasDetails) {
    const detailIdx = detailRows.length;
```
The validation “Details” toggle index is inconsistent: the button uses data-toggle set to detailRows.length before pushing this size’s failing modes, but the hidden detail row uses data-detail-content set to detailRows.length - 1 after pushing. If a size has multiple failing modes, the button will point at a non-existent row and won’t open the details. Set the button’s data-toggle to the same id used by the inserted detail row (or compute a per-size id independent of detailRows.length).
Suggested change:

```js
const failingCount = BENCH_CONFIGS.reduce((count, cfg) => {
    const entry = results.get(`${size}:${cfg.label}`);
    return count + ((entry && entry.failed > 0) ? 1 : 0);
}, 0);
const detailIdx = detailRows.length + failingCount - 1;
```
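The off-by-one described above is easy to reproduce in isolation (a minimal sketch, not the example's actual code):

```javascript
// The button's id is read BEFORE pushing this size's failing modes,
// the detail row's id AFTER — they only agree when exactly one mode
// failed for the size.
const detailRows = [];

const toggleId = detailRows.length;      // 0 — captured before the pushes
detailRows.push('mode A failure log');   // a size with two failing modes
detailRows.push('mode B failure log');
const contentId = detailRows.length - 1; // 1 — captured after the pushes

console.log(toggleId === contentId); // false — the toggle points at an id
                                     // no detail row carries
```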
```js
this._threadBlocks = currentThreadBlocks;

const maxPerDim = this.device.limits.maxComputeWorkgroupsPerDimension || 65535;
Compute.calcDispatchSize(currentThreadBlocks, this._binningDispatchSize, maxPerDim);
```
DigitBinningPass assigns tiles via an atomic counter (b_index) and then uses partitionIndex to index b_passHist. This requires that the dispatch launches exactly threadBlocks workgroups. Using Compute.calcDispatchSize can pad x*y above threadBlocks when threadBlocks > maxComputeWorkgroupsPerDimension, causing extra workgroups to claim partitionIndex >= threadBlocks and then read/write b_passHist out of bounds. Consider (a) asserting threadBlocks <= maxPerDim and dispatching 1D, or (b) adding an early-exit in WGSL when partitionIndex >= uniforms.threadBlocks before any passHistOffset access.
Suggested change:

```js
Debug.assert(
    currentThreadBlocks <= maxPerDim,
    `ComputeOneSweepRadixSort binning pass requires an exact 1D dispatch; threadBlocks (${currentThreadBlocks}) exceeds maxComputeWorkgroupsPerDimension (${maxPerDim}).`
);
this._binningDispatchSize.set(currentThreadBlocks, 1);
```
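The padding hazard is easy to see with a stand-in for `Compute.calcDispatchSize` — assumed here to split a 1D workgroup count into an x*y grid under the per-dimension limit (a sketch, not the engine's implementation):

```javascript
// Split a 1D workgroup count into a 2D dispatch under maxPerDim.
// When count exceeds maxPerDim, x*y rounds UP past count — those
// extra workgroups are the ones that would claim
// partitionIndex >= threadBlocks in the binning pass.
const calcDispatchSize = (count, maxPerDim) => {
    const x = Math.min(count, maxPerDim);
    const y = Math.ceil(count / x);
    return { x, y, launched: x * y };
};

// Within the per-dimension limit the dispatch is exact...
console.log(calcDispatchSize(1000, 65535).launched); // 1000

// ...but past it, x*y pads above the requested count.
const big = calcDispatchSize(200000, 65535);
console.log(big.launched > 200000); // true — padding workgroups launched
```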

Summary

- `ComputeOneSweepRadixSort` — single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder. Ported from b0nes164/GPUSorting (Thomas Smith, MIT). NVIDIA Turing+ only — requires forward-thread-progress guarantees that Apple Silicon, Mali and Adreno do not provide reliably. Not wired up in production yet.
- `ComputeRadixSort` (4-bit multi-pass) — decouples dispatch/scan size from buffer capacity so a small sort following a large one no longer pays the full capacity cost. Also adds a `radixBits` getter so callers can align key-bit counts generically across sort backends.
- Benchmark example (`test/radix-sort-compute`) comparing both backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional CPU-reference validation.

No public API changes
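The backend split above — portable 4-bit everywhere, OneSweep only where callers gate it to NVIDIA — could be sketched as a selection helper. All names here are hypothetical, not the engine's API or the planned facade:

```javascript
// Hypothetical backend selection; `createRadixSort` and its option
// names are illustrative only.
const createRadixSort = (device, { nvidiaTuringPlus = false } = {}) => {
    // OneSweep needs subgroup support plus forward-thread-progress
    // guarantees; everything else takes the portable 4-bit backend.
    const canOneSweep = nvidiaTuringPlus && device.supportsSubgroups;
    return canOneSweep ? 'ComputeOneSweepRadixSort' : 'ComputeRadixSort';
};

console.log(createRadixSort({ supportsSubgroups: true }, { nvidiaTuringPlus: true }));
// 'ComputeOneSweepRadixSort'
console.log(createRadixSort({ supportsSubgroups: false }));
// 'ComputeRadixSort'
```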
Both `ComputeRadixSort` and `ComputeOneSweepRadixSort` are tagged `@ignore` — they are engine-internal and not part of the published API. The `radixBits` getter added to `ComputeRadixSort` is likewise internal, used by the gsplat manager / benchmark to route key-bit counts between backends.

Production routing
`gsplat-manager.js` continues to use `new ComputeRadixSort(device)` unconditionally. A follow-up PR will add a thin facade that routes NVIDIA adapters to `ComputeOneSweepRadixSort`. This PR is deliberately just "make the backend available + benchmark it".

Dependencies
Depends on `device.minSubgroupSize` / `device.maxSubgroupSize` added in #8645 (already on main). `ComputeOneSweepRadixSort` parametrises its shared-memory layout from `device.maxSubgroupSize` to support 16 / 32 / 64 / 128 lane subgroup hardware from the same source.

Performance
Sort-only GPU time (Forward pass excluded), 24-bit keys, 20 warmup + 50 measured frames per cell. `OneSweep-safe` below is the CSDLDF-backed fallback that got cut.

NVIDIA RTX 2070 (Turing) — OneSweep dominates across the full range (up to ~5× faster than 4-bit at 25M+). `8-coalesced` is a respectable second but has no routing advantage over OneSweep. `OneSweep-safe` (CSDLDF) collapses badly past 5M — the reason the CSDLDF fallback was dropped.

NVIDIA Quadro M2000 (Ampere) — same shape as Turing: OneSweep wins, CSDLDF-backed variant degrades sharply past a few million elements.

Apple M1 (metal-3) — 4-bit is the clear winner across the range that matters (4–8M target). `8-ranked` regresses past 5M (0.53× at 20M) and `8-coalesced` is similar. OneSweep (plain and CSDLDF-safe) is non-competitive on Apple because lookback stalls are uncapped. This chart drove the "ship 4-bit as the single portable fallback" decision.

Apple M4 (metal-3) — very different story: `8-ranked` is 1.2–1.7× faster than 4-bit across every size and even beats OneSweep. M4 is not the Apple priority though (it already has a fast local-compute GSplat path), and since we ship one Apple fallback, M1/M2 win the tie-break. Leaving `8-ranked` in the branch history for a potential future M4-specific routing.

Experiments that didn't make it
Several alternative radix-sort variants were explored before settling on "4-bit portable + OneSweep on NVIDIA". Listing them here so the reasoning isn't lost — the history is preserved on branch `mv-radix-8bit-subgroup` up to commit `be5ffb304` if anyone needs to resurrect one.

- CSDLDF (Chained-Scan Decoupled-Lookback Decoupled-Fallback) scan kernel, plus a dedicated scan-kernel factory that switched between CSDLDF and Blelloch. Intended to give OneSweep a portable fallback for devices without forward-thread-progress guarantees. On NVIDIA it validates correctly but serialises badly (tens to >100× slower than plain lookback at larger sizes). On Apple M4 it's slightly faster than plain OneSweep at 2–3M (10–20% win), but both OneSweep variants are dominated by 4-bit on Apple, so the theoretical advantage doesn't translate into a real win for any device in the current matrix. Removed for now — the algorithm itself is sound and may be worth resurrecting if a device shows up where plain lookback deadlocks and CSDLDF beats 4-bit.
- 8-bit shared-memory radix sort (no subgroup intrinsics). Wider radix → fewer passes, but the resulting 256-bucket bitmasks blow out shared memory enough that Mali hit a recursion cliff and M4 regressed vs 4-bit. Removed.
- 8-bit subgroup radix sort in four reorder variants (`ballot`, `packed`, `ranked`, `coalesced`). Uses subgroup intrinsics for ranking. Genuinely strong on Apple M4 — `ranked` is 1.2–1.7× faster than 4-bit across every size and beats OneSweep too. However, M1/M2 are the Apple priority (M4 already has a fast local-compute GSplat path), and on M1/M2 the ranked variant regresses past 5M (0.53× at 20M). Since we only ship one Apple fallback, M1/M2 perf wins, so 4-bit is the safer pick. `8-coalesced` is also competitive on NVIDIA Ampere up to ~4M but OneSweep covers that range comfortably. Removed for simplicity — if an M4-specific (or desktop-class Apple) routing becomes worthwhile, `ranked` is the variant to revisit.
- 6-bit shared-memory radix sort, with two optimisations on top (bank-conflict-friendly bitmask layout, shadow-copy bitmask reads). Attractive on paper as the middle ground between 4-bit and 8-bit. Apple M4 and Pixel 8 Pro looked promising, but Apple M1/M2 regressed vs 4-bit in the 4–8M target range (the range that actually matters for those devices — M1/M2 lacks the compute perf headroom of M4). Shadow-copy was reverted to test whether LDS-occupancy pressure explained the regression; the regression remained, so the cause is architectural rather than LDS pressure. Removed.
Net result: one portable backend (`ComputeRadixSort`, 4-bit) that wins or is close-second on every target, plus one NVIDIA-specialised backend (`ComputeOneSweepRadixSort`, OneSweep) for where the hardware/driver can actually deliver on its lookback-based fusion.

Platform coverage / benchmarks
Benchmarking was done on:

For the Unified GSplat manager's dynamic 10–20 bit key range, `ComputeRadixSort` validates and performs consistently across the full platform matrix; `ComputeOneSweepRadixSort` validates and wins by a meaningful margin on NVIDIA but fails validation on the mobile SoCs tested.

Test plan
- `ComputeRadixSort` validates across the full range; OneSweep should remain disabled
- `ComputeRadixSort` validates across the full range
- `gsplat-manager.js` continues to sort correctly (still routes to `ComputeRadixSort`)