
feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620

Draft
mvaligursky wants to merge 2 commits into main from
mv-radix-8bit-subgroup

Conversation

@mvaligursky
Contributor

@mvaligursky mvaligursky commented Apr 20, 2026

Summary

  • New: ComputeOneSweepRadixSort — single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder. Ported from b0nes164/GPUSorting (Thomas Smith, MIT). NVIDIA Turing+ only — requires forward-thread-progress guarantees that Apple Silicon, Mali and Adreno do not provide reliably. Not wired up in production yet.
  • Improved: existing ComputeRadixSort (4-bit multi-pass) — decouples dispatch/scan size from buffer capacity so a small sort following a large one no longer pays the full capacity cost. Also adds a radixBits getter so callers can align key-bit counts generically across sort backends.
  • New: interactive benchmark example (test/radix-sort-compute) comparing both backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional CPU-reference validation.
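The radixBits getter makes cross-backend key-width alignment a one-liner for callers. A minimal sketch, assuming only that a sorter reports its radix width; the helper name below is illustrative, not engine API:

```javascript
// Hypothetical helper: round a requested key width up to a whole number of
// radix passes, so callers can route key-bit counts generically across
// backends (radixBits is 4 for ComputeRadixSort, 8 for OneSweep).
function alignKeyBits(keyBits, radixBits) {
    const passes = Math.ceil(keyBits / radixBits);
    return passes * radixBits;
}

// A 20-bit gsplat key needs 5 passes at 4 bits but 3 passes at 8 bits:
console.log(alignKeyBits(20, 4)); // 20
console.log(alignKeyBits(20, 8)); // 24
```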

No public API changes

Both ComputeRadixSort and ComputeOneSweepRadixSort are tagged @ignore — they are engine-internal and not part of the published API. The radixBits getter added to ComputeRadixSort is likewise internal, used by the gsplat manager / benchmark to route key-bit counts between backends.

Production routing

gsplat-manager.js continues to use new ComputeRadixSort(device) unconditionally. A follow-up PR will add a thin facade that routes NVIDIA adapters to ComputeOneSweepRadixSort. This PR is deliberately just "make the backend available + benchmark it".
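The follow-up facade's routing decision could look something like the sketch below. The vendor string and capability checks are assumptions for illustration, not the engine's actual adapter detection:

```javascript
// Hypothetical routing sketch (the facade is explicitly NOT in this PR).
// Returns the backend class name to instantiate for a given adapter.
function selectSortBackend(vendor, supportsSubgroups, maxSubgroupSize) {
    // OneSweep relies on forward-thread-progress guarantees, which in this
    // PR's test matrix only NVIDIA (Turing+) delivers reliably, and its
    // binning kernel assumes subgroups representable in 32-bit lane masks.
    const nvidia = /nvidia/i.test(vendor);
    if (nvidia && supportsSubgroups && maxSubgroupSize <= 32) {
        return 'ComputeOneSweepRadixSort';
    }
    return 'ComputeRadixSort'; // portable 4-bit fallback everywhere else
}
```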

Dependencies

Depends on device.minSubgroupSize / device.maxSubgroupSize added in #8645 (already on main). ComputeOneSweepRadixSort parametrises its shared-memory layout from device.maxSubgroupSize to support 16/32/64/128-lane subgroup hardware from the same shader source.
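The parametrisation amounts to sizing the per-subgroup reduction slots from the reported subgroup width. A sketch under the assumption of a fixed 256-thread workgroup; the helper name is illustrative:

```javascript
// Illustrative sketch: number of per-subgroup reduction slots needed by a
// fixed-size workgroup at a given subgroup width. Wider subgroups mean
// fewer slots; hardcoding the sgSize=32 case would under-allocate on
// 16-lane hardware.
function subgroupSlots(workgroupSize, subgroupSize) {
    return Math.max(1, Math.ceil(workgroupSize / subgroupSize));
}

console.log(subgroupSlots(256, 32));  // 8 slots (typical desktop)
console.log(subgroupSlots(256, 16));  // 16 slots (e.g. Mali at sgSize=16)
```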

Performance

Sort-only GPU time (forward render pass excluded), 24-bit keys, 20 warmup + 50 measured frames per cell. "OneSweep-safe" below is the CSDLDF-backed fallback that was later cut.
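The per-cell measurement loop can be sketched as follows, assuming a timeSortOnce callback that returns one frame's sort-only GPU time in milliseconds. Names are illustrative; the real benchmark example reads GPU timestamps:

```javascript
// Illustrative measurement loop: discard warmup frames (pipeline compile,
// cache warmup), then average the sort-only GPU time over measured frames.
async function measureCell(timeSortOnce, warmup = 20, measured = 50) {
    for (let i = 0; i < warmup; i++) {
        await timeSortOnce(); // result discarded
    }
    let total = 0;
    for (let i = 0; i < measured; i++) {
        total += await timeSortOnce();
    }
    return total / measured; // average sort-only GPU ms for this cell
}
```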

NVIDIA RTX 2070 (Turing) — OneSweep dominates across the full range (up to ~5× faster than 4-bit at 25M+). 8-coalesced is a respectable second but has no routing advantage over OneSweep. OneSweep-safe (CSDLDF) collapses badly past 5M — the reason the CSDLDF fallback was dropped.

(chart: NVIDIA RTX 2070 results)

NVIDIA Quadro M2000 (Ampere) — same shape as Turing: OneSweep wins, CSDLDF-backed variant degrades sharply past a few million elements.

(chart: NVIDIA Quadro M2000 results)

Apple M1 (metal-3) — 4-bit is the clear winner across the range that matters (4–8M target). 8-ranked regresses past 5M (0.53× at 20M) and 8-coalesced is similar. OneSweep (plain and CSDLDF-safe) is non-competitive on Apple because lookback stalls are uncapped. This chart drove the "ship 4-bit as the single portable fallback" decision.

(chart: Apple M1 results)

Apple M4 (metal-3) — very different story: 8-ranked is 1.2–1.7× faster than 4-bit across every size and even beats OneSweep. M4 is not the Apple priority though (it already has a fast local-compute GSplat path), and since we ship one Apple fallback, M1/M2 win the tie-break. Leaving 8-ranked in the branch history for a potential future M4-specific routing.

(chart: Apple M4 results)

Experiments that didn't make it

Several alternative radix-sort variants were explored before settling on "4-bit portable + OneSweep on NVIDIA". Listing them here so the reasoning isn't lost — the history is preserved on branch mv-radix-8bit-subgroup up to commit be5ffb304 if anyone needs to resurrect one.

  • CSDLDF (Chained-Scan Decoupled-Lookback Decoupled-Fallback) scan kernel, plus a dedicated scan-kernel factory that switched between CSDLDF and Blelloch. Intended to give OneSweep a portable fallback for devices without forward-thread-progress guarantees. On NVIDIA it validates correctly but serialises badly (tens to >100× slower than plain lookback at larger sizes). On Apple M4 it's slightly faster than plain OneSweep at 2–3M (10–20% win), but both OneSweep variants are dominated by 4-bit on Apple, so the theoretical advantage doesn't translate into a real win for any device in the current matrix. Removed for now — the algorithm itself is sound and may be worth resurrecting if a device shows up where plain lookback deadlocks and CSDLDF beats 4-bit.

  • 8-bit shared-memory radix sort (no subgroup intrinsics). Wider radix → fewer passes, but the resulting 256-bucket bitmasks blow out shared memory enough that Mali hit a recursion cliff and M4 regressed vs 4-bit. Removed.

  • 8-bit subgroup radix sort in four reorder variants (ballot, packed, ranked, coalesced). Uses subgroup intrinsics for ranking. Genuinely strong on Apple M4 — ranked is 1.2–1.7× faster than 4-bit across every size and beats OneSweep too. However, M1/M2 are the Apple priority (M4 already has a fast local-compute GSplat path), and on M1/M2 the ranked variant regresses past 5M (0.53× at 20M). Since we only ship one Apple fallback, M1/M2 perf wins, so 4-bit is the safer pick. 8-coalesced is also competitive on NVIDIA Ampere up to ~4M but OneSweep covers that range comfortably. Removed for simplicity — if an M4-specific (or desktop-class Apple) routing becomes worthwhile, ranked is the variant to revisit.

  • 6-bit shared-memory radix sort, with two optimisations on top (bank-conflict-friendly bitmask layout, shadow-copy bitmask reads). Attractive on paper as the middle ground between 4-bit and 8-bit. Apple M4 and Pixel 8 Pro looked promising, but Apple M1/M2 regressed vs 4-bit in the 4–8M target range (the range that actually matters for those devices — M1/M2 lacks the compute perf headroom of M4). Shadow-copy was reverted to test whether LDS-occupancy pressure explained the regression; the regression remained, so the cause is architectural rather than LDS pressure. Removed.

Net result: one portable backend (ComputeRadixSort, 4-bit) that wins or is close-second on every target, plus one NVIDIA-specialised backend (ComputeOneSweepRadixSort, OneSweep) for where the hardware/driver can actually deliver on its lookback-based fusion.

Platform coverage / benchmarks

Benchmarking was done on:

  • NVIDIA RTX 2070 (Turing), NVIDIA Quadro M2000 (Ampere)
  • Apple M1, M2, M4
  • iPhone 13 Pro (A15)
  • Android Pixel 8 Pro (Tensor G3 / Mali-G715 Valhall)
  • Android Pixel 10 Pro (Tensor G5 / Imagination IMG)

For the Unified GSplat manager's dynamic 10–20 bit key range, ComputeRadixSort validates and performs consistently across the full platform matrix; ComputeOneSweepRadixSort validates and wins by a meaningful margin on NVIDIA but fails validation on the mobile SoCs tested.

Test plan

  • Run benchmark on NVIDIA (RTX 2070+): both backends validate across 1K–50M elements and 4/8/16/24/32-bit widths
  • Run benchmark on Apple M1/M2/M4: ComputeRadixSort validates across full range; OneSweep should remain disabled
  • Run benchmark on Android (Pixel 8/10 Pro): ComputeRadixSort validates across full range
  • Existing GSplat scenes using gsplat-manager.js continue to sort correctly (still routes to ComputeRadixSort)
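The CPU-reference validation step in the plan above can be sketched like this, under illustrative names: mask random keys to the tested bit width, sort a copy on the CPU, and compare element-wise against the GPU-sorted readback.

```javascript
// Illustrative validation sketch (not the example's exact harness):
// returns the number of mismatched elements; 0 means the GPU sort
// matches the CPU reference at this key width.
function validateSort(gpuSorted, originalKeys, keyBits) {
    // (1 << 32) wraps in JS, so handle the full-width case explicitly
    const mask = keyBits >= 32 ? 0xffffffff : (1 << keyBits) - 1;
    // CPU reference: numeric ascending sort of the masked keys
    const reference = Array.from(originalKeys, k => (k & mask) >>> 0)
        .sort((a, b) => a - b);
    let failed = 0;
    for (let i = 0; i < reference.length; i++) {
        if (((gpuSorted[i] & mask) >>> 0) !== reference[i]) failed++;
    }
    return failed;
}
```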

@LeXXik
Contributor

LeXXik commented Apr 20, 2026

3090:

(screenshot: RTX 3090 benchmark results)

Note: the WebGPU GPU Sort step breaks here (black screen).

- ComputeRadixSort: 4-bit shared-memory multi-pass radix sort, WebGPU-only, runs on every device with no subgroup requirement. Selected as the portable fallback after benchmarking wider radixes (6-bit, 8-bit shared, 8-bit subgroup variants) on Apple M1/M2/M4, NVIDIA, and Android Mali/IMG; 4-bit is the single most consistent choice in every target range.

- ComputeOneSweepRadixSort: single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder, ported from b0nes164/GPUSorting. NVIDIA Turing+ only (gated by callers); requires forward-thread-progress guarantees that Apple Silicon and mobile Mali/Adreno do not provide reliably.

- Radix-sort benchmark example (test/radix-sort-compute): compares the two backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional validation against a CPU reference implementation.
@mvaligursky mvaligursky changed the title feat: experimental 8-bit radix sort with CSDLDF scan + subgroup variants feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA) Apr 23, 2026
Leftover from an earlier iteration that had a scan-kernel factory with Blelloch and CSDLDF variants. PrefixSumKernel.constructor only takes device, so the object was silently ignored.

Copilot AI left a comment


Pull request overview

Adds an engine-internal WebGPU OneSweep radix sort backend (plus WGSL kernels), improves the existing portable 4-bit compute radix sort’s dispatch/scan sizing, and ships an interactive benchmark/validation example to compare the backends.

Changes:

  • Add OneSweep WGSL kernels (global histogram, scan, fused binning/lookback).
  • Add ComputeOneSweepRadixSort implementation and export it.
  • Update ComputeRadixSort to decouple dispatch/scan size from buffer capacity and expose radixBits; expand the example to benchmark/validate both modes.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Changed files:

  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-scan.js: new WGSL scan kernel for OneSweep histogram prefixes.
  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-global-hist.js: new WGSL fused global histogram kernel (vec4 loads).
  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-binning.js: new WGSL fused binning + lookback kernel (OneSweep core).
  • src/scene/graphics/compute-radix-sort.js: dispatch/scan sizing fixed to follow current elementCount; add radixBits; rename GPU profiler pass labels.
  • src/scene/graphics/compute-onesweep-radix-sort.js: new OneSweep compute sort backend, shaders/formats, buffer management, dispatch flow.
  • src/index.js: export ComputeOneSweepRadixSort.
  • examples/src/examples/test/radix-sort-compute.example.mjs: add mode selection, overlays, benchmark + validation harnesses.
  • examples/src/examples/test/radix-sort-compute.controls.mjs: add UI controls for mode/render/validation and benchmark/validate actions.


Comment on lines +357 to +363
// `MAX_SUBGROUPS` is the number of subgroups in a 256-thread
// workgroup at this device's subgroup size. Both the scan and
// binning kernels size their per-subgroup reduction slots by this
// value; hardcoding it to 8 (the sgSize=32 case) is incorrect on
// hardware with smaller subgroups (e.g. Pixel / Mali at sgSize=16
// → 16 subgroups), so we specialize the shader per device here.
const maxSubgroups = Math.max(1, Math.ceil(256 / (device.maxSubgroupSize || 32)));

Copilot AI Apr 23, 2026


maxSubgroups is computed from device.maxSubgroupSize, but subgroup_size may be any value in [minSubgroupSize, maxSubgroupSize]. For workgroup-local arrays indexed by waveIndex = gtid / sgSize, you need to size for the minimum subgroup size (worst-case: most subgroups) or you can under-allocate and index out of bounds when the runtime subgroup size is smaller than the max, corrupting scan/binning results. Compute this from device.minSubgroupSize (with a safe fallback when it’s 0) and update the related shader comments accordingly.

Suggested change
// `MAX_SUBGROUPS` is the number of subgroups in a 256-thread
// workgroup at this device's subgroup size. Both the scan and
// binning kernels size their per-subgroup reduction slots by this
// value; hardcoding it to 8 (the sgSize=32 case) is incorrect on
// hardware with smaller subgroups (e.g. Pixel / Mali at sgSize=16
// → 16 subgroups), so we specialize the shader per device here.
const maxSubgroups = Math.max(1, Math.ceil(256 / (device.maxSubgroupSize || 32)));
// `MAX_SUBGROUPS` is the worst-case number of subgroups in a
// 256-thread workgroup. The runtime `subgroup_size` may be any
// value in [minSubgroupSize, maxSubgroupSize], and the scan and
// binning kernels size per-subgroup reduction slots indexed by
// `waveIndex = gtid / sgSize`. To avoid under-allocating those
// arrays when the runtime subgroup size is smaller than the max,
// size this from the minimum subgroup size, with a safe fallback
// for devices that do not report it.
const minSubgroupSize = device.minSubgroupSize || device.maxSubgroupSize || 32;
const maxSubgroups = Math.max(1, Math.ceil(256 / minSubgroupSize));

*/
constructor(device) {
Debug.assert(device.supportsCompute, 'ComputeOneSweepRadixSort requires compute shader support (WebGPU)');
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');

Copilot AI Apr 23, 2026


The OneSweep binning WGSL uses 32-bit lane masks (e.g. subgroupBallot(...).x and (1u << sgInvId)), which is only valid when the runtime subgroup size is ≤ 32. On devices with 64/128-lane subgroups this becomes incorrect/UB and can silently corrupt output. Add a runtime guard (e.g. assert device.maxSubgroupSize <= 32 and device.minSubgroupSize > 0) or extend the shader to handle full 128-bit subgroup ballots.

Suggested change
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
Debug.assert(device.minSubgroupSize > 0, 'ComputeOneSweepRadixSort requires a valid runtime subgroup size');
Debug.assert(device.maxSubgroupSize <= 32, 'ComputeOneSweepRadixSort currently requires subgroup sizes <= 32 because the OneSweep binning shader uses 32-bit subgroup masks');

}

if (rowHasDetails) {
const detailIdx = detailRows.length;

Copilot AI Apr 23, 2026


The validation “Details” toggle index is inconsistent: the button uses data-toggle set to detailRows.length before pushing this size’s failing modes, but the hidden detail row uses data-detail-content set to detailRows.length - 1 after pushing. If a size has multiple failing modes, the button will point at a non-existent row and won’t open the details. Set the button’s data-toggle to the same id used by the inserted detail row (or compute a per-size id independent of detailRows.length).

Suggested change
const detailIdx = detailRows.length;
const failingCount = BENCH_CONFIGS.reduce((count, cfg) => {
    const entry = results.get(`${size}:${cfg.label}`);
    return count + ((entry && entry.failed > 0) ? 1 : 0);
}, 0);
const detailIdx = detailRows.length + failingCount - 1;

this._threadBlocks = currentThreadBlocks;

const maxPerDim = this.device.limits.maxComputeWorkgroupsPerDimension || 65535;
Compute.calcDispatchSize(currentThreadBlocks, this._binningDispatchSize, maxPerDim);

Copilot AI Apr 23, 2026


DigitBinningPass assigns tiles via an atomic counter (b_index) and then uses partitionIndex to index b_passHist. This requires that the dispatch launches exactly threadBlocks workgroups. Using Compute.calcDispatchSize can pad x*y above threadBlocks when threadBlocks > maxComputeWorkgroupsPerDimension, causing extra workgroups to claim partitionIndex >= threadBlocks and then read/write b_passHist out of bounds. Consider (a) asserting threadBlocks <= maxPerDim and dispatching 1D, or (b) adding an early-exit in WGSL when partitionIndex >= uniforms.threadBlocks before any passHistOffset access.

Suggested change
Compute.calcDispatchSize(currentThreadBlocks, this._binningDispatchSize, maxPerDim);
Debug.assert(
    currentThreadBlocks <= maxPerDim,
    `ComputeOneSweepRadixSort binning pass requires an exact 1D dispatch; threadBlocks (${currentThreadBlocks}) exceeds maxComputeWorkgroupsPerDimension (${maxPerDim}).`
);
this._binningDispatchSize.set(currentThreadBlocks, 1);
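The padding issue flagged in the comment above can be demonstrated with a small sketch. The function below mirrors the usual 2D dispatch split; it is an assumption for illustration, not the engine's Compute.calcDispatchSize:

```javascript
// Illustrative 2D dispatch split: x is capped at the per-dimension limit
// and y rounds up, so x * y can exceed the requested workgroup count.
// Tile-claiming kernels (atomic b_index / partitionIndex) therefore need
// either an exact 1D dispatch or a WGSL early-exit guard.
function calcDispatchSize2D(threadBlocks, maxPerDim) {
    const x = Math.min(threadBlocks, maxPerDim);
    const y = Math.ceil(threadBlocks / x);
    return { x, y, launched: x * y };
}

// 70000 blocks at a 65535 per-dimension limit launches 131070 workgroups;
// the extra 61070 would claim partitionIndex >= threadBlocks.
calcDispatchSize2D(70000, 65535); // { x: 65535, y: 2, launched: 131070 }
```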