
feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620

Draft
mvaligursky wants to merge 2 commits into main from
mv-radix-8bit-subgroup

Conversation

@mvaligursky
Contributor

@mvaligursky mvaligursky commented Apr 20, 2026

Summary

  • New: ComputeOneSweepRadixSort — single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder. Ported from b0nes164/GPUSorting (Thomas Smith, MIT). NVIDIA Turing+ only — requires forward-thread-progress guarantees that Apple Silicon, Mali and Adreno do not provide reliably. Not wired up in production yet.
  • Improved: existing ComputeRadixSort (4-bit multi-pass) — decouples dispatch/scan size from buffer capacity so a small sort following a large one no longer pays the full capacity cost. Also adds a radixBits getter so callers can align key-bit counts generically across sort backends.
  • New: interactive benchmark example (test/radix-sort-compute) comparing both backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional CPU-reference validation.
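The radixBits getter makes cross-backend key-width alignment a one-liner for callers. A minimal sketch, assuming only that a sorter reports its radix width; the helper name below is illustrative, not engine API:

```javascript
// Hypothetical helper: round a requested key width up to a whole number of
// radix passes, so callers can route key-bit counts generically across
// backends (radixBits is 4 for ComputeRadixSort, 8 for OneSweep).
function alignKeyBits(keyBits, radixBits) {
    const passes = Math.ceil(keyBits / radixBits);
    return passes * radixBits;
}

// A 20-bit gsplat key needs 5 passes at 4 bits but 3 passes at 8 bits:
console.log(alignKeyBits(20, 4)); // 20
console.log(alignKeyBits(20, 8)); // 24
```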

No public API changes

Both ComputeRadixSort and ComputeOneSweepRadixSort are tagged @ignore — they are engine-internal and not part of the published API. The radixBits getter added to ComputeRadixSort is likewise internal, used by the gsplat manager / benchmark to route key-bit counts between backends.

Production routing

gsplat-manager.js continues to use new ComputeRadixSort(device) unconditionally. A follow-up PR will add a thin facade that routes NVIDIA adapters to ComputeOneSweepRadixSort. This PR is deliberately just "make the backend available + benchmark it".
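The follow-up facade's routing decision could look something like the sketch below. The vendor string and capability checks are assumptions for illustration, not the engine's actual adapter detection:

```javascript
// Hypothetical routing sketch (the facade is explicitly NOT in this PR).
// Returns the backend class name to instantiate for a given adapter.
function selectSortBackend(vendor, supportsSubgroups, maxSubgroupSize) {
    // OneSweep relies on forward-thread-progress guarantees, which in this
    // PR's test matrix only NVIDIA (Turing+) delivers reliably, and its
    // binning kernel assumes subgroups representable in 32-bit lane masks.
    const nvidia = /nvidia/i.test(vendor);
    if (nvidia && supportsSubgroups && maxSubgroupSize <= 32) {
        return 'ComputeOneSweepRadixSort';
    }
    return 'ComputeRadixSort'; // portable 4-bit fallback everywhere else
}
```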

Dependencies

Depends on device.minSubgroupSize / device.maxSubgroupSize added in #8645 (already on main). ComputeOneSweepRadixSort parametrises its shared-memory layout from device.maxSubgroupSize to support 16/32/64/128-lane subgroup hardware from the same shader source.
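The parametrisation amounts to sizing the per-subgroup reduction slots from the reported subgroup width. A sketch under the assumption of a fixed 256-thread workgroup; the helper name is illustrative:

```javascript
// Illustrative sketch: number of per-subgroup reduction slots needed by a
// fixed-size workgroup at a given subgroup width. Wider subgroups mean
// fewer slots; hardcoding the sgSize=32 case would under-allocate on
// 16-lane hardware.
function subgroupSlots(workgroupSize, subgroupSize) {
    return Math.max(1, Math.ceil(workgroupSize / subgroupSize));
}

console.log(subgroupSlots(256, 32));  // 8 slots (typical desktop)
console.log(subgroupSlots(256, 16));  // 16 slots (e.g. Mali at sgSize=16)
```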

Performance

Sort-only GPU time (forward render pass excluded), 24-bit keys, 20 warmup + 50 measured frames per cell. "OneSweep-safe" below is the CSDLDF-backed fallback that was later cut.
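The per-cell measurement loop can be sketched as follows, assuming a timeSortOnce callback that returns one frame's sort-only GPU time in milliseconds. Names are illustrative; the real benchmark example reads GPU timestamps:

```javascript
// Illustrative measurement loop: discard warmup frames (pipeline compile,
// cache warmup), then average the sort-only GPU time over measured frames.
async function measureCell(timeSortOnce, warmup = 20, measured = 50) {
    for (let i = 0; i < warmup; i++) {
        await timeSortOnce(); // result discarded
    }
    let total = 0;
    for (let i = 0; i < measured; i++) {
        total += await timeSortOnce();
    }
    return total / measured; // average sort-only GPU ms for this cell
}
```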

NVIDIA RTX 2070 (Turing) — OneSweep dominates across the full range (up to ~5× faster than 4-bit at 25M+). 8-coalesced is a respectable second but has no routing advantage over OneSweep. OneSweep-safe (CSDLDF) collapses badly past 5M — the reason the CSDLDF fallback was dropped.

(chart: NVIDIA RTX 2070 results)

NVIDIA Quadro M2000 (Ampere) — same shape as Turing: OneSweep wins, CSDLDF-backed variant degrades sharply past a few million elements.

(chart: NVIDIA Quadro M2000 results)

Apple M1 (metal-3) — 4-bit is the clear winner across the range that matters (4–8M target). 8-ranked regresses past 5M (0.53× at 20M) and 8-coalesced is similar. OneSweep (plain and CSDLDF-safe) is non-competitive on Apple because lookback stalls are uncapped. This chart drove the "ship 4-bit as the single portable fallback" decision.

(chart: Apple M1 results)

Apple M4 (metal-3) — very different story: 8-ranked is 1.2–1.7× faster than 4-bit across every size and even beats OneSweep. M4 is not the Apple priority though (it already has a fast local-compute GSplat path), and since we ship one Apple fallback, M1/M2 win the tie-break. Leaving 8-ranked in the branch history for a potential future M4-specific routing.

(chart: Apple M4 results)

Experiments that didn't make it

Several alternative radix-sort variants were explored before settling on "4-bit portable + OneSweep on NVIDIA". Listing them here so the reasoning isn't lost — the history is preserved on branch mv-radix-8bit-subgroup up to commit be5ffb304 if anyone needs to resurrect one.

  • CSDLDF (Chained-Scan Decoupled-Lookback Decoupled-Fallback) scan kernel, plus a dedicated scan-kernel factory that switched between CSDLDF and Blelloch. Intended to give OneSweep a portable fallback for devices without forward-thread-progress guarantees. On NVIDIA it validates correctly but serialises badly (tens to >100× slower than plain lookback at larger sizes). On Apple M4 it's slightly faster than plain OneSweep at 2–3M (10–20% win), but both OneSweep variants are dominated by 4-bit on Apple, so the theoretical advantage doesn't translate into a real win for any device in the current matrix. Removed for now — the algorithm itself is sound and may be worth resurrecting if a device shows up where plain lookback deadlocks and CSDLDF beats 4-bit.

  • 8-bit shared-memory radix sort (no subgroup intrinsics). Wider radix → fewer passes, but the resulting 256-bucket bitmasks blow out shared memory enough that Mali hit a recursion cliff and M4 regressed vs 4-bit. Removed.

  • 8-bit subgroup radix sort in four reorder variants (ballot, packed, ranked, coalesced). Uses subgroup intrinsics for ranking. Genuinely strong on Apple M4 — ranked is 1.2–1.7× faster than 4-bit across every size and beats OneSweep too. However, M1/M2 are the Apple priority (M4 already has a fast local-compute GSplat path), and on M1/M2 the ranked variant regresses past 5M (0.53× at 20M). Since we only ship one Apple fallback, M1/M2 perf wins, so 4-bit is the safer pick. 8-coalesced is also competitive on NVIDIA Ampere up to ~4M but OneSweep covers that range comfortably. Removed for simplicity — if an M4-specific (or desktop-class Apple) routing becomes worthwhile, ranked is the variant to revisit.

  • 6-bit shared-memory radix sort, with two optimisations on top (bank-conflict-friendly bitmask layout, shadow-copy bitmask reads). Attractive on paper as the middle ground between 4-bit and 8-bit. Apple M4 and Pixel 8 Pro looked promising, but Apple M1/M2 regressed vs 4-bit in the 4–8M target range (the range that actually matters for those devices — M1/M2 lacks the compute perf headroom of M4). Shadow-copy was reverted to test whether LDS-occupancy pressure explained the regression; the regression remained, so the cause is architectural rather than LDS pressure. Removed.

Net result: one portable backend (ComputeRadixSort, 4-bit) that wins or is close-second on every target, plus one NVIDIA-specialised backend (ComputeOneSweepRadixSort, OneSweep) for where the hardware/driver can actually deliver on its lookback-based fusion.

Platform coverage / benchmarks

Benchmarking was done on:

  • NVIDIA RTX 2070 (Turing), NVIDIA Quadro M2000 (Ampere)
  • Apple M1, M2, M4
  • iPhone 13 Pro (A15)
  • Android Pixel 8 Pro (Tensor G3 / Mali-G715 Valhall)
  • Android Pixel 10 Pro (Tensor G5 / Imagination IMG)

For the Unified GSplat manager's dynamic 10–20 bit key range, ComputeRadixSort validates and performs consistently across the full platform matrix; ComputeOneSweepRadixSort validates and wins by a meaningful margin on NVIDIA but fails validation on the mobile SoCs tested.

Test plan

  • Run benchmark on NVIDIA (RTX 2070+): both backends validate across 1K–50M elements and 4/8/16/24/32-bit widths
  • Run benchmark on Apple M1/M2/M4: ComputeRadixSort validates across full range; OneSweep should remain disabled
  • Run benchmark on Android (Pixel 8/10 Pro): ComputeRadixSort validates across full range
  • Existing GSplat scenes using gsplat-manager.js continue to sort correctly (still routes to ComputeRadixSort)
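The CPU-reference validation step in the plan above can be sketched like this, under illustrative names: mask random keys to the tested bit width, sort a copy on the CPU, and compare element-wise against the GPU-sorted readback.

```javascript
// Illustrative validation sketch (not the example's exact harness):
// returns the number of mismatched elements; 0 means the GPU sort
// matches the CPU reference at this key width.
function validateSort(gpuSorted, originalKeys, keyBits) {
    // (1 << 32) wraps in JS, so handle the full-width case explicitly
    const mask = keyBits >= 32 ? 0xffffffff : (1 << keyBits) - 1;
    // CPU reference: numeric ascending sort of the masked keys
    const reference = Array.from(originalKeys, k => (k & mask) >>> 0)
        .sort((a, b) => a - b);
    let failed = 0;
    for (let i = 0; i < reference.length; i++) {
        if (((gpuSorted[i] & mask) >>> 0) !== reference[i]) failed++;
    }
    return failed;
}
```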

@LeXXik
Contributor

LeXXik commented Apr 20, 2026

3090:

(screenshot: RTX 3090 benchmark results)

Note: the WebGPU GPU Sort step breaks here (black screen).

- ComputeRadixSort: 4-bit shared-memory multi-pass radix sort, WebGPU-only, runs on every device with no subgroup requirement. Selected as the portable fallback after benchmarking wider radixes (6-bit, 8-bit shared, 8-bit subgroup variants) on Apple M1/M2/M4, NVIDIA, and Android Mali/IMG; 4-bit is the single most consistent choice in every target range.

- ComputeOneSweepRadixSort: single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder, ported from b0nes164/GPUSorting. NVIDIA Turing+ only (gated by callers); requires forward-thread-progress guarantees that Apple Silicon and mobile Mali/Adreno do not provide reliably.

- Radix-sort benchmark example (test/radix-sort-compute): compares the two backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional validation against a CPU reference implementation.
@mvaligursky mvaligursky changed the title feat: experimental 8-bit radix sort with CSDLDF scan + subgroup variants feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA) Apr 23, 2026
Leftover from an earlier iteration that had a scan-kernel factory with Blelloch and CSDLDF variants. PrefixSumKernel.constructor only takes device, so the object was silently ignored.

Copilot AI left a comment


Pull request overview

Adds an engine-internal WebGPU OneSweep radix sort backend (plus WGSL kernels), improves the existing portable 4-bit compute radix sort’s dispatch/scan sizing, and ships an interactive benchmark/validation example to compare the backends.

Changes:

  • Add OneSweep WGSL kernels (global histogram, scan, fused binning/lookback).
  • Add ComputeOneSweepRadixSort implementation and export it.
  • Update ComputeRadixSort to decouple dispatch/scan size from buffer capacity and expose radixBits; expand the example to benchmark/validate both modes.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Changed files:

  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-scan.js: new WGSL scan kernel for OneSweep histogram prefixes.
  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-global-hist.js: new WGSL fused global histogram kernel (vec4 loads).
  • src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-binning.js: new WGSL fused binning + lookback kernel (OneSweep core).
  • src/scene/graphics/compute-radix-sort.js: dispatch/scan sizing fixed to follow current elementCount; add radixBits; rename GPU profiler pass labels.
  • src/scene/graphics/compute-onesweep-radix-sort.js: new OneSweep compute sort backend, shaders/formats, buffer management, dispatch flow.
  • src/index.js: export ComputeOneSweepRadixSort.
  • examples/src/examples/test/radix-sort-compute.example.mjs: add mode selection, overlays, benchmark + validation harnesses.
  • examples/src/examples/test/radix-sort-compute.controls.mjs: add UI controls for mode/render/validation and benchmark/validate actions.


Comment on lines +357 to +363
// `MAX_SUBGROUPS` is the number of subgroups in a 256-thread
// workgroup at this device's subgroup size. Both the scan and
// binning kernels size their per-subgroup reduction slots by this
// value; hardcoding it to 8 (the sgSize=32 case) is incorrect on
// hardware with smaller subgroups (e.g. Pixel / Mali at sgSize=16
// → 16 subgroups), so we specialize the shader per device here.
const maxSubgroups = Math.max(1, Math.ceil(256 / (device.maxSubgroupSize || 32)));

Copilot AI Apr 23, 2026


maxSubgroups is computed from device.maxSubgroupSize, but subgroup_size may be any value in [minSubgroupSize, maxSubgroupSize]. For workgroup-local arrays indexed by waveIndex = gtid / sgSize, you need to size for the minimum subgroup size (worst-case: most subgroups) or you can under-allocate and index out of bounds when the runtime subgroup size is smaller than the max, corrupting scan/binning results. Compute this from device.minSubgroupSize (with a safe fallback when it’s 0) and update the related shader comments accordingly.

Suggested change
// `MAX_SUBGROUPS` is the number of subgroups in a 256-thread
// workgroup at this device's subgroup size. Both the scan and
// binning kernels size their per-subgroup reduction slots by this
// value; hardcoding it to 8 (the sgSize=32 case) is incorrect on
// hardware with smaller subgroups (e.g. Pixel / Mali at sgSize=16
// → 16 subgroups), so we specialize the shader per device here.
const maxSubgroups = Math.max(1, Math.ceil(256 / (device.maxSubgroupSize || 32)));
// `MAX_SUBGROUPS` is the worst-case number of subgroups in a
// 256-thread workgroup. The runtime `subgroup_size` may be any
// value in [minSubgroupSize, maxSubgroupSize], and the scan and
// binning kernels size per-subgroup reduction slots indexed by
// `waveIndex = gtid / sgSize`. To avoid under-allocating those
// arrays when the runtime subgroup size is smaller than the max,
// size this from the minimum subgroup size, with a safe fallback
// for devices that do not report it.
const minSubgroupSize = device.minSubgroupSize || device.maxSubgroupSize || 32;
const maxSubgroups = Math.max(1, Math.ceil(256 / minSubgroupSize));

*/
constructor(device) {
Debug.assert(device.supportsCompute, 'ComputeOneSweepRadixSort requires compute shader support (WebGPU)');
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');

Copilot AI Apr 23, 2026


The OneSweep binning WGSL uses 32-bit lane masks (e.g. subgroupBallot(...).x and (1u << sgInvId)), which is only valid when the runtime subgroup size is ≤ 32. On devices with 64/128-lane subgroups this becomes incorrect/UB and can silently corrupt output. Add a runtime guard (e.g. assert device.maxSubgroupSize <= 32 and device.minSubgroupSize > 0) or extend the shader to handle full 128-bit subgroup ballots.

Suggested change
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
Debug.assert(device.supportsSubgroups, 'ComputeOneSweepRadixSort requires subgroup support');
Debug.assert(device.minSubgroupSize > 0, 'ComputeOneSweepRadixSort requires a valid runtime subgroup size');
Debug.assert(device.maxSubgroupSize <= 32, 'ComputeOneSweepRadixSort currently requires subgroup sizes <= 32 because the OneSweep binning shader uses 32-bit subgroup masks');

}

if (rowHasDetails) {
const detailIdx = detailRows.length;

Copilot AI Apr 23, 2026


The validation “Details” toggle index is inconsistent: the button uses data-toggle set to detailRows.length before pushing this size’s failing modes, but the hidden detail row uses data-detail-content set to detailRows.length - 1 after pushing. If a size has multiple failing modes, the button will point at a non-existent row and won’t open the details. Set the button’s data-toggle to the same id used by the inserted detail row (or compute a per-size id independent of detailRows.length).

Suggested change
const detailIdx = detailRows.length;
const failingCount = BENCH_CONFIGS.reduce((count, cfg) => {
    const entry = results.get(`${size}:${cfg.label}`);
    return count + ((entry && entry.failed > 0) ? 1 : 0);
}, 0);
const detailIdx = detailRows.length + failingCount - 1;

this._threadBlocks = currentThreadBlocks;

const maxPerDim = this.device.limits.maxComputeWorkgroupsPerDimension || 65535;
Compute.calcDispatchSize(currentThreadBlocks, this._binningDispatchSize, maxPerDim);

Copilot AI Apr 23, 2026


DigitBinningPass assigns tiles via an atomic counter (b_index) and then uses partitionIndex to index b_passHist. This requires that the dispatch launches exactly threadBlocks workgroups. Using Compute.calcDispatchSize can pad x*y above threadBlocks when threadBlocks > maxComputeWorkgroupsPerDimension, causing extra workgroups to claim partitionIndex >= threadBlocks and then read/write b_passHist out of bounds. Consider (a) asserting threadBlocks <= maxPerDim and dispatching 1D, or (b) adding an early-exit in WGSL when partitionIndex >= uniforms.threadBlocks before any passHistOffset access.

Suggested change
Compute.calcDispatchSize(currentThreadBlocks, this._binningDispatchSize, maxPerDim);
Debug.assert(
    currentThreadBlocks <= maxPerDim,
    `ComputeOneSweepRadixSort binning pass requires an exact 1D dispatch; threadBlocks (${currentThreadBlocks}) exceeds maxComputeWorkgroupsPerDimension (${maxPerDim}).`
);
this._binningDispatchSize.set(currentThreadBlocks, 1);
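The padding issue flagged in the comment above can be demonstrated with a small sketch. The function below mirrors the usual 2D dispatch split; it is an assumption for illustration, not the engine's Compute.calcDispatchSize:

```javascript
// Illustrative 2D dispatch split: x is capped at the per-dimension limit
// and y rounds up, so x * y can exceed the requested workgroup count.
// Tile-claiming kernels (atomic b_index / partitionIndex) therefore need
// either an exact 1D dispatch or a WGSL early-exit guard.
function calcDispatchSize2D(threadBlocks, maxPerDim) {
    const x = Math.min(threadBlocks, maxPerDim);
    const y = Math.ceil(threadBlocks / x);
    return { x, y, launched: x * y };
}

// 70000 blocks at a 65535 per-dimension limit launches 131070 workgroups;
// the extra 61070 would claim partitionIndex >= threadBlocks.
calcDispatchSize2D(70000, 65535); // { x: 65535, y: 2, launched: 131070 }
```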