perf: add radix sort for GSplat tile sorting on NVIDIA GPUs #8609

Draft
mvaligursky wants to merge 2 commits into main from mv-gsplat-radix-sort
@mvaligursky
Contributor

Add a per-tile radix sort pass as a faster alternative to bitonic sort for tiles with ≤1976 entries on NVIDIA GPUs. Uses a 5-pass, 4-bit radix sort with subgroupBallot-based stable scatter, operating entirely in 16KB of workgroup shared memory.
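
For reviewers less familiar with the technique, here is a minimal CPU reference of a stable least-significant-digit radix sort with 4-bit digits and 5 passes. It is a sketch, not the shader in this PR: it assumes sort keys fit in 20 bits (5 × 4), which the description implies but does not state, and the names `radixSortIndices`, `RADIX_BITS` and `NUM_PASSES` are hypothetical. The sequential scatter loop stands in for the subgroupBallot-based stable scatter that the GPU version performs in parallel.

```typescript
// CPU reference sketch of a stable LSD radix sort (not the PR's WGSL shader).
// Assumption: keys fit in 20 bits, so 5 passes of 4 bits sort them fully.
const RADIX_BITS = 4;
const NUM_PASSES = 5;
const NUM_BUCKETS = 1 << RADIX_BITS; // 16 buckets per pass

// Returns the indices of `keys` in ascending key order; ties keep input order.
function radixSortIndices(keys: Uint32Array): Uint32Array {
    const n = keys.length;
    let src = new Uint32Array(n);
    let dst = new Uint32Array(n);
    for (let i = 0; i < n; i++) src[i] = i;

    for (let pass = 0; pass < NUM_PASSES; pass++) {
        const shift = pass * RADIX_BITS;
        const offsets = new Uint32Array(NUM_BUCKETS);

        // 1. Histogram of the current 4-bit digit.
        for (let i = 0; i < n; i++) {
            offsets[(keys[src[i]] >>> shift) & (NUM_BUCKETS - 1)]++;
        }

        // 2. Exclusive prefix sum turns counts into bucket start offsets.
        let sum = 0;
        for (let b = 0; b < NUM_BUCKETS; b++) {
            const count = offsets[b];
            offsets[b] = sum;
            sum += count;
        }

        // 3. Stable scatter: visiting elements in order preserves ties.
        for (let i = 0; i < n; i++) {
            const digit = (keys[src[i]] >>> shift) & (NUM_BUCKETS - 1);
            dst[offsets[digit]++] = src[i];
        }

        // Ping-pong the buffers for the next pass.
        const tmp = src;
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Stability matters here because each pass relies on the order produced by the previous, less significant pass.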

Changes:

  • Add workgroup-local radix sort shader using subgroupBallot with .x-only optimization (correct for sgSize ≤ 32)
  • Three-tier tile classification when radix is enabled: radix (≤1976 entries), bitonic (1977–4096), large (>4096 via bucket+chunk)
  • Gate radix sort to NVIDIA only — Apple shows a performance regression vs bitonic; AMD wave64 (sgSize 64) would produce incorrect results with .x-only ballot; other vendors untested
  • Add GPU vendor detection via capsDefines (VENDOR_NVIDIA, VENDOR_APPLE, etc.) by overriding initCapsDefines() on WebgpuGraphicsDevice
  • Rename bitonic sort profiler labels to GSplatLocalTileBitonicSort / GSplatLocalChunkBitonicSort for clarity
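
To illustrate why the .x-only ballot is tied to sgSize ≤ 32, the sketch below simulates a ballot on the CPU. In WGSL, `subgroupBallot` returns a `vec4<u32>` whose `.x` component carries the votes of invocations 0 to 31, so a subgroup of 64 needs `.y` as well. The helper names (`simulateBallot`, `countVotesXOnly`, `countVotesFull`) are hypothetical and only model the bit layout; they are not taken from the shader in this PR.

```typescript
// CPU model of a subgroup ballot: four 32-bit words, like a vec4<u32>.
// Lane i sets bit (i % 32) of word floor(i / 32).
function simulateBallot(votes: boolean[]): Uint32Array {
    const words = new Uint32Array(4);
    votes.forEach((vote, lane) => {
        if (vote) words[lane >>> 5] |= 1 << (lane & 31);
    });
    return words;
}

// Counts set bits in a 32-bit value.
function popcount(x: number): number {
    let count = 0;
    for (let v = x >>> 0; v !== 0; v >>>= 1) count += v & 1;
    return count;
}

// Reads only the first word, mirroring a ".x-only" shortcut.
function countVotesXOnly(votes: boolean[]): number {
    return popcount(simulateBallot(votes)[0]);
}

// Reads all four words, which covers subgroups of up to 128 lanes.
function countVotesFull(votes: boolean[]): number {
    let total = 0;
    for (const word of simulateBallot(votes)) total += popcount(word);
    return total;
}

// With 32 lanes both counts agree; with 64 lanes the x-only count
// silently drops the votes of lanes 32 to 63.
const wave32 = new Array<boolean>(32).fill(true);
const wave64 = new Array<boolean>(64).fill(true);
console.log(countVotesXOnly(wave32), countVotesFull(wave32));
console.log(countVotesXOnly(wave64), countVotesFull(wave64));
```

In a scatter that ranks each element by counting matching lanes, an undercount like this would place elements at wrong offsets, which is consistent with the incorrect results the description anticipates for AMD wave64.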

Performance:

  • On NVIDIA (RTX 2070 Super), radix sort is faster than bitonic for tiles with ~1000+ entries, with increasing advantage at higher splat counts
  • On Apple M4 Max, radix sort shows a 5–50% regression vs bitonic — excluded via vendor check
  • No performance impact on non-NVIDIA hardware (radix path is completely skipped)
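
The tier thresholds and the vendor gate described above can be summarized in a small sketch. The thresholds (1976 and 4096) come from this description; the function names, the `TileSortPath` type and the `'nvidia'` vendor string are hypothetical and do not reflect the engine's actual API, which exposes the vendor through capsDefines such as VENDOR_NVIDIA.

```typescript
// Hypothetical sketch of the tile classification described in this PR.
type TileSortPath = 'radix' | 'bitonic' | 'large';

const RADIX_MAX_ENTRIES = 1976;   // upper bound for the radix tier
const BITONIC_MAX_ENTRIES = 4096; // upper bound for the bitonic tier

// Assumption: the vendor is available as a lowercase string. The radix
// path is enabled for NVIDIA only, as stated in the description.
function isRadixEnabled(vendor: string): boolean {
    return vendor === 'nvidia';
}

// With radix disabled, small tiles fall through to bitonic, so the
// classification collapses to the two non-radix tiers.
function classifyTile(entryCount: number, radixEnabled: boolean): TileSortPath {
    if (radixEnabled && entryCount <= RADIX_MAX_ENTRIES) return 'radix';
    if (entryCount <= BITONIC_MAX_ENTRIES) return 'bitonic';
    return 'large'; // handled via bucket+chunk
}
```

For example, `classifyTile(1500, isRadixEnabled('apple'))` takes the bitonic path, which matches the claim that non-NVIDIA hardware never runs the radix pass.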
