perf: add radix sort for GSplat tile sorting on NVIDIA GPUs by mvaligursky · Pull Request #8609 · playcanvas/engine

mvaligursky · 2026-04-17T10:18:02Z

Add a per-tile radix sort pass as a faster alternative to bitonic sort for tiles with ≤1976 entries on NVIDIA GPUs. Uses a 5-pass, 4-bit radix sort with subgroupBallot-based stable scatter, operating entirely in 16KB of workgroup shared memory.
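For readers who want to see the shape of the algorithm, here is a minimal CPU-side sketch in TypeScript, not the shader from this PR. It assumes 20-bit sort keys (inferred from 5 passes × 4 bits), omits the payload that would travel with each key, and emulates subgroups of 32 lanes one after another. Each ballot is modelled as a single 32-bit word, which corresponds to reading only `.x` of the `vec4<u32>` that WGSL's `subgroupBallot` returns. All names (`radixSortTile`, `popcount32`, `lowMask`) are hypothetical, and building the peer mask from one ballot per digit bit is one common way to get a stable scatter, assumed here for illustration.

```typescript
const RADIX_BITS = 4;                 // digit width per pass
const NUM_PASSES = 5;                 // 5 passes x 4 bits = 20 key bits (assumed key width)
const NUM_BUCKETS = 1 << RADIX_BITS;  // 16 digit values
const SUBGROUP_SIZE = 32;             // assumed lane count; one ballot fits in a u32

// Number of set bits in a 32-bit value.
function popcount32(x: number): number {
    let v = x >>> 0;
    let count = 0;
    while (v !== 0) {
        v = (v & (v - 1)) >>> 0; // clear the lowest set bit
        count++;
    }
    return count;
}

// Mask with the lowest n bits set (n in 0..32).
function lowMask(n: number): number {
    return n >= 32 ? 0xffffffff : ((1 << n) - 1) >>> 0;
}

// Stable LSD radix sort on the low 20 bits of each key.
function radixSortTile(keys: Uint32Array): Uint32Array {
    let src = new Uint32Array(keys); // copy, so the input is left untouched
    let dst = new Uint32Array(keys.length);

    for (let pass = 0; pass < NUM_PASSES; pass++) {
        const shift = pass * RADIX_BITS;

        // Histogram of the current digit, then exclusive prefix sum.
        const offsets = new Uint32Array(NUM_BUCKETS);
        for (let i = 0; i < src.length; i++) {
            offsets[(src[i] >>> shift) & (NUM_BUCKETS - 1)]++;
        }
        let sum = 0;
        for (let d = 0; d < NUM_BUCKETS; d++) {
            const c = offsets[d];
            offsets[d] = sum;
            sum += c;
        }

        // Scatter one emulated subgroup at a time, in order, to keep the sort stable.
        for (let base = 0; base < src.length; base += SUBGROUP_SIZE) {
            const lanes = Math.min(SUBGROUP_SIZE, src.length - base);
            const active = lowMask(lanes);

            // One emulated ballot per digit bit: which lanes have that bit set.
            const ballots: number[] = [];
            for (let b = 0; b < RADIX_BITS; b++) {
                let mask = 0;
                for (let lane = 0; lane < lanes; lane++) {
                    if ((src[base + lane] >>> (shift + b)) & 1) mask |= 1 << lane;
                }
                ballots.push(mask >>> 0);
            }

            const counts = new Uint32Array(NUM_BUCKETS);
            for (let lane = 0; lane < lanes; lane++) {
                const key = src[base + lane];
                const digit = (key >>> shift) & (NUM_BUCKETS - 1);

                // Peers = active lanes whose digit equals this lane's digit.
                let peers = active;
                for (let b = 0; b < RADIX_BITS; b++) {
                    const bitSet = (digit >>> b) & 1;
                    peers = (peers & (bitSet ? ballots[b] : ~ballots[b])) >>> 0;
                }

                // Rank = number of peers in lower lanes; this preserves input order.
                const rank = popcount32(peers & lowMask(lane));
                dst[offsets[digit] + rank] = key;
                counts[digit] = popcount32(peers);
            }

            // Advance each bucket cursor past what this subgroup wrote.
            for (let d = 0; d < NUM_BUCKETS; d++) offsets[d] += counts[d];
        }

        [src, dst] = [dst, src]; // ping-pong buffers between passes
    }
    return src;
}
```

Because every mask here is a single 32-bit word, the sketch only covers subgroups of up to 32 lanes. With 64 lanes, the votes of lanes 32–63 would sit in a second word that this code never reads, which is the failure mode the description attributes to AMD wave64.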

Changes:

- Add workgroup-local radix sort shader using subgroupBallot with .x-only optimization (correct for sgSize ≤ 32)
- Three-tier tile classification when radix is enabled: radix (≤1976 entries), bitonic (1977–4096), large (>4096 via bucket+chunk)
- Gate radix sort to NVIDIA only: Apple shows a performance regression vs bitonic; AMD wave64 (sgSize 64) would produce incorrect results with .x-only ballot; other vendors untested
- Add GPU vendor detection via capsDefines (VENDOR_NVIDIA, VENDOR_APPLE, etc.) by overriding initCapsDefines() on WebgpuGraphicsDevice
- Rename bitonic sort profiler labels to GSplatLocalTileBitonicSort / GSplatLocalChunkBitonicSort for clarity
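
The tier thresholds and the vendor gate from the list above can be sketched as follows. This is an illustration in TypeScript, not code from the PR: `classifyTile`, `isRadixEnabled`, and the string-array stand-in for the device's caps defines are hypothetical, and the fallback to two tiers when radix is disabled is an assumption based on the description.

```typescript
type TileSort = 'radix' | 'bitonic' | 'large';

const RADIX_MAX_ENTRIES = 1976;   // upper bound for the radix tier (from the description)
const BITONIC_MAX_ENTRIES = 4096; // upper bound for the bitonic tier (from the description)

// Pick the sort path for a tile from its entry count.
// When radix is disabled, small tiles are assumed to fall back to bitonic.
function classifyTile(entryCount: number, radixEnabled: boolean): TileSort {
    if (entryCount > BITONIC_MAX_ENTRIES) return 'large'; // bucket + chunk path
    if (radixEnabled && entryCount <= RADIX_MAX_ENTRIES) return 'radix';
    return 'bitonic';
}

// Hypothetical gate: enable radix only when the NVIDIA vendor define is present.
// `defines` stands in for the names exposed through capsDefines.
function isRadixEnabled(defines: readonly string[]): boolean {
    return defines.indexOf('VENDOR_NVIDIA') !== -1;
}
```

With this shape, a tile of 1976 entries takes the radix path on NVIDIA and the bitonic path elsewhere, while a tile of 4097 entries takes the large path on any vendor.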

Performance:

- On NVIDIA (RTX 2070 Super), radix sort is faster than bitonic for tiles with roughly 1000 or more entries, and the advantage grows at higher splat counts
- On Apple M4 Max, radix sort shows a 5–50% regression vs bitonic, so it is excluded via the vendor check
- No performance impact on non-NVIDIA hardware (the radix path is skipped entirely)

mvaligursky self-assigned this Apr 17, 2026

mvaligursky marked this pull request as draft April 17, 2026 10:18

Merge branch 'main' into mv-gsplat-radix-sort

ca5f610

mvaligursky wants to merge 2 commits into main from mv-gsplat-radix-sort
