
Add vmem chunked allocator#516

Draft
mawad-amd wants to merge 19 commits into main from muhaawad/vmem-chunked-allocator

Conversation

@mawad-amd
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

mawad-amd and others added 19 commits March 25, 2026 01:52
New allocator design:
- Reserve large VA range up front (cheap, just address space)
- Map physical memory in large chunks (256 MiB default)
- hipMemSetAccess called once per chunk, not per allocation
- Sub-allocate with bump pointer, power-of-two free lists for reuse
- GC via weakref finalizers on tensor.untyped_storage()
- Free/reuse is pure bookkeeping (no HIP calls, no physical remap)
- refresh_peer_access only triggered on chunk growth, not every allocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
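The sub-allocation scheme in the design above (bump pointer plus power-of-two free lists, with free/reuse as pure bookkeeping) can be sketched in plain Python. This is an illustrative model, not the PR's actual class; the name `ChunkSubAllocator` and its method names are assumptions.

```python
class ChunkSubAllocator:
    """Illustrative sketch (not the PR's code): bump-pointer sub-allocation
    within one mapped chunk, with power-of-two free lists so freed blocks
    are reused without any HIP calls or physical remapping."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.offset = 0              # bump pointer into the chunk
        self.free_lists = {}         # size class -> [freed offsets]

    @staticmethod
    def size_class(nbytes):
        # Round the request up to the next power of two.
        return 1 << max(0, (nbytes - 1).bit_length())

    def allocate(self, nbytes):
        cls = self.size_class(nbytes)
        bucket = self.free_lists.get(cls)
        if bucket:                   # reuse: pure bookkeeping, no HIP calls
            return bucket.pop()
        if self.offset + cls > self.chunk_size:
            return None              # caller must map another chunk
        off = self.offset
        self.offset += cls           # bump
        return off

    def free(self, offset, nbytes):
        # No unmap/remap: push the block onto its size-class free list.
        self.free_lists.setdefault(self.size_class(nbytes), []).append(offset)
```

Because `hipMemSetAccess` runs once per mapped chunk, every `allocate`/`free` pair here stays on the CPU-side bookkeeping fast path.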
The test was using 10 float32 elements (40 bytes) for "small", which
SymmetricHeap.allocate() rounds up to granularity/4 = 1024 elements
on MI355X (4KiB granularity). This puts "small" in the same
power-of-two bucket as "medium" (1024 elements), causing pointer
swaps on free-list reuse.

Fix: derive test sizes from the allocator's actual granularity so
each allocation lands in a distinct power-of-two bucket (1x, 4x, 16x
granularity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
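The bucket collision and its fix reduce to a little arithmetic, sketched below under the assumptions stated in the message (4 KiB granularity, float32 elements, power-of-two size classes); the helper names are made up for illustration.

```python
GRANULARITY = 4096                  # MI355X allocation granularity, bytes

def rounded_elements(n_float32):
    # SymmetricHeap.allocate() rounds element counts up to granularity/4
    # (float32 is 4 bytes), per the commit message.
    step = GRANULARITY // 4
    return ((n_float32 + step - 1) // step) * step

def bucket(n_elems):
    # Power-of-two size class used by the free lists.
    return 1 << (n_elems - 1).bit_length()

# "small" (10 elements) rounds up to 1024 and shares a bucket with "medium":
assert rounded_elements(10) == 1024
assert bucket(rounded_elements(10)) == bucket(1024)

# Deriving test sizes from granularity (1x, 4x, 16x) keeps buckets distinct:
sizes = [(GRANULARITY // 4) * m for m in (1, 4, 16)]
assert len({bucket(s) for s in sizes}) == 3
```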
…UF imports

DMA-BUF handles imported from PyTorch's default allocator (not VMem-created)
already have device access set. hipMemSetAccess fails with "invalid argument"
on such handles. The mem_map is sufficient for the VA mapping; treat the
set_access error as non-fatal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
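The non-fatal handling described above might look like the following sketch. `HipError`, `map_imported_chunk`, and the callable parameters are hypothetical stand-ins for the real binding calls; only the error-swallowing shape is taken from the commit message.

```python
class HipError(RuntimeError):
    """Hypothetical stand-in for a HIP runtime error."""

def map_imported_chunk(mem_map, set_access, va, size, handle, device):
    # mem_map must succeed: without it there is no VA mapping at all.
    mem_map(va, size, handle)
    try:
        set_access(va, size, device)
    except HipError:
        # DMA-BUF handles imported from PyTorch's default allocator already
        # have device access set, so hipMemSetAccess fails with "invalid
        # argument". The mapping is still valid; treat this as non-fatal.
        pass
```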
import_external_tensor creates pseudo-chunks with DMA-BUF imported handles
that cannot be re-exported via mem_export_to_shareable_handle. Track these
in a separate _import_chunks list so get_allocation_chunks() only returns
VMem-created chunks that can be safely shared with peers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
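A minimal sketch of that bookkeeping split, assuming attribute names that mirror the commit message (`_chunks`, `_import_chunks`); the surrounding class is illustrative, not the PR's code.

```python
class ChunkRegistry:
    """Sketch: keep DMA-BUF pseudo-chunks out of the shareable set."""

    def __init__(self):
        self._chunks = []          # VMem-created: re-exportable to peers
        self._import_chunks = []   # DMA-BUF imports: cannot be re-exported

    def add_vmem_chunk(self, chunk):
        self._chunks.append(chunk)

    def add_imported_chunk(self, chunk):
        self._import_chunks.append(chunk)

    def get_allocation_chunks(self):
        # Only VMem-created chunks can go through
        # mem_export_to_shareable_handle when sharing with peers.
        return list(self._chunks)
```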
The 64 GiB default VA reservation per rank caused hipIpcGetMemHandle
failures when NCCL tried to allocate IPC-compatible memory after many
tests created and destroyed iris contexts.

Changes:
- Default VA size is now auto-sized to 8x heap_size (min 256 MiB)
  instead of a fixed 64 GiB
- Add SymmetricHeap.close() to free _peer_va_ranges and fd sockets
  that were previously leaked on context destruction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cap chunk_size to heap_size to avoid a single chunk consuming the
  entire VA range (was 256 MiB chunk for 1 MiB heap = no room to grow)
- Increase VA multiplier to 16x heap_size for growth + import headroom
- Fix SymmetricHeap.__del__ to handle partial init and Python shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
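The two sizing rules above are simple arithmetic; a sketch with an assumed helper name:

```python
MiB = 1 << 20

def plan_va_reservation(heap_size, default_chunk=256 * MiB):
    """Illustrative sizing rules from the commit message: cap the chunk to
    the heap so one chunk cannot consume the whole VA range, and reserve
    16x heap_size (min 256 MiB) as headroom for growth and imports."""
    chunk_size = min(default_chunk, heap_size)
    va_size = max(16 * heap_size, 256 * MiB)
    return chunk_size, va_size

# A 1 MiB heap previously got a 256 MiB chunk, leaving no room to grow:
assert plan_va_reservation(1 * MiB) == (1 * MiB, 256 * MiB)
```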
Add torch.cuda.synchronize() before unmapping VMem chunks in close().
Async GPU kernels (.zero_(), .fill_()) may still be accessing mapped
memory when close() is called. Unmapping while kernels are in-flight
causes a GPU page fault that poisons the HIP runtime state, making
all subsequent GPU operations fail with hipErrorUnknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Async GPU operations (NCCL collectives, .zero_(), .fill_()) may still
reference mapped virtual addresses when close() is called. Freeing VA
ranges while kernels are in-flight causes hipErrorUnknown, which
poisons the HIP runtime state and fails all subsequent GPU operations.

Add torch.cuda.synchronize() at both SymmetricHeap and
VMemChunkedAllocator close() entry points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
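The teardown ordering those two fixes establish can be sketched abstractly. `close_heap` and its callable parameters are stand-ins (`synchronize` playing the role of `torch.cuda.synchronize()`); only the ordering is taken from the commits.

```python
def close_heap(synchronize, unmap_chunk, free_va, chunks, va_ranges):
    """Ordering sketch: drain in-flight GPU work before tearing down
    mappings. Unmapping or freeing VA while kernels (.zero_(), .fill_(),
    NCCL collectives) are still running page-faults and poisons the HIP
    runtime, turning every later GPU call into hipErrorUnknown."""
    synchronize()                 # wait for all async GPU work first
    for chunk in chunks:
        unmap_chunk(chunk)        # then unmap physical chunks
    for va in va_ranges:
        free_va(va)               # only now release the address space
```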
The root cause of hipErrorUnknown after multiple context create/destroy
cycles was improper cleanup of peer-imported VMem mappings:

1. In _refresh_peer_access_chunked(), imported handles from
   mem_import_from_shareable_handle() were local variables that leaked
   — never stored for later cleanup.

2. In SymmetricHeap.close(), peer VA ranges were freed via
   mem_address_free() WITHOUT first calling mem_unmap() on the
   chunks mapped into those VA ranges.

3. The imported handles were never released via mem_release().

Calling mem_address_free() on a VA range with active mappings
corrupts HIP runtime state, causing hipErrorUnknown on all
subsequent GPU operations across new context cycles.

Fix:
- Track all peer-imported handles and their VA mappings in
  _peer_imported_mappings dict.
- In close(), unmap and release all peer-imported chunks BEFORE
  calling mem_address_free() on the peer VA ranges.
- Fix same bug in _refresh_peer_access_segmented() path.
- Fix Iris.__del__() to call heap.close() instead of just
  allocator.close(), ensuring peer VA cleanup actually runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
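The fixed cleanup order can be sketched as follows, with the `_peer_imported_mappings` name taken from the message and the function signature invented for illustration:

```python
def close_peer_mappings(peer_imported_mappings, mem_unmap, mem_release,
                        mem_address_free, peer_va_ranges):
    """Sketch of the corrected teardown: unmap and release every
    peer-imported chunk BEFORE freeing the VA ranges that contain them.
    Freeing a VA range with live mappings corrupts HIP runtime state."""
    for (va, size), handle in list(peer_imported_mappings.items()):
        mem_unmap(va, size)
        mem_release(handle)       # these handles previously leaked
    peer_imported_mappings.clear()
    for base, size in peer_va_ranges:
        mem_address_free(base, size)
```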
…sors

The VMem API (hipMemImportFromShareableHandle + hipMemMap + hipMemSetAccess)
does not work for importing DMA-BUF handles exported from hipMalloc-backed
PyTorch allocations. hipMemSetAccess returns hipErrorInvalidValue on such
handles, leaving the mapping inaccessible and corrupting subsequent GPU ops.

Switch to the External Memory API (hipImportExternalMemory +
hipExternalMemoryGetMappedBuffer) which correctly handles DMA-BUF fds
from any source including PyTorch's caching allocator.

Also update owns_tensor() to check imported external memory ranges since
they are no longer mapped into the allocator's VA range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
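Since externally imported buffers no longer live inside the allocator's reserved VA range, the ownership check becomes a two-part range test. A sketch with assumed names (`owns_ptr`, `imported_ranges`), reduced to pointer arithmetic:

```python
class OwnershipCheck:
    """Sketch of an owns_tensor-style check: a pointer belongs to the
    allocator if it falls inside the reserved VA range, or inside one of
    the DMA-BUF ranges mapped via the External Memory API (which sit
    outside the VA range)."""

    def __init__(self, va_base, va_size):
        self.va_base, self.va_size = va_base, va_size
        self.imported_ranges = []      # (base, size) per external import

    def owns_ptr(self, ptr):
        if self.va_base <= ptr < self.va_base + self.va_size:
            return True
        return any(b <= ptr < b + s for b, s in self.imported_ranges)
```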
Synchronize GPU before del to ensure async ops release storage refs,
and add a second gc.collect() pass to handle reference cycles that
may prevent the weakref finalizer from firing on the first pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
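The finalizer mechanism and the double-collect pass are plain Python; a self-contained sketch (the `Storage` class stands in for `tensor.untyped_storage()`):

```python
import gc
import weakref

class Storage:
    """Stand-in for tensor.untyped_storage(); illustrative only."""

freed = []

def track(storage, block_id):
    # GC hook: when the storage object dies, the finalizer returns its
    # block to the free lists (here, just records the block id).
    weakref.finalize(storage, freed.append, block_id)

s = Storage()
track(s, 7)
del s
# A reference cycle can keep the storage alive past the first pass,
# so collect twice before relying on the finalizer having fired.
gc.collect()
gc.collect()
assert freed == [7]
```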
The num_free_blocks check depends on weakref finalizer timing which
varies across test orderings. GC-based free/reuse is already covered
by test_chunked_gc_free_reuse and test_chunked_gc_multiple_reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VA reservation (hipMemAddressReserve) is just address space — no
physical memory cost. 128 GiB provides ample headroom for growth
and imports without risk of VA exhaustion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted
MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported
memory objects, causing null dereference in hipMemSetAccess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions Bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Apr 23, 2026