
Add vmem chunked allocator#516

Draft
mawad-amd wants to merge 19 commits into main from muhaawad/vmem-chunked-allocator

Conversation

@mawad-amd
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

mawad-amd and others added 19 commits March 25, 2026 01:52
New allocator design:
- Reserve large VA range up front (cheap, just address space)
- Map physical memory in large chunks (256 MiB default)
- hipMemSetAccess called once per chunk, not per allocation
- Sub-allocate with bump pointer, power-of-two free lists for reuse
- GC via weakref finalizers on tensor.untyped_storage()
- Free/reuse is pure bookkeeping (no HIP calls, no physical remap)
- refresh_peer_access only triggered on chunk growth, not every allocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
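The sub-allocation scheme in the design above (bump pointer plus power-of-two free lists, with free/reuse as pure bookkeeping) can be sketched in plain Python. This is an illustrative model, not the PR's actual class; the name `ChunkSubAllocator` and its method names are assumptions.

```python
class ChunkSubAllocator:
    """Illustrative sketch (not the PR's code): bump-pointer sub-allocation
    within one mapped chunk, with power-of-two free lists so freed blocks
    are reused without any HIP calls or physical remapping."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.offset = 0              # bump pointer into the chunk
        self.free_lists = {}         # size class -> [freed offsets]

    @staticmethod
    def size_class(nbytes):
        # Round the request up to the next power of two.
        return 1 << max(0, (nbytes - 1).bit_length())

    def allocate(self, nbytes):
        cls = self.size_class(nbytes)
        bucket = self.free_lists.get(cls)
        if bucket:                   # reuse: pure bookkeeping, no HIP calls
            return bucket.pop()
        if self.offset + cls > self.chunk_size:
            return None              # caller must map another chunk
        off = self.offset
        self.offset += cls           # bump
        return off

    def free(self, offset, nbytes):
        # No unmap/remap: push the block onto its size-class free list.
        self.free_lists.setdefault(self.size_class(nbytes), []).append(offset)
```

Because `hipMemSetAccess` runs once per mapped chunk, every `allocate`/`free` pair here stays on the CPU-side bookkeeping fast path.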
The test was using 10 float32 elements (40 bytes) for "small", which
SymmetricHeap.allocate() rounds up to granularity/4 = 1024 elements
on MI355X (4KiB granularity). This puts "small" in the same
power-of-two bucket as "medium" (1024 elements), causing pointer
swaps on free-list reuse.

Fix: derive test sizes from the allocator's actual granularity so
each allocation lands in a distinct power-of-two bucket (1x, 4x, 16x
granularity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
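The bucket collision and its fix reduce to a little arithmetic, sketched below under the assumptions stated in the message (4 KiB granularity, float32 elements, power-of-two size classes); the helper names are made up for illustration.

```python
GRANULARITY = 4096                  # MI355X allocation granularity, bytes

def rounded_elements(n_float32):
    # SymmetricHeap.allocate() rounds element counts up to granularity/4
    # (float32 is 4 bytes), per the commit message.
    step = GRANULARITY // 4
    return ((n_float32 + step - 1) // step) * step

def bucket(n_elems):
    # Power-of-two size class used by the free lists.
    return 1 << (n_elems - 1).bit_length()

# "small" (10 elements) rounds up to 1024 and shares a bucket with "medium":
assert rounded_elements(10) == 1024
assert bucket(rounded_elements(10)) == bucket(1024)

# Deriving test sizes from granularity (1x, 4x, 16x) keeps buckets distinct:
sizes = [(GRANULARITY // 4) * m for m in (1, 4, 16)]
assert len({bucket(s) for s in sizes}) == 3
```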
…UF imports

DMA-BUF handles imported from PyTorch's default allocator (not VMem-created)
already have device access set. hipMemSetAccess fails with "invalid argument"
on such handles. The mem_map is sufficient for the VA mapping; treat the
set_access error as non-fatal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
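The non-fatal handling described above might look like the following sketch. `HipError`, `map_imported_chunk`, and the callable parameters are hypothetical stand-ins for the real binding calls; only the error-swallowing shape is taken from the commit message.

```python
class HipError(RuntimeError):
    """Hypothetical stand-in for a HIP runtime error."""

def map_imported_chunk(mem_map, set_access, va, size, handle, device):
    # mem_map must succeed: without it there is no VA mapping at all.
    mem_map(va, size, handle)
    try:
        set_access(va, size, device)
    except HipError:
        # DMA-BUF handles imported from PyTorch's default allocator already
        # have device access set, so hipMemSetAccess fails with "invalid
        # argument". The mapping is still valid; treat this as non-fatal.
        pass
```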
import_external_tensor creates pseudo-chunks with DMA-BUF imported handles
that cannot be re-exported via mem_export_to_shareable_handle. Track these
in a separate _import_chunks list so get_allocation_chunks() only returns
VMem-created chunks that can be safely shared with peers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
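A minimal sketch of that bookkeeping split, assuming attribute names that mirror the commit message (`_chunks`, `_import_chunks`); the surrounding class is illustrative, not the PR's code.

```python
class ChunkRegistry:
    """Sketch: keep DMA-BUF pseudo-chunks out of the shareable set."""

    def __init__(self):
        self._chunks = []          # VMem-created: re-exportable to peers
        self._import_chunks = []   # DMA-BUF imports: cannot be re-exported

    def add_vmem_chunk(self, chunk):
        self._chunks.append(chunk)

    def add_imported_chunk(self, chunk):
        self._import_chunks.append(chunk)

    def get_allocation_chunks(self):
        # Only VMem-created chunks can go through
        # mem_export_to_shareable_handle when sharing with peers.
        return list(self._chunks)
```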
The 64 GiB default VA reservation per rank caused hipIpcGetMemHandle
failures when NCCL tried to allocate IPC-compatible memory after many
tests created and destroyed iris contexts.

Changes:
- Default VA size is now auto-sized to 8x heap_size (min 256 MiB)
  instead of a fixed 64 GiB
- Add SymmetricHeap.close() to free _peer_va_ranges and fd sockets
  that were previously leaked on context destruction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cap chunk_size to heap_size to avoid a single chunk consuming the
  entire VA range (was 256 MiB chunk for 1 MiB heap = no room to grow)
- Increase VA multiplier to 16x heap_size for growth + import headroom
- Fix SymmetricHeap.__del__ to handle partial init and Python shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
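The two sizing rules above are simple arithmetic; a sketch with an assumed helper name:

```python
MiB = 1 << 20

def plan_va_reservation(heap_size, default_chunk=256 * MiB):
    """Illustrative sizing rules from the commit message: cap the chunk to
    the heap so one chunk cannot consume the whole VA range, and reserve
    16x heap_size (min 256 MiB) as headroom for growth and imports."""
    chunk_size = min(default_chunk, heap_size)
    va_size = max(16 * heap_size, 256 * MiB)
    return chunk_size, va_size

# A 1 MiB heap previously got a 256 MiB chunk, leaving no room to grow:
assert plan_va_reservation(1 * MiB) == (1 * MiB, 256 * MiB)
```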
Add torch.cuda.synchronize() before unmapping VMem chunks in close().
Async GPU kernels (.zero_(), .fill_()) may still be accessing mapped
memory when close() is called. Unmapping while kernels are in-flight
causes a GPU page fault that poisons the HIP runtime state, making
all subsequent GPU operations fail with hipErrorUnknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Async GPU operations (NCCL collectives, .zero_(), .fill_()) may still
reference mapped virtual addresses when close() is called. Freeing VA
ranges while kernels are in-flight causes hipErrorUnknown, which
poisons the HIP runtime state and fails all subsequent GPU operations.

Add torch.cuda.synchronize() at both SymmetricHeap and
VMemChunkedAllocator close() entry points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
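The teardown ordering those two fixes establish can be sketched abstractly. `close_heap` and its callable parameters are stand-ins (`synchronize` playing the role of `torch.cuda.synchronize()`); only the ordering is taken from the commits.

```python
def close_heap(synchronize, unmap_chunk, free_va, chunks, va_ranges):
    """Ordering sketch: drain in-flight GPU work before tearing down
    mappings. Unmapping or freeing VA while kernels (.zero_(), .fill_(),
    NCCL collectives) are still running page-faults and poisons the HIP
    runtime, turning every later GPU call into hipErrorUnknown."""
    synchronize()                 # wait for all async GPU work first
    for chunk in chunks:
        unmap_chunk(chunk)        # then unmap physical chunks
    for va in va_ranges:
        free_va(va)               # only now release the address space
```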
The root cause of hipErrorUnknown after multiple context create/destroy
cycles was improper cleanup of peer-imported VMem mappings:

1. In _refresh_peer_access_chunked(), imported handles from
   mem_import_from_shareable_handle() were local variables that leaked
   — never stored for later cleanup.

2. In SymmetricHeap.close(), peer VA ranges were freed via
   mem_address_free() WITHOUT first calling mem_unmap() on the
   chunks mapped into those VA ranges.

3. The imported handles were never released via mem_release().

Calling mem_address_free() on a VA range with active mappings
corrupts HIP runtime state, causing hipErrorUnknown on all
subsequent GPU operations across new context cycles.

Fix:
- Track all peer-imported handles and their VA mappings in
  _peer_imported_mappings dict.
- In close(), unmap and release all peer-imported chunks BEFORE
  calling mem_address_free() on the peer VA ranges.
- Fix same bug in _refresh_peer_access_segmented() path.
- Fix Iris.__del__() to call heap.close() instead of just
  allocator.close(), ensuring peer VA cleanup actually runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
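The fixed cleanup order can be sketched as follows, with the `_peer_imported_mappings` name taken from the message and the function signature invented for illustration:

```python
def close_peer_mappings(peer_imported_mappings, mem_unmap, mem_release,
                        mem_address_free, peer_va_ranges):
    """Sketch of the corrected teardown: unmap and release every
    peer-imported chunk BEFORE freeing the VA ranges that contain them.
    Freeing a VA range with live mappings corrupts HIP runtime state."""
    for (va, size), handle in list(peer_imported_mappings.items()):
        mem_unmap(va, size)
        mem_release(handle)       # these handles previously leaked
    peer_imported_mappings.clear()
    for base, size in peer_va_ranges:
        mem_address_free(base, size)
```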
…sors

The VMem API (hipMemImportFromShareableHandle + hipMemMap + hipMemSetAccess)
does not work for importing DMA-BUF handles exported from hipMalloc-backed
PyTorch allocations. hipMemSetAccess returns hipErrorInvalidValue on such
handles, leaving the mapping inaccessible and corrupting subsequent GPU ops.

Switch to the External Memory API (hipImportExternalMemory +
hipExternalMemoryGetMappedBuffer) which correctly handles DMA-BUF fds
from any source including PyTorch's caching allocator.

Also update owns_tensor() to check imported external memory ranges since
they are no longer mapped into the allocator's VA range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
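Since externally imported buffers no longer live inside the allocator's reserved VA range, the ownership check becomes a two-part range test. A sketch with assumed names (`owns_ptr`, `imported_ranges`), reduced to pointer arithmetic:

```python
class OwnershipCheck:
    """Sketch of an owns_tensor-style check: a pointer belongs to the
    allocator if it falls inside the reserved VA range, or inside one of
    the DMA-BUF ranges mapped via the External Memory API (which sit
    outside the VA range)."""

    def __init__(self, va_base, va_size):
        self.va_base, self.va_size = va_base, va_size
        self.imported_ranges = []      # (base, size) per external import

    def owns_ptr(self, ptr):
        if self.va_base <= ptr < self.va_base + self.va_size:
            return True
        return any(b <= ptr < b + s for b, s in self.imported_ranges)
```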
Synchronize GPU before del to ensure async ops release storage refs,
and add a second gc.collect() pass to handle reference cycles that
may prevent the weakref finalizer from firing on the first pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
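The finalizer mechanism and the double-collect pass are plain Python; a self-contained sketch (the `Storage` class stands in for `tensor.untyped_storage()`):

```python
import gc
import weakref

class Storage:
    """Stand-in for tensor.untyped_storage(); illustrative only."""

freed = []

def track(storage, block_id):
    # GC hook: when the storage object dies, the finalizer returns its
    # block to the free lists (here, just records the block id).
    weakref.finalize(storage, freed.append, block_id)

s = Storage()
track(s, 7)
del s
# A reference cycle can keep the storage alive past the first pass,
# so collect twice before relying on the finalizer having fired.
gc.collect()
gc.collect()
assert freed == [7]
```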
The num_free_blocks check depends on weakref finalizer timing which
varies across test orderings. GC-based free/reuse is already covered
by test_chunked_gc_free_reuse and test_chunked_gc_multiple_reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VA reservation (hipMemAddressReserve) is just address space — no
physical memory cost. 128 GiB provides ample headroom for growth
and imports without risk of VA exhaustion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted
MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported
memory objects, causing null dereference in hipMemSetAccess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions Bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Apr 23, 2026