New allocator design:

- Reserve large VA range up front (cheap, just address space)
- Map physical memory in large chunks (256 MiB default)
- hipMemSetAccess called once per chunk, not per allocation
- Sub-allocate with bump pointer, power-of-two free lists for reuse
- GC via weakref finalizers on tensor.untyped_storage()
- Free/reuse is pure bookkeeping (no HIP calls, no physical remap)
- refresh_peer_access only triggered on chunk growth, not every allocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
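The sub-allocation scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual iris implementation: a bump pointer carves fresh blocks out of an already-mapped chunk, and freed blocks go into power-of-two free lists so reuse is pure bookkeeping. The names (`ChunkAllocator`, `alloc`, `free`) are hypothetical.

```python
class ChunkAllocator:
    """Toy bump-pointer sub-allocator with power-of-two free lists."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.bump = 0                                # next unused offset
        self.free_lists: dict[int, list[int]] = {}   # rounded size -> offsets

    @staticmethod
    def _round_pow2(n: int) -> int:
        # Round a request up to the next power of two (its bucket size).
        return 1 << (n - 1).bit_length()

    def alloc(self, size: int) -> int:
        size = self._round_pow2(size)
        # Reuse a freed block from the matching bucket if one exists
        # (no HIP calls, no physical remap).
        bucket = self.free_lists.get(size)
        if bucket:
            return bucket.pop()
        # Otherwise bump-allocate from the chunk.
        if self.bump + size > self.chunk_size:
            raise MemoryError("chunk exhausted; a real allocator maps a new chunk")
        off = self.bump
        self.bump += size
        return off

    def free(self, offset: int, size: int) -> None:
        # Freeing is bookkeeping only: push the offset onto its bucket.
        self.free_lists.setdefault(self._round_pow2(size), []).append(offset)
```

Freeing and re-allocating the same rounded size returns the same offset without advancing the bump pointer, which is exactly the reuse behavior the commit describes.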
The test was using 10 float32 elements (40 bytes) for "small", which SymmetricHeap.allocate() rounds up to granularity/4 = 1024 elements on MI355X (4 KiB granularity). This puts "small" in the same power-of-two bucket as "medium" (1024 elements), causing pointer swaps on free-list reuse.

Fix: derive test sizes from the allocator's actual granularity so each allocation lands in a distinct power-of-two bucket (1x, 4x, 16x granularity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
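The bucket collision can be checked numerically. The sizes and 4 KiB granularity come from the commit message above; the helper names are illustrative, not the real test code.

```python
GRANULARITY = 4096   # 4 KiB allocation granularity on MI355X
ELEM = 4             # bytes per float32 element


def rounded_bytes(n_elems: int) -> int:
    # Round the request up to the granularity, as allocate() does.
    nbytes = n_elems * ELEM
    return -(-nbytes // GRANULARITY) * GRANULARITY


def bucket(nbytes: int) -> int:
    # Power-of-two free-list bucket index.
    return (nbytes - 1).bit_length()


# 10 elements (40 B) and 1024 elements (4096 B) both round to 4 KiB,
# so they share a bucket -- the collision the test hit.
assert bucket(rounded_bytes(10)) == bucket(rounded_bytes(1024))

# Deriving sizes from the granularity (1x, 4x, 16x) keeps buckets distinct.
sizes = [GRANULARITY // ELEM, 4 * GRANULARITY // ELEM, 16 * GRANULARITY // ELEM]
assert len({bucket(rounded_bytes(s)) for s in sizes}) == 3
```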
…UF imports

DMA-BUF handles imported from PyTorch's default allocator (not VMem-created) already have device access set. hipMemSetAccess fails with "invalid argument" on such handles. The mem_map is sufficient for the VA mapping; treat the set_access error as non-fatal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
import_external_tensor creates pseudo-chunks with DMA-BUF imported handles that cannot be re-exported via mem_export_to_shareable_handle. Track these in a separate _import_chunks list so get_allocation_chunks() only returns VMem-created chunks that can be safely shared with peers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
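The bookkeeping split can be sketched as follows. `Chunk` and `HeapBookkeeping` are stand-ins for the real classes, which carry HIP handles rather than integers; only the two-list structure matches the commit.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    handle: int        # stand-in for a HIP memory handle
    exportable: bool   # True for VMem-created, False for DMA-BUF imports


@dataclass
class HeapBookkeeping:
    _chunks: list = field(default_factory=list)         # VMem-created chunks
    _import_chunks: list = field(default_factory=list)  # imported pseudo-chunks

    def add_chunk(self, c: Chunk) -> None:
        (self._chunks if c.exportable else self._import_chunks).append(c)

    def get_allocation_chunks(self) -> list:
        # Only VMem-created chunks can go through
        # mem_export_to_shareable_handle, so imports are excluded.
        return list(self._chunks)
```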
The 64 GiB default VA reservation per rank caused hipIpcGetMemHandle failures when NCCL tried to allocate IPC-compatible memory after many tests created and destroyed iris contexts.

Changes:

- Default VA size is now auto-sized to 8x heap_size (min 256 MiB) instead of a fixed 64 GiB
- Add SymmetricHeap.close() to free _peer_va_ranges and fd sockets that were previously leaked on context destruction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
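The auto-sizing rule from this commit reduces to one expression; the helper name is hypothetical.

```python
MIB = 1 << 20


def default_va_size(heap_size: int) -> int:
    # 8x the heap size, floored at 256 MiB, replacing the fixed 64 GiB.
    return max(8 * heap_size, 256 * MIB)


assert default_va_size(1 * MIB) == 256 * MIB   # small heaps get the floor
assert default_va_size(1 << 30) == 8 << 30     # 1 GiB heap -> 8 GiB VA
```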
- Cap chunk_size to heap_size to avoid a single chunk consuming the entire VA range (was 256 MiB chunk for 1 MiB heap = no room to grow)
- Increase VA multiplier to 16x heap_size for growth + import headroom
- Fix SymmetricHeap.__del__ to handle partial init and Python shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
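The two sizing fixes in this commit can be expressed as small helpers (names hypothetical):

```python
MIB = 1 << 20


def effective_chunk_size(chunk_size: int, heap_size: int) -> int:
    # A 256 MiB chunk for a 1 MiB heap would consume the whole VA range,
    # so the chunk size is capped at the heap size.
    return min(chunk_size, heap_size)


def default_va_size(heap_size: int) -> int:
    # VA multiplier raised from 8x to 16x for growth + import headroom.
    return max(16 * heap_size, 256 * MIB)


assert effective_chunk_size(256 * MIB, 1 * MIB) == 1 * MIB
assert default_va_size(1 << 30) == 16 << 30
```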
Add torch.cuda.synchronize() before unmapping VMem chunks in close(). Async GPU kernels (.zero_(), .fill_()) may still be accessing mapped memory when close() is called. Unmapping while kernels are in-flight causes a GPU page fault that poisons the HIP runtime state, making all subsequent GPU operations fail with hipErrorUnknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Async GPU operations (NCCL collectives, .zero_(), .fill_()) may still reference mapped virtual addresses when close() is called. Freeing VA ranges while kernels are in-flight causes hipErrorUnknown, which poisons the HIP runtime state and fails all subsequent GPU operations. Add torch.cuda.synchronize() at both the SymmetricHeap and VMemChunkedAllocator close() entry points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root cause of hipErrorUnknown after multiple context create/destroy cycles was improper cleanup of peer-imported VMem mappings:

1. In _refresh_peer_access_chunked(), imported handles from mem_import_from_shareable_handle() were local variables that leaked: they were never stored for later cleanup.
2. In SymmetricHeap.close(), peer VA ranges were freed via mem_address_free() WITHOUT first calling mem_unmap() on the chunks mapped into those VA ranges.
3. The imported handles were never released via mem_release().

Calling mem_address_free() on a VA range with active mappings corrupts HIP runtime state, causing hipErrorUnknown on all subsequent GPU operations across new context cycles.

Fix:

- Track all peer-imported handles and their VA mappings in a _peer_imported_mappings dict.
- In close(), unmap and release all peer-imported chunks BEFORE calling mem_address_free() on the peer VA ranges.
- Fix the same bug in the _refresh_peer_access_segmented() path.
- Fix Iris.__del__() to call heap.close() instead of just allocator.close(), ensuring peer VA cleanup actually runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
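The required teardown order can be sketched with recording stubs. The `mem_*` functions here merely log calls and stand in for the HIP VMem calls named above; `PeerState` is a simplified stand-in for the heap's peer bookkeeping.

```python
calls = []  # records the order of simulated HIP VMem calls


def mem_unmap(va, size):
    calls.append(("unmap", va))


def mem_release(handle):
    calls.append(("release", handle))


def mem_address_free(va, size):
    calls.append(("address_free", va))


class PeerState:
    def __init__(self):
        # handle -> (va, size), populated when peer chunks are imported.
        self._peer_imported_mappings = {}
        self._peer_va_ranges = []

    def close(self):
        # 1) Unmap and release all peer-imported chunks first...
        for handle, (va, size) in self._peer_imported_mappings.items():
            mem_unmap(va, size)
            mem_release(handle)
        self._peer_imported_mappings.clear()
        # 2) ...only then is it safe to free the peer VA ranges.
        for va, size in self._peer_va_ranges:
            mem_address_free(va, size)
        self._peer_va_ranges.clear()


s = PeerState()
s._peer_imported_mappings[101] = (0x7000, 4096)
s._peer_va_ranges.append((0x7000, 1 << 20))
s.close()
assert calls == [("unmap", 0x7000), ("release", 101), ("address_free", 0x7000)]
```

The buggy version freed the VA range while the import was still mapped (step 2 before step 1), which is what corrupted the runtime state.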
…sors

The VMem API (hipMemImportFromShareableHandle + hipMemMap + hipMemSetAccess) does not work for importing DMA-BUF handles exported from hipMalloc-backed PyTorch allocations. hipMemSetAccess returns hipErrorInvalidValue on such handles, leaving the mapping inaccessible and corrupting subsequent GPU ops.

Switch to the External Memory API (hipImportExternalMemory + hipExternalMemoryGetMappedBuffer), which correctly handles DMA-BUF fds from any source, including PyTorch's caching allocator. Also update owns_tensor() to check imported external memory ranges, since they are no longer mapped into the allocator's VA range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
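The owns_tensor() change amounts to an extra range check. This is an illustrative sketch with hypothetical names, not the real method: ownership now covers both the allocator's own VA range and the tracked external-import ranges.

```python
class Ownership:
    """Toy ownership check over a VA range plus imported ranges."""

    def __init__(self, va_base: int, va_size: int):
        self.va_base, self.va_size = va_base, va_size
        self.import_ranges: list[tuple[int, int]] = []  # (base, size)

    def owns_ptr(self, ptr: int) -> bool:
        # Fast path: pointer lies inside the allocator's own VA range.
        if self.va_base <= ptr < self.va_base + self.va_size:
            return True
        # Imported external-memory buffers live outside that range,
        # so they need their own interval check.
        return any(b <= ptr < b + s for b, s in self.import_ranges)


o = Ownership(0x1000, 0x1000)
o.import_ranges.append((0x9000, 0x100))
assert o.owns_ptr(0x1800)        # inside the VA range
assert o.owns_ptr(0x9050)        # inside an imported range
assert not o.owns_ptr(0x5000)    # foreign pointer
```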
Synchronize GPU before del to ensure async ops release storage refs, and add a second gc.collect() pass to handle reference cycles that may prevent the weakref finalizer from firing on the first pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
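The cycle problem is reproducible with the standard library alone. `Storage` here is a stand-in for a tensor's untyped storage; the point is that a weakref finalizer on an object caught in a reference cycle does not fire on `del` alone and needs the cycle collector.

```python
import gc
import weakref


class Storage:
    pass


fired = []
s = Storage()
s.cycle = s                        # reference cycle keeps s alive past del
weakref.finalize(s, fired.append, True)

del s
assert fired == []                 # refcount drop alone is not enough
gc.collect()                       # cycle collector runs, finalizer fires
assert fired == [True]
```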
The num_free_blocks check depends on weakref finalizer timing, which varies across test orderings. GC-based free/reuse is already covered by test_chunked_gc_free_reuse and test_chunked_gc_multiple_reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VA reservation (hipMemAddressReserve) is just address space, with no physical memory cost. 128 GiB provides ample headroom for growth and imports without risk of VA exhaustion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
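The same principle can be demonstrated on the CPU side with `mmap` (Linux-specific, and an analogy to hipMemAddressReserve rather than the actual call): a PROT_NONE mapping reserves address space without committing any physical pages.

```python
import mmap

GIB = 1 << 30

# Reserve 128 GiB of address space with no access permissions (prot=0 is
# PROT_NONE). No physical pages are committed, so this succeeds even on
# machines with far less RAM.
reservation = mmap.mmap(-1, 128 * GIB,
                        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                        prot=0)
reserved = len(reservation)
reservation.close()
assert reserved == 128 * GIB
```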
hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported memory objects, causing a null dereference in hipMemSetAccess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>