feat: Megatron NCCL destroy/reload helpers and tests (Task 3) #4
Yankail97 wants to merge 3 commits into rlops:main from
Conversation
…ASK 3
Introduces nemo_rl.models.megatron.nccl_offload with two public helpers:
- destroy_megatron_nccl_groups(): walks megatron.core.parallel_state,
collects every NCCL ProcessGroup (TP/PP/DP/CP/EP and hybrid list-valued
holders), deduplicates by id, skips Gloo backends, calls
torch.distributed.destroy_process_group on each, then nulls out the
parallel_state global handles so model_parallel_is_initialized() returns
False. Captures a parallel-config snapshot (tp/pp/cp/ep + optional VPP)
for a subsequent reload call. Returns per-call stats (initialized_before,
num_groups_destroyed, vram_freed_bytes, state_snapshot) for
observability.
- reload_megatron_nccl_groups(state_snapshot): rebuilds NCCL groups via
parallel_state.initialize_model_parallel using the captured snapshot,
followed by a barrier. Idempotent; raises if called with an empty
snapshot.
Both helpers are idempotent, tolerate per-group destroy failures, and use
introspection (rather than a hardcoded global name list) so they adapt to
Megatron-core version changes.
These helpers are infrastructure only. They are not wired into any training
loop in this change; TASK 5 (Feature 11: DO_TIME_SHARING flag + grpo.py
branch) is responsible for invoking them at the correct phase of the
partial-overlap lifecycle. See plans/nemorl-port-plan.md Feature 11 and
task3-plan.md for context.
Tests: 27 mock-based unit tests in tests/unit/models/megatron/test_nccl_offload.py
covering group collection, dedup, Gloo filtering, snapshot round-trip,
idempotency, per-group failure tolerance, VPP handling, and the
destroy/reload API contract. All pass without GPU or a real Megatron
install.
Real GPU validation (Gate 2.5: dp=1/tp=2, 3+ destroy/reload cycles with
actual VRAM delta measurement) is out of scope for this commit and will
be added alongside the Gate 2.5 integration tests.
Made-with: Cursor
…tegration tests
Extends TASK 3 to the full Feature 11 / Gate 2.5 scope required by the
NeMo RL port plan. Two additions on top of the initial manual helper:
1. Official-mode fallback in destroy_megatron_nccl_groups()
The port plan mandates a fallback path using Megatron's official
parallel_state.destroy_model_parallel() that operators can switch to
if Gate 2.5 surfaces issues with the default manual path (VRAM not
released, stale handles, instability across cycles).
- New Literal type NcclOffloadMode = "manual" | "official"
- destroy_megatron_nccl_groups(mode=...) selects the path
- Stats now carry a "mode" field and use -1 as the sentinel for
num_groups_destroyed under the official path (that API does not
expose a count)
- Official path falls through to the manual path if the Megatron
build is too old to expose destroy_model_parallel, and resets
globals defensively if the official API raises
- reload_megatron_nccl_groups() is mode-agnostic: both paths reload
via parallel_state.initialize_model_parallel(**snapshot)
- 8 new mock-based unit tests covering each official-mode branch,
invalid mode rejection, and the manual default
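A minimal sketch of how the mode dispatch described in the bullets above could look. The helper names _destroy_manual/_destroy_official come from the commit text, but their bodies here are placeholders and the dict-shaped stats object is an assumption, not the actual implementation.

```python
from typing import Literal

NcclOffloadMode = Literal["manual", "official"]

def _destroy_manual(verbose: bool) -> dict:
    # Placeholder: the real path walks parallel_state and calls destroy_process_group per group.
    return {"num_groups_destroyed": 0, "vram_freed_bytes": 0}

def _destroy_official(verbose: bool) -> dict:
    # Placeholder: the real path calls parallel_state.destroy_model_parallel().
    return {"vram_freed_bytes": 0}

def destroy_megatron_nccl_groups(*, mode: NcclOffloadMode = "manual", verbose: bool = False) -> dict:
    if mode == "manual":
        stats = _destroy_manual(verbose)
    elif mode == "official":
        stats = _destroy_official(verbose)
        stats["num_groups_destroyed"] = -1  # official API exposes no per-group count
    else:
        raise ValueError(f"invalid NCCL offload mode: {mode!r}")
    stats["mode"] = mode
    return stats
```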
2. GPU multi-process integration tests (Gate 2.5 NCCL lifecycle)
Real-GPU coverage of the destroy/reload cycle. The tests spawn
torchrun as a subprocess to drive a 2-process Megatron NCCL lifecycle
and verify:
- parallel_state reports !initialized after destroy
- VRAM does not grow across destroy
- AllReduce succeeds after each reload
- Post-reload VRAM is stable across 3+ cycles
(tolerance: 200 MiB — catches real leaks, avoids allocator-caching
flakiness)
Test layout:
- tests/unit/models/megatron/_nccl_offload_gpu_worker.py — the
per-rank script invoked by torchrun. Only rank 0 writes a JSON
report; a non-zero exit is forwarded on failure.
- tests/unit/models/megatron/test_nccl_offload_gpu.py — the pytest
wrapper. Auto-skips when CUDA/torchrun/megatron.core is missing,
so CI and MacBook-local pytest runs are not broken.
Covers both mode="manual" and mode="official" so the fallback path
is validated alongside the default.
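An illustrative skeleton of the GPU-test wrapper described above. The file paths match the commit text, but the worker CLI flags (--mode/--cycles/--report) and the report keys are assumptions made for this sketch, not the real test code.

```python
import json
import shutil
import subprocess

import pytest
import torch

VRAM_TOLERANCE_BYTES = 200 * 1024 * 1024  # catches real leaks, tolerates allocator caching

def _has_megatron_core() -> bool:
    try:
        import megatron.core  # noqa: F401
        return True
    except ImportError:
        return False

@pytest.mark.parametrize("mode", ["manual", "official"])
def test_destroy_reload_cycles(mode, tmp_path):
    torchrun = shutil.which("torchrun")
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        pytest.skip("requires 2+ CUDA devices")
    if torchrun is None or not _has_megatron_core():
        pytest.skip("torchrun or megatron.core not available")

    report = tmp_path / "report.json"
    subprocess.run(
        [torchrun, "--nproc_per_node", "2",
         "tests/unit/models/megatron/_nccl_offload_gpu_worker.py",
         "--mode", mode, "--cycles", "3", "--report", str(report)],
        check=True,
    )
    result = json.loads(report.read_text())  # only rank 0 writes the report
    assert result["vram_drift_bytes"] <= VRAM_TOLERANCE_BYTES
```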
Test status: 35 unit tests pass locally (without GPU), 4 GPU integration
tests skip cleanly locally and are expected to run green on the Vast.ai
2x GPU environment described in docs/vastai-setup.md.
Made-with: Cursor
…destroy
Fixes the real root cause that blocked Gate 2.5 on RTX 5090. The manual
destroy path freed NCCL communicator buffers correctly but left residual
non-group parallel_state globals set, causing the subsequent reload to
crash deep inside Megatron:
AssertionError: global memory buffer is already initialized
nccl_offload.py:288 reload_megatron_nccl_groups()
parallel_state.py:1925 _set_global_memory_buffer()
Root cause: _reset_parallel_state_globals only nulls attrs matching the
_*_GROUP* heuristic, but Megatron also tracks ~25 non-group module
globals (_GLOBAL_MEMORY_BUFFER, _MPU_*_RANK/WORLD_SIZE,
_VIRTUAL_PIPELINE_*_RANK, _CONTEXT_PARALLEL_GLOBAL_RANKS, etc.) that
must be reset for a subsequent initialize_model_parallel() to succeed.
Enumerating that list ourselves would drift every Megatron upgrade.
1. Manual mode now delegates residual state reset to Megatron
After the per-group destroy_process_group loop (manual mode's unique
value-add — immediate NCCL buffer release) and the group-attr walker,
_destroy_manual invokes parallel_state.destroy_model_parallel() to
let Megatron authoritatively reset its own non-group state.
Wrapped in hasattr + try/except so a missing or mis-behaving API
never masks the per-group destroys we already succeeded at.
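A hedged sketch of that delegation pattern; the helper name _reset_residual_state is illustrative, and in the commit this logic is wired into _destroy_manual rather than standing alone.

```python
# Sketch of the residual-reset delegation described above (not the verbatim source):
# after the per-group destroy loop, hand the non-group globals back to Megatron.
from megatron.core import parallel_state

def _reset_residual_state(verbose: bool = False) -> None:
    if not hasattr(parallel_state, "destroy_model_parallel"):
        return  # very old Megatron builds: nothing to delegate to
    try:
        # Authoritative reset of _GLOBAL_MEMORY_BUFFER, _MPU_* ranks/sizes, etc.
        parallel_state.destroy_model_parallel()
    except Exception as exc:  # noqa: BLE001 - never mask the per-group destroys
        if verbose:
            print(f"[nccl_offload] residual reset failed: {type(exc).__name__}: {exc}")
```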
2. Mock UT locks the new contract (4 new tests)
TestManualModeResidualReset verifies:
- destroy_model_parallel() is invoked from the manual path
- ordering: per-group destroys run BEFORE the residual reset (else
the refs get nulled and destroy_process_group has nothing to free)
- a raising destroy_model_parallel() does not abort manual mode
- missing destroy_model_parallel() (ancient Megatron) is tolerated
Also removed the now-incorrect assert_not_called() from the existing
test_default_mode_is_manual.
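A hedged sketch of the ordering assertion; the real tests are mock-based and run without a Megatron install, so they likely patch a faked parallel_state rather than the import paths used here.

```python
from unittest import mock

from nemo_rl.models.megatron.nccl_offload import destroy_megatron_nccl_groups

def test_per_group_destroy_runs_before_residual_reset():
    order = []
    with mock.patch("torch.distributed.destroy_process_group",
                    side_effect=lambda *a, **k: order.append("destroy_pg")), \
         mock.patch("megatron.core.parallel_state.destroy_model_parallel",
                    side_effect=lambda: order.append("destroy_model_parallel")):
        destroy_megatron_nccl_groups(mode="manual")
    # Per-group destroys need live group refs, so they must run before the reset.
    assert order.index("destroy_pg") < order.index("destroy_model_parallel")
```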
3. GPU test wrapper surfaces worker tracebacks on failure
Previously, when the torchrun subprocess died, pytest only saw the
generic ChildFailedError with 'error_file: <N/A>' — completely
hiding the worker's real Python traceback. _run_worker now reads
the worker's JSON report (written by _nccl_offload_gpu_worker.py)
and surfaces failure.type, failure.message, and failure.traceback
into the AssertionError.
Also added _resolve_torchrun() that finds a torchrun whose Python
actually has megatron.core, since NeMo-RL installs Megatron into a
per-worker uv venv (/opt/ray_venvs/...MegatronPolicyWorker/)
separate from the main venv's torchrun. Resolution order:
NCCL_OFFLOAD_TORCHRUN env (override), NCCL_OFFLOAD_MCORE_VENV env,
autodiscover /opt/ray_venvs/*MegatronPolicyWorker*, then PATH.
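A sketch of that resolution order; the helper name, env vars, and /opt/ray_venvs layout come from the commit text above, while the exact path-handling details are illustrative.

```python
import glob
import os
import shutil
from typing import Optional

def _resolve_torchrun() -> Optional[str]:
    override = os.environ.get("NCCL_OFFLOAD_TORCHRUN")
    if override:
        return override
    venv = os.environ.get("NCCL_OFFLOAD_MCORE_VENV")
    if venv and os.path.exists(os.path.join(venv, "bin", "torchrun")):
        return os.path.join(venv, "bin", "torchrun")
    # Autodiscover the per-worker uv venv NeMo-RL creates for the Megatron policy worker.
    candidates = sorted(glob.glob("/opt/ray_venvs/*MegatronPolicyWorker*/bin/torchrun"))
    if candidates:
        return candidates[0]
    return shutil.which("torchrun")  # fall back to whatever is on PATH
```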
Test status:
- 39 mock unit tests pass (incl. 4 new residual-reset contract tests)
- 4 GPU integration tests pass on 2x RTX 5090
(manual 1-cycle, manual 3-cycle stability, official 1-cycle,
official 3-cycle stability; VRAM drift across cycles = 0 bytes)
Made-with: Cursor
def destroy_megatron_nccl_groups(
    *,
We have two modes (manual/official) when destroying NCCL groups. Manual mode calls _destroy_manual and official mode calls _destroy_official.
_destroy_manual calls PyTorch's destroy_process_group.
_destroy_official calls Megatron's destroy_model_parallel to destroy the groups (pointers).
Both paths are verified in test_nccl_offload_gpu.py.
if verbose:
    print(
        f"[nccl_offload] destroy failed for {name}: "
        f"{type(exc).__name__}: {exc}"
It's better to raise an error here.
torch.distributed.destroy_process_group(pg) is the critical operation; if it fails, the NCCL communicator has not been properly released.
except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] official destroy_model_parallel() raised "
It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when some groups actually failed to be destroyed, which can lead to OOM.
except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] manual: residual reset via "
It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when some groups actually failed to be destroyed, which can lead to OOM.
Summary
Adds nccl_offload helpers to destroy/reload Megatron/NCCL state, plus unit and GPU tests on a Task 3-only branch.
Changes
- nemo_rl/models/megatron/nccl_offload.py — API for NCCL teardown and reload
- tests/unit/models/megatron/test_nccl_offload.py — mock/CPU unit tests
- tests/unit/models/megatron/test_nccl_offload_gpu.py + _nccl_offload_gpu_worker.py — optional GPU tests
Testing
- pytest ... test_nccl_offload.py (39 passed)
- test_nccl_offload_gpu.py (e.g. on Vast) with --mcore-only as needed