
feat: Megatron NCCL destroy/reload helpers and tests (Task 3) #4

Draft
Yankail97 wants to merge 3 commits into rlops:main from Yankail97:feat/task3-nccl-offload

Conversation


@Yankail97 Yankail97 commented Apr 26, 2026

Summary

Adds nccl_offload helpers to destroy/reload Megatron/NCCL state, plus unit and GPU tests on a Task 3-only branch.

Changes

  • nemo_rl/models/megatron/nccl_offload.py — API for NCCL teardown and reload
  • tests/unit/models/megatron/test_nccl_offload.py — mock/CPU unit tests
  • tests/unit/models/megatron/test_nccl_offload_gpu.py + _nccl_offload_gpu_worker.py — optional GPU tests

Testing

  • Local: pytest ... test_nccl_offload.py (39 passed)
  • GPU: run test_nccl_offload_gpu.py (e.g. on Vast) with --mcore-only as needed
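
A minimal usage sketch of the two helpers, assuming the API described in this PR (the snapshot field name follows the stats dict documented in the first commit below; the helpers are not wired into any training loop here):

    from nemo_rl.models.megatron import nccl_offload

    stats = nccl_offload.destroy_megatron_nccl_groups()   # tear down NCCL groups, free VRAM
    # ... other work runs while Megatron parallel state is torn down ...
    nccl_offload.reload_megatron_nccl_groups(stats["state_snapshot"])  # rebuild groups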

Yankai Liu added 3 commits April 26, 2026 00:57
…ASK 3

Introduces nemo_rl.models.megatron.nccl_offload with two public helpers:

  - destroy_megatron_nccl_groups(): walks megatron.core.parallel_state,
    collects every NCCL ProcessGroup (TP/PP/DP/CP/EP and hybrid list-valued
    holders), deduplicates by id, skips Gloo backends, calls
    torch.distributed.destroy_process_group on each, then nulls out the
    parallel_state global handles so model_parallel_is_initialized() returns
    False. Captures a parallel-config snapshot (tp/pp/cp/ep + optional VPP)
    for a subsequent reload call. Returns per-call stats (initialized_before,
    num_groups_destroyed, vram_freed_bytes, state_snapshot) for
    observability.

  - reload_megatron_nccl_groups(state_snapshot): rebuilds NCCL groups via
    parallel_state.initialize_model_parallel using the captured snapshot,
    followed by a barrier. Idempotent; raises if called with an empty
    snapshot.

Both helpers are idempotent, tolerate per-group destroy failures, and use
introspection (rather than a hardcoded global name list) so they adapt to
Megatron-core version changes.
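
A rough sketch of the introspection-based collection walk described above (the helper name _iter_nccl_groups and the exact filtering are illustrative, not the module's real internals):

    import torch.distributed as dist
    from megatron.core import parallel_state

    def _iter_nccl_groups():
        """Yield each unique NCCL ProcessGroup held by a parallel_state global."""
        seen = set()
        for name, value in vars(parallel_state).items():
            members = value if isinstance(value, (list, tuple)) else [value]
            for pg in members:
                if not isinstance(pg, dist.ProcessGroup) or id(pg) in seen:
                    continue                      # not a group, or already collected
                seen.add(id(pg))
                if dist.get_backend(pg) == "gloo":
                    continue                      # Gloo groups hold no NCCL resources
                yield name, pg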

These helpers are infrastructure only. They are not wired into any training
loop in this change; TASK 5 (Feature 11: DO_TIME_SHARING flag + grpo.py
branch) is responsible for invoking them at the correct phase of the
partial-overlap lifecycle. See plans/nemorl-port-plan.md Feature 11 and
task3-plan.md for context.

Tests: 27 mock-based unit tests in tests/unit/models/megatron/test_nccl_offload.py
covering group collection, dedup, Gloo filtering, snapshot round-trip,
idempotency, per-group failure tolerance, VPP handling, and the
destroy/reload API contract. All pass without GPU or a real Megatron
install.

Real GPU validation (Gate 2.5: dp=1/tp=2, 3+ destroy/reload cycles with
actual VRAM delta measurement) is out of scope for this commit and will
be added alongside the Gate 2.5 integration tests.

Made-with: Cursor
…tegration tests

Extends TASK 3 to the full Feature 11 / Gate 2.5 scope required by the
NeMo RL port plan. Two additions on top of the initial manual helper:

1. Official-mode fallback in destroy_megatron_nccl_groups()

   The port plan mandates a fallback path using Megatron's official
   parallel_state.destroy_model_parallel() that operators can switch to
   if Gate 2.5 surfaces issues with the default manual path (VRAM not
   released, stale handles, instability across cycles).

   - New Literal type NcclOffloadMode = "manual" | "official"
   - destroy_megatron_nccl_groups(mode=...) selects the path
   - Stats now carry a "mode" field and use -1 as the sentinel for
     num_groups_destroyed under the official path (that API does not
     expose a count)
   - Official path falls through to the manual path if the Megatron
     build is too old to expose destroy_model_parallel, and resets
     globals defensively if the official API raises
   - reload_megatron_nccl_groups() is mode-agnostic: both paths reload
     via parallel_state.initialize_model_parallel(**snapshot)
   - 8 new mock-based unit tests covering each official-mode branch,
     invalid mode rejection, and the manual default
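
   An illustrative sketch of the mode switch described above (the -1 sentinel and the
   public signature follow this list; _destroy_manual/_destroy_official are the private
   helpers discussed in the review thread, and their exact signatures here are assumptions):

      from typing import Any, Literal

      NcclOffloadMode = Literal["manual", "official"]

      def destroy_megatron_nccl_groups(
          *, mode: NcclOffloadMode = "manual", verbose: bool = False
      ) -> dict[str, Any]:
          if mode not in ("manual", "official"):
              raise ValueError(f"invalid nccl_offload mode: {mode!r}")
          # Official path falls back to the manual path on Megatron builds
          # that do not expose destroy_model_parallel.
          stats = (
              _destroy_official(verbose=verbose)
              if mode == "official"
              else _destroy_manual(verbose=verbose)
          )
          stats["mode"] = mode  # official path reports num_groups_destroyed = -1
          return stats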

2. GPU multi-process integration tests (Gate 2.5 NCCL lifecycle)

   Real-GPU coverage of the destroy/reload cycle. The tests spawn
   torchrun as a subprocess to drive a 2-process Megatron NCCL lifecycle
   and verify:

   - parallel_state reports !initialized after destroy
   - VRAM does not grow across destroy
   - AllReduce succeeds after each reload
   - Post-reload VRAM is stable across 3+ cycles
     (tolerance: 200 MiB — catches real leaks, avoids allocator-caching
     flakiness)

   Test layout:
   - tests/unit/models/megatron/_nccl_offload_gpu_worker.py — the
     per-rank script invoked by torchrun. Only rank 0 writes a JSON
     report; a non-zero exit is forwarded on failure.
   - tests/unit/models/megatron/test_nccl_offload_gpu.py — the pytest
     wrapper. Auto-skips when CUDA/torchrun/megatron.core is missing,
     so CI and MacBook-local pytest runs are not broken.

   Covers both mode="manual" and mode="official" so the fallback path
   is validated alongside the default.
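
   A condensed sketch of how the pytest wrapper can drive the per-rank worker via
   torchrun (flags, report path, and the worker CLI are assumptions; the real wrapper
   also auto-skips when CUDA, torchrun, or megatron.core is missing):

      import json
      import subprocess
      from pathlib import Path

      def _run_worker(torchrun: str, worker: Path, mode: str, cycles: int, tmp: Path) -> dict:
          report = tmp / "report.json"          # only rank 0 writes this
          cmd = [
              torchrun, "--nproc_per_node=2", str(worker),
              f"--mode={mode}", f"--cycles={cycles}", f"--report={report}",
          ]
          proc = subprocess.run(cmd, capture_output=True, text=True)
          assert proc.returncode == 0, f"torchrun failed:\n{proc.stdout}\n{proc.stderr}"
          return json.loads(report.read_text())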

Test status: 35 unit tests pass locally (without GPU), 4 GPU integration
tests skip cleanly locally and are expected to run green on the Vast.ai
2x GPU environment described in docs/vastai-setup.md.

Made-with: Cursor
…destroy

Fixes the real root cause that blocked Gate 2.5 on RTX 5090. The manual
destroy path freed NCCL communicator buffers correctly but left residual
non-group parallel_state globals set, causing the subsequent reload to
crash deep inside Megatron:

    AssertionError: global memory buffer is already initialized
      nccl_offload.py:288  reload_megatron_nccl_groups()
      parallel_state.py:1925  _set_global_memory_buffer()

Root cause: _reset_parallel_state_globals only nulls attrs matching the
_*_GROUP* heuristic, but Megatron also tracks ~25 non-group module
globals (_GLOBAL_MEMORY_BUFFER, _MPU_*_RANK/WORLD_SIZE,
_VIRTUAL_PIPELINE_*_RANK, _CONTEXT_PARALLEL_GLOBAL_RANKS, etc.) that
must be reset for a subsequent initialize_model_parallel() to succeed.
Enumerating that list ourselves would drift with every Megatron upgrade.

1. Manual mode now delegates residual state reset to Megatron

   After the per-group destroy_process_group loop (manual mode's unique
   value-add — immediate NCCL buffer release) and the group-attr walker,
   _destroy_manual invokes parallel_state.destroy_model_parallel() to
   let Megatron authoritatively reset its own non-group state.

   Wrapped in hasattr + try/except so a missing or misbehaving API
   never masks the per-group destroys that already succeeded.
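
   A minimal sketch of that delegation, under the behavior described above (the
   function name _reset_residual_state, its placement inside _destroy_manual, and
   the verbose plumbing are assumptions):

      def _reset_residual_state(parallel_state, verbose: bool = False) -> None:
          """Let Megatron reset its own non-group globals after the per-group destroys."""
          if not hasattr(parallel_state, "destroy_model_parallel"):
              return  # ancient Megatron: nothing to delegate to
          try:
              parallel_state.destroy_model_parallel()
          except Exception as exc:  # noqa: BLE001
              if verbose:
                  print(f"[nccl_offload] residual reset failed: {type(exc).__name__}: {exc}")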

2. Mock UT locks the new contract (4 new tests)

   TestManualModeResidualReset verifies:
   - destroy_model_parallel() is invoked from the manual path
   - ordering: per-group destroys run BEFORE the residual reset (else
     the refs get nulled and destroy_process_group has nothing to free)
   - a raising destroy_model_parallel() does not abort manual mode
   - missing destroy_model_parallel() (ancient Megatron) is tolerated

   Also removed the now-incorrect assert_not_called() from the existing
   test_default_mode_is_manual.
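
   A rough sketch of the ordering check, assuming a fixture that patches a fake
   parallel_state with mock groups into nccl_offload (the fixture name and exact
   patch targets are illustrative):

      from unittest import mock

      from nemo_rl.models.megatron.nccl_offload import destroy_megatron_nccl_groups

      def test_group_destroys_run_before_residual_reset(fake_parallel_state):
          calls = []
          with mock.patch(
              "torch.distributed.destroy_process_group",
              side_effect=lambda pg=None: calls.append("destroy_pg"),
          ), mock.patch.object(
              fake_parallel_state, "destroy_model_parallel",
              side_effect=lambda: calls.append("residual_reset"),
          ):
              destroy_megatron_nccl_groups(mode="manual")
          assert "destroy_pg" in calls
          assert calls.index("residual_reset") > calls.index("destroy_pg")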

3. GPU test wrapper surfaces worker tracebacks on failure

   Previously, when the torchrun subprocess died, pytest only saw the
   generic ChildFailedError with 'error_file: <N/A>' — completely
   hiding the worker's real Python traceback. _run_worker now reads
   the worker's JSON report (written by _nccl_offload_gpu_worker.py)
   and surfaces failure.type, failure.message, and failure.traceback
   into the AssertionError.

   Also added _resolve_torchrun() that finds a torchrun whose Python
   actually has megatron.core, since NeMo-RL installs Megatron into a
   per-worker uv venv (/opt/ray_venvs/...MegatronPolicyWorker/)
   separate from the main venv's torchrun. Resolution order:
   NCCL_OFFLOAD_TORCHRUN env (override), NCCL_OFFLOAD_MCORE_VENV env,
   autodiscover /opt/ray_venvs/*MegatronPolicyWorker*, then PATH.
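
   An illustrative sketch of that resolution order (the env-var names follow the text
   above; the glob pattern and bin/ layout are assumptions):

      import glob
      import os
      import shutil
      from pathlib import Path

      def _resolve_torchrun() -> str | None:
          override = os.environ.get("NCCL_OFFLOAD_TORCHRUN")
          if override:
              return override                              # explicit override wins
          venv = os.environ.get("NCCL_OFFLOAD_MCORE_VENV")
          roots = [venv] if venv else sorted(glob.glob("/opt/ray_venvs/*MegatronPolicyWorker*"))
          for root in roots:
              candidate = Path(root) / "bin" / "torchrun"
              if candidate.exists():
                  return str(candidate)                    # torchrun whose venv has megatron.core
          return shutil.which("torchrun")                  # fall back to PATH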

Test status:
- 39 mock unit tests pass (incl. 4 new residual-reset contract tests)
- 4 GPU integration tests pass on 2x RTX 5090
  (manual 1-cycle, manual 3-cycle stability, official 1-cycle,
   official 3-cycle stability; VRAM drift across cycles = 0 bytes)

Made-with: Cursor
@Yankail97 Yankail97 marked this pull request as draft April 26, 2026 08:38


def destroy_megatron_nccl_groups(
    *,
Author


We have two modes (manual/official) when destroying NCCL groups. Manual mode calls _destroy_manual and official mode calls _destroy_official.

_destroy_manual calls PyTorch's destroy_process_group on each collected group.
_destroy_official calls Megatron's destroy_model_parallel to destroy the group (pointer).

Both paths are verified in test_nccl_offload_gpu.py.

if verbose:
    print(
        f"[nccl_offload] destroy failed for {name}: "
        f"{type(exc).__name__}: {exc}"
Collaborator


It's better to raise an error here: torch.distributed.destroy_process_group(pg) is the critical operation, and if it fails, the NCCL communicator has not been properly released.

except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] official destroy_model_parallel() raised "
Collaborator


It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when in fact some groups failed to be destroyed, which can lead to OOM.

except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] manual: residual reset via "
Collaborator


It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when in fact some groups failed to be destroyed, which can lead to OOM.
