
feat: Megatron NCCL destroy/reload helpers and tests (Task 3) #4

Draft
Yankail97 wants to merge 3 commits into rlops:main from Yankail97:feat/task3-nccl-offload

Conversation


@Yankail97 Yankail97 commented Apr 26, 2026

Summary

Adds nccl_offload helpers to destroy/reload Megatron/NCCL state, plus unit and GPU tests on a Task 3-only branch.

Changes

  • nemo_rl/models/megatron/nccl_offload.py — API for NCCL teardown and reload
  • tests/unit/models/megatron/test_nccl_offload.py — mock/CPU unit tests
  • tests/unit/models/megatron/test_nccl_offload_gpu.py + _nccl_offload_gpu_worker.py — optional GPU tests

Testing

  • Local: pytest ... test_nccl_offload.py (39 passed)
  • GPU: run test_nccl_offload_gpu.py (e.g. on Vast) with --mcore-only as needed
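
A minimal usage sketch of the two helpers, assuming the API described in this PR (the snapshot field name follows the stats dict documented in the first commit below; the helpers are not wired into any training loop here):

    from nemo_rl.models.megatron import nccl_offload

    stats = nccl_offload.destroy_megatron_nccl_groups()   # tear down NCCL groups, free VRAM
    # ... other work runs while Megatron parallel state is torn down ...
    nccl_offload.reload_megatron_nccl_groups(stats["state_snapshot"])  # rebuild groups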

Yankai Liu added 3 commits April 26, 2026 00:57
…ASK 3

Introduces nemo_rl.models.megatron.nccl_offload with two public helpers:

  - destroy_megatron_nccl_groups(): walks megatron.core.parallel_state,
    collects every NCCL ProcessGroup (TP/PP/DP/CP/EP and hybrid list-valued
    holders), deduplicates by id, skips Gloo backends, calls
    torch.distributed.destroy_process_group on each, then nulls out the
    parallel_state global handles so model_parallel_is_initialized() returns
    False. Captures a parallel-config snapshot (tp/pp/cp/ep + optional VPP)
    for a subsequent reload call. Returns per-call stats (initialized_before,
    num_groups_destroyed, vram_freed_bytes, state_snapshot) for
    observability.

  - reload_megatron_nccl_groups(state_snapshot): rebuilds NCCL groups via
    parallel_state.initialize_model_parallel using the captured snapshot,
    followed by a barrier. Idempotent; raises if called with an empty
    snapshot.

Both helpers are idempotent, tolerate per-group destroy failures, and use
introspection (rather than a hardcoded global name list) so they adapt to
Megatron-core version changes.
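
A rough sketch of the introspection-based collection walk described above (the helper name _iter_nccl_groups and the exact filtering are illustrative, not the module's real internals):

    import torch.distributed as dist
    from megatron.core import parallel_state

    def _iter_nccl_groups():
        """Yield each unique NCCL ProcessGroup held by a parallel_state global."""
        seen = set()
        for name, value in vars(parallel_state).items():
            members = value if isinstance(value, (list, tuple)) else [value]
            for pg in members:
                if not isinstance(pg, dist.ProcessGroup) or id(pg) in seen:
                    continue                      # not a group, or already collected
                seen.add(id(pg))
                if dist.get_backend(pg) == "gloo":
                    continue                      # Gloo groups hold no NCCL resources
                yield name, pg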

These helpers are infrastructure only. They are not wired into any training
loop in this change; TASK 5 (Feature 11: DO_TIME_SHARING flag + grpo.py
branch) is responsible for invoking them at the correct phase of the
partial-overlap lifecycle. See plans/nemorl-port-plan.md Feature 11 and
task3-plan.md for context.

Tests: 27 mock-based unit tests in tests/unit/models/megatron/test_nccl_offload.py
covering group collection, dedup, Gloo filtering, snapshot round-trip,
idempotency, per-group failure tolerance, VPP handling, and the
destroy/reload API contract. All pass without GPU or a real Megatron
install.

Real GPU validation (Gate 2.5: dp=1/tp=2, 3+ destroy/reload cycles with
actual VRAM delta measurement) is out of scope for this commit and will
be added alongside the Gate 2.5 integration tests.

Made-with: Cursor
…tegration tests

Extends TASK 3 to the full Feature 11 / Gate 2.5 scope required by the
NeMo RL port plan. Two additions on top of the initial manual helper:

1. Official-mode fallback in destroy_megatron_nccl_groups()

   The port plan mandates a fallback path using Megatron's official
   parallel_state.destroy_model_parallel() that operators can switch to
   if Gate 2.5 surfaces issues with the default manual path (VRAM not
   released, stale handles, instability across cycles).

   - New Literal type NcclOffloadMode = "manual" | "official"
   - destroy_megatron_nccl_groups(mode=...) selects the path
   - Stats now carry a "mode" field and use -1 as the sentinel for
     num_groups_destroyed under the official path (that API does not
     expose a count)
   - Official path falls through to the manual path if the Megatron
     build is too old to expose destroy_model_parallel, and resets
     globals defensively if the official API raises
   - reload_megatron_nccl_groups() is mode-agnostic: both paths reload
     via parallel_state.initialize_model_parallel(**snapshot)
   - 8 new mock-based unit tests covering each official-mode branch,
     invalid mode rejection, and the manual default
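
   An illustrative sketch of the mode switch described above (the -1 sentinel and the
   public signature follow this list; _destroy_manual/_destroy_official are the private
   helpers discussed in the review thread, and their exact signatures here are assumptions):

      from typing import Any, Literal

      NcclOffloadMode = Literal["manual", "official"]

      def destroy_megatron_nccl_groups(
          *, mode: NcclOffloadMode = "manual", verbose: bool = False
      ) -> dict[str, Any]:
          if mode not in ("manual", "official"):
              raise ValueError(f"invalid nccl_offload mode: {mode!r}")
          # Official path falls back to the manual path on Megatron builds
          # that do not expose destroy_model_parallel.
          stats = (
              _destroy_official(verbose=verbose)
              if mode == "official"
              else _destroy_manual(verbose=verbose)
          )
          stats["mode"] = mode  # official path reports num_groups_destroyed = -1
          return stats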

2. GPU multi-process integration tests (Gate 2.5 NCCL lifecycle)

   Real-GPU coverage of the destroy/reload cycle. The tests spawn
   torchrun as a subprocess to drive a 2-process Megatron NCCL lifecycle
   and verify:

   - parallel_state reports !initialized after destroy
   - VRAM does not grow across destroy
   - AllReduce succeeds after each reload
   - Post-reload VRAM is stable across 3+ cycles
     (tolerance: 200 MiB — catches real leaks, avoids allocator-caching
     flakiness)

   Test layout:
   - tests/unit/models/megatron/_nccl_offload_gpu_worker.py — the
     per-rank script invoked by torchrun. Only rank 0 writes a JSON
     report; a non-zero exit is forwarded on failure.
   - tests/unit/models/megatron/test_nccl_offload_gpu.py — the pytest
     wrapper. Auto-skips when CUDA/torchrun/megatron.core is missing,
     so CI and MacBook-local pytest runs are not broken.

   Covers both mode="manual" and mode="official" so the fallback path
   is validated alongside the default.
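
   A condensed sketch of how the pytest wrapper can drive the per-rank worker via
   torchrun (flags, report path, and the worker CLI are assumptions; the real wrapper
   also auto-skips when CUDA, torchrun, or megatron.core is missing):

      import json
      import subprocess
      from pathlib import Path

      def _run_worker(torchrun: str, worker: Path, mode: str, cycles: int, tmp: Path) -> dict:
          report = tmp / "report.json"          # only rank 0 writes this
          cmd = [
              torchrun, "--nproc_per_node=2", str(worker),
              f"--mode={mode}", f"--cycles={cycles}", f"--report={report}",
          ]
          proc = subprocess.run(cmd, capture_output=True, text=True)
          assert proc.returncode == 0, f"torchrun failed:\n{proc.stdout}\n{proc.stderr}"
          return json.loads(report.read_text())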

Test status: 35 unit tests pass locally (without GPU), 4 GPU integration
tests skip cleanly locally and are expected to run green on the Vast.ai
2x GPU environment described in docs/vastai-setup.md.

Made-with: Cursor
…destroy

Fixes the real root cause that blocked Gate 2.5 on RTX 5090. The manual
destroy path freed NCCL communicator buffers correctly but left residual
non-group parallel_state globals set, causing the subsequent reload to
crash deep inside Megatron:

    AssertionError: global memory buffer is already initialized
      nccl_offload.py:288  reload_megatron_nccl_groups()
      parallel_state.py:1925  _set_global_memory_buffer()

Root cause: _reset_parallel_state_globals only nulls attrs matching the
_*_GROUP* heuristic, but Megatron also tracks ~25 non-group module
globals (_GLOBAL_MEMORY_BUFFER, _MPU_*_RANK/WORLD_SIZE,
_VIRTUAL_PIPELINE_*_RANK, _CONTEXT_PARALLEL_GLOBAL_RANKS, etc.) that
must be reset for a subsequent initialize_model_parallel() to succeed.
Enumerating that list ourselves would drift with every Megatron upgrade.

1. Manual mode now delegates residual state reset to Megatron

   After the per-group destroy_process_group loop (manual mode's unique
   value-add — immediate NCCL buffer release) and the group-attr walker,
   _destroy_manual invokes parallel_state.destroy_model_parallel() to
   let Megatron authoritatively reset its own non-group state.

   Wrapped in hasattr + try/except so a missing or misbehaving API
   never masks the per-group destroys that already succeeded.
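
   A minimal sketch of that delegation, under the behavior described above (the
   function name _reset_residual_state, its placement inside _destroy_manual, and
   the verbose plumbing are assumptions):

      def _reset_residual_state(parallel_state, verbose: bool = False) -> None:
          """Let Megatron reset its own non-group globals after the per-group destroys."""
          if not hasattr(parallel_state, "destroy_model_parallel"):
              return  # ancient Megatron: nothing to delegate to
          try:
              parallel_state.destroy_model_parallel()
          except Exception as exc:  # noqa: BLE001
              if verbose:
                  print(f"[nccl_offload] residual reset failed: {type(exc).__name__}: {exc}")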

2. Mock UT locks the new contract (4 new tests)

   TestManualModeResidualReset verifies:
   - destroy_model_parallel() is invoked from the manual path
   - ordering: per-group destroys run BEFORE the residual reset (else
     the refs get nulled and destroy_process_group has nothing to free)
   - a raising destroy_model_parallel() does not abort manual mode
   - missing destroy_model_parallel() (ancient Megatron) is tolerated

   Also removed the now-incorrect assert_not_called() from the existing
   test_default_mode_is_manual.
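
   A rough sketch of the ordering check, assuming a fixture that patches a fake
   parallel_state with mock groups into nccl_offload (the fixture name and exact
   patch targets are illustrative):

      from unittest import mock

      from nemo_rl.models.megatron.nccl_offload import destroy_megatron_nccl_groups

      def test_group_destroys_run_before_residual_reset(fake_parallel_state):
          calls = []
          with mock.patch(
              "torch.distributed.destroy_process_group",
              side_effect=lambda pg=None: calls.append("destroy_pg"),
          ), mock.patch.object(
              fake_parallel_state, "destroy_model_parallel",
              side_effect=lambda: calls.append("residual_reset"),
          ):
              destroy_megatron_nccl_groups(mode="manual")
          assert "destroy_pg" in calls
          assert calls.index("residual_reset") > calls.index("destroy_pg")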

3. GPU test wrapper surfaces worker tracebacks on failure

   Previously, when the torchrun subprocess died, pytest only saw the
   generic ChildFailedError with 'error_file: <N/A>' — completely
   hiding the worker's real Python traceback. _run_worker now reads
   the worker's JSON report (written by _nccl_offload_gpu_worker.py)
   and surfaces failure.type, failure.message, and failure.traceback
   into the AssertionError.

   Also added _resolve_torchrun() that finds a torchrun whose Python
   actually has megatron.core, since NeMo-RL installs Megatron into a
   per-worker uv venv (/opt/ray_venvs/...MegatronPolicyWorker/)
   separate from the main venv's torchrun. Resolution order:
   NCCL_OFFLOAD_TORCHRUN env (override), NCCL_OFFLOAD_MCORE_VENV env,
   autodiscover /opt/ray_venvs/*MegatronPolicyWorker*, then PATH.
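
   An illustrative sketch of that resolution order (the env-var names follow the text
   above; the glob pattern and bin/ layout are assumptions):

      import glob
      import os
      import shutil
      from pathlib import Path

      def _resolve_torchrun() -> str | None:
          override = os.environ.get("NCCL_OFFLOAD_TORCHRUN")
          if override:
              return override                              # explicit override wins
          venv = os.environ.get("NCCL_OFFLOAD_MCORE_VENV")
          roots = [venv] if venv else sorted(glob.glob("/opt/ray_venvs/*MegatronPolicyWorker*"))
          for root in roots:
              candidate = Path(root) / "bin" / "torchrun"
              if candidate.exists():
                  return str(candidate)                    # torchrun whose venv has megatron.core
          return shutil.which("torchrun")                  # fall back to PATH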

Test status:
- 39 mock unit tests pass (incl. 4 new residual-reset contract tests)
- 4 GPU integration tests pass on 2x RTX 5090
  (manual 1-cycle, manual 3-cycle stability, official 1-cycle,
   official 3-cycle stability; VRAM drift across cycles = 0 bytes)

Made-with: Cursor
@Yankail97 Yankail97 marked this pull request as draft April 26, 2026 08:38


def destroy_megatron_nccl_groups(
    *,
Author


We have two modes (manual/official) when destroying NCCL groups. Manual mode calls _destroy_manual and official mode calls _destroy_official.

_destroy_manual calls PyTorch's destroy_process_group on each collected group.
_destroy_official calls Megatron's destroy_model_parallel to destroy the group (pointer).

Both paths are verified in test_nccl_offload_gpu.py.

if verbose:
    print(
        f"[nccl_offload] destroy failed for {name}: "
        f"{type(exc).__name__}: {exc}"
Collaborator


It's better to raise an error here: torch.distributed.destroy_process_group(pg) is the critical operation, and if it fails, the NCCL communicator has not been properly released.

except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] official destroy_model_parallel() raised "
Collaborator


It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when in fact some groups failed to be destroyed, which can lead to OOM.

except Exception as exc:  # noqa: BLE001 - surface diagnostics, don't crash
    if verbose:
        print(
            "[nccl_offload] manual: residual reset via "
Collaborator


It's better to raise an error here, since this call is responsible for clearing Megatron's residual global state. If it fails, it may lead to stale-state issues during reload: the upper layer may assume the GPU communicators have been released when in fact some groups failed to be destroyed, which can lead to OOM.
