[fix] Fix placement group bundle ordering for inference engines#1308
Conversation
Ray placement groups don't guarantee bundle order - bundles on the same node may not have consecutive indices. Introduce SkyRLPlacementGroup wrapper that computes GPU-aware reordered bundle indices (sorted by node_id, gpu_id) as a cached property, and use them for all bundle indexing in both inference engine and trainer placement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
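The reordering described above can be sketched as follows. This is a minimal illustration, not SkyRL's actual implementation: the class shape, the constructor-supplied `bundle_placement` list, and the attribute names are assumptions; the real wrapper probes Ray for each bundle's node and GPU.

```python
from functools import cached_property


class SkyRLPlacementGroup:
    """Illustrative wrapper that resolves a GPU-aware bundle ordering."""

    def __init__(self, pg, bundle_placement):
        # bundle_placement: one (node_id, gpu_id) pair per original bundle
        # index, assumed to have been probed from Ray at wrap time.
        self.pg = pg
        self._bundle_placement = bundle_placement

    @cached_property
    def reordered_bundle_indices(self):
        # Sort the original bundle indices by (node_id, gpu_id) so that
        # bundles on the same node occupy consecutive positions, computed
        # once and cached thereafter.
        return sorted(
            range(len(self._bundle_placement)),
            key=lambda i: self._bundle_placement[i],
        )
```

Because `reordered_bundle_indices` is a `cached_property`, sharing one wrapper across actor groups reuses the same resolved ordering.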
- Update ServerGroup to accept SkyRLPlacementGroup consistently - Change VLLMServerActor to accept explicit bundle_indices list instead of computing contiguous range from start_bundle_idx - Fix type annotation in trainer.py for colocate_pg - Update tests to wrap raw PlacementGroups in SkyRLPlacementGroup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ping All callers now wrap raw PlacementGroups in SkyRLPlacementGroup at creation time. PPORayActorGroup, create_ray_wrapped_inference_engines, and flash_rl_engine assert the type instead of silently auto-wrapping, which avoids redundant reordering computation when the same PG is shared across multiple actor groups. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
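The "assert the type instead of silently auto-wrapping" pattern might look like the sketch below. The stub class and the `create_actor_group` function are hypothetical stand-ins for the callers named above, used only to show the shape of the check.

```python
class SkyRLPlacementGroup:
    """Stand-in stub for the real wrapper class."""

    def __init__(self, pg):
        self.pg = pg


def create_actor_group(placement_group):
    # Callers wrap the raw Ray PG exactly once at creation time; asserting
    # the type here (rather than auto-wrapping) avoids recomputing the
    # bundle reordering when the same PG is shared across actor groups.
    assert isinstance(placement_group, SkyRLPlacementGroup), (
        "expected SkyRLPlacementGroup; wrap the raw placement group "
        "at creation time"
    )
    return placement_group.pg
```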
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_offload Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a SkyRLPlacementGroup wrapper class to manage Ray placement groups, ensuring GPU-aware ordering of resource bundles. The change addresses the issue that Ray's native placement groups do not guarantee bundle order, which is crucial for consistent GPU resource allocation in distributed training and inference. The code updates replace direct PlacementGroup usage with SkyRLPlacementGroup across various modules, including inference engines, trainers, and server groups. This involves extracting the raw Ray placement group and its reordered bundle indices from the wrapper, and modifying scheduling strategies and bundle index assignments to utilize these reordered indices. Additionally, server actor initialization now accepts a list of bundle_indices instead of a starting index, providing more precise control over resource allocation.
Note: Security Review did not run due to the size of the PR.
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH this pretty much lgtm, but let's merge #1300 first since there's some overlap/merge conflicts to resolve?
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…ndant GPU probing SkyRLPlacementGroup now stores the full (bundle_idx, node_id, gpu_id) mapping via lazy _get_bundle_placement(), exposing bundle_gpu_ids and bundle_node_ids properties. This eliminates get_gpu_ids_for_pg_bundles and get_reordered_bundle_indices (dead code), and simplifies mp backend GPU ID lookup in both old (ray_wrapped_inference_engine) and new (ServerGroup) stacks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
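The lazy placement lookup described in this commit can be sketched as below. This is a hedged illustration: injecting `probe_fn` through the constructor is an assumption made for self-containment, whereas the real code presumably launches probe tasks on each bundle to read its node ID and visible GPU.

```python
class ResolvedPlacementGroup:
    """Illustrative sketch of lazy (node_id, gpu_id) resolution per bundle."""

    def __init__(self, pg, probe_fn):
        self.pg = pg
        self._probe_fn = probe_fn  # returns [(node_id, gpu_id), ...] per bundle
        self._placement = None

    def _get_bundle_placement(self):
        # Probe at most once; every derived property shares the cached result,
        # which is what eliminates the redundant GPU probing.
        if self._placement is None:
            self._placement = self._probe_fn()
        return self._placement

    @property
    def bundle_node_ids(self):
        return [node_id for node_id, _ in self._get_bundle_placement()]

    @property
    def bundle_gpu_ids(self):
        return [gpu_id for _, gpu_id in self._get_bundle_placement()]
```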
```diff
     num_servers: int,
     start_port: int = 8000,
-    placement_group: Optional[PlacementGroup] = None,
+    placement_group: Optional[SkyRLPlacementGroup] = None,
```
Can we call this OrderedPlacementGroup?
Hm I don't think that name would be right given that the underlying pg object is still "unordered" and the wrapper class here is simply resolving the physical ordering of the bundles. OrderedPlacementGroup makes it sound like we have additional guarantees on the pg object
I've renamed to ResolvedPlacementGroup - this should be better.
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Code Review
This pull request introduces a ResolvedPlacementGroup wrapper to address incorrect bundle ordering in Ray placement groups, which is a critical fix for ensuring correct GPU-aware resource allocation. The new wrapper cleverly caches bundle placement information, improving both correctness and efficiency. The changes are consistently applied across the codebase, including inference engines, trainers, and tests. The overall implementation is robust and well-designed. I have one minor suggestion to improve code clarity in server_group.py.
```python
self._external_pg = placement_group

# Extract the raw PG, reordered indices, and GPU IDs from ResolvedPlacementGroup.
if placement_group is not None:
    self._external_pg = placement_group.pg
    self._reordered_bundle_indices = placement_group.reordered_bundle_indices
    self._bundle_gpu_ids = placement_group.bundle_gpu_ids
else:
    self._external_pg = None
    self._reordered_bundle_indices = None
    self._bundle_gpu_ids = None
```
This block can be simplified to improve clarity and remove redundancy. The assignment `self._external_pg = placement_group` on line 90 is immediately overwritten within the `if` block, and the `else` block repeats the default `None` assignments. Initializing these attributes to `None` before the `if` block would make the logic more straightforward.
Suggested change:

```diff
-self._external_pg = placement_group
-# Extract the raw PG, reordered indices, and GPU IDs from ResolvedPlacementGroup.
-if placement_group is not None:
-    self._external_pg = placement_group.pg
-    self._reordered_bundle_indices = placement_group.reordered_bundle_indices
-    self._bundle_gpu_ids = placement_group.bundle_gpu_ids
-else:
-    self._external_pg = None
-    self._reordered_bundle_indices = None
-    self._bundle_gpu_ids = None
+self._external_pg = None
+self._reordered_bundle_indices = None
+self._bundle_gpu_ids = None
+if placement_group is not None:
+    self._external_pg = placement_group.pg
+    self._reordered_bundle_indices = placement_group.reordered_bundle_indices
+    self._bundle_gpu_ids = placement_group.bundle_gpu_ids
```
Fixes #1306
Ray placement groups don't guarantee bundle order - bundles on the same node may not have consecutive indices.
This PR introduces a `ResolvedPlacementGroup` wrapper that computes GPU-aware reordered bundle indices (sorted by node_id, gpu_id) as a cached property, and uses them for all bundle indexing in both inference engine and trainer placement.