
Reservations capacity assessment #1057

Merged
jamOne- merged 35 commits into AI-Hypercomputer:main from jamOne-:dynamic-reservation-capacity
Feb 25, 2026

jamOne- commented Feb 13, 2026

Description

The change enhances how XPK handles reservations during cluster creation.

Important: if the new validation logic blocks you in an unexpected way, please let us know!
To disable the new behavior, set the RESERVATIONS_VALIDATION_ENABLED=False environment variable when running xpk:

RESERVATIONS_VALIDATION_ENABLED=False xpk ...
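
For readers wondering how such a switch is typically read on the Python side, here is a minimal sketch. The helper name `reservations_validation_enabled` and the case-insensitive comparison are assumptions for illustration; this PR description does not show how XPK actually parses the variable.

```python
import os

# Hypothetical helper, not XPK's actual code. Assumes validation is on by
# default and is turned off only by an explicit "False" (any casing).
def reservations_validation_enabled(environ=os.environ) -> bool:
    raw = environ.get("RESERVATIONS_VALIDATION_ENABLED", "True")
    return raw.strip().lower() != "false"


if __name__ == "__main__":
    # Reports whether the current environment would keep validation enabled.
    print(reservations_validation_enabled())
```

Under these assumptions, an unset variable keeps the new validation enabled, so the opt-out has to be explicit.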

Key Improvements for Users:

  • Smarter Reservation Allocation: Instead of blindly cycling through provided
    reservations, XPK now dynamically assesses the actual available capacity (slices/blocks)
    within each reservation. It allocates these available slots to the requested node pools,
    ensuring better utilization.
  • Pre-flight Validation: XPK now validates that your reservations match the requested
    system configuration (e.g., correct machine type, accelerator type) and have sufficient
    healthy capacity before starting the cluster creation process. Problems surface up
    front instead of as failures partway through cluster creation.
  • Enhanced Super-Slicing Support: For super-slicing workloads, the tool can now
    intelligently identify and target healthy sub-blocks within a reservation block.
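
The difference between cycling through reservations and allocating by available capacity can be sketched as follows. This is an illustrative model only: `ReservationCapacity`, `assign_node_pools`, and the reservation names are hypothetical and do not come from the XPK code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReservationCapacity:
    """Hypothetical view of one reservation (or block / sub-block)."""
    name: str
    available_slots: int  # healthy, unused slices or blocks


def assign_node_pools(capacities, num_node_pools):
    """Return one reservation name per node pool, filling each reservation
    only up to the capacity it actually has available."""
    total = sum(c.available_slots for c in capacities)
    if total < num_node_pools:
        raise ValueError(
            f"need {num_node_pools} slots, reservations only offer {total}"
        )
    assignment = []
    for cap in capacities:
        # Take as many slots as this reservation has, but no more than needed.
        take = min(cap.available_slots, num_node_pools - len(assignment))
        assignment.extend([cap.name] * take)
    return assignment


if __name__ == "__main__":
    caps = [ReservationCapacity("res-a", 1), ReservationCapacity("res-b", 4)]
    print(assign_node_pools(caps, 3))  # ['res-a', 'res-b', 'res-b']
```

In this example a blind round-robin would place two node pools on `res-a`, which has only one free slot; the capacity-aware version sends the overflow to `res-b`.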

Next step: we'll make --num-slices / --num-cubes optional when using the reservation capacity type.

Issue

Testing

Manual testing scenarios:

  1. New cluster, super-slicing, 3 slices, reservations passed: subblock, block.
  2. Existing cluster with 3 slices, super-slicing, +1 slice, reservations passed: partially taken subblock, subblock (xpk should fail).
  3. Existing cluster with 3 slices, super-slicing, +1 slice, reservations passed: subblock.
  4. Existing cluster with 4 slices, super-slicing, +1 slice, reservations passed: reservation.
  5. New cluster, GPU, --num-nodes=9999, reservations passed: GPU reservation (xpk should fail).
  6. New cluster, GPU slice, reservations passed: TPU reservation (xpk should fail).
  7. New cluster, GPU, 2 nodes, reservations passed: GPU reservation.
  8. New cluster, TPU, 2 slices, reservations passed: TPU reservation.
  9. New cluster, TPU, 2 slices, reservations passed: different TPU machine reservation.
  10. New cluster CPU reservations.
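
Scenarios 5 and 6 above are expected to fail before cluster creation starts. A rough sketch of the kind of pre-flight check that would reject them is shown below; the `Reservation` type, the `preflight_errors` function, and all field names and sample capacities are hypothetical and only mirror the behavior described in this PR, not XPK's real implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """Hypothetical, simplified reservation record."""
    name: str
    accelerator: str        # e.g. "gpu" or "tpu"
    healthy_available: int  # healthy, unused nodes or slices


def preflight_errors(reservations, accelerator, requested):
    """Collect every problem found; an empty list means the request passes."""
    errors = []
    for r in reservations:
        if r.accelerator != accelerator:
            errors.append(
                f"{r.name}: reservation is for {r.accelerator}, "
                f"cluster requests {accelerator}"
            )
    # Only capacity from matching reservations counts toward the request.
    usable = sum(
        r.healthy_available for r in reservations if r.accelerator == accelerator
    )
    if usable < requested:
        errors.append(f"requested {requested}, only {usable} available")
    return errors


if __name__ == "__main__":
    gpu_res = Reservation("gpu-res", "gpu", 16)
    tpu_res = Reservation("tpu-res", "tpu", 4)
    print(preflight_errors([gpu_res], "gpu", 9999))  # like scenario 5: too large
    print(preflight_errors([tpu_res], "gpu", 1))     # like scenario 6: wrong type
    print(preflight_errors([gpu_res], "gpu", 2))     # like scenario 7: passes
```

Collecting all errors rather than stopping at the first one is a design choice of this sketch: it lets a user fix every reservation problem in one pass.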

jamOne- changed the title from "Dynamic reservation capacity" to "Dynamic reservation capacity calculation" Feb 19, 2026
jamOne- changed the title from "Dynamic reservation capacity calculation" to "Reservations capacity assessment" Feb 19, 2026
jamOne- added the release-features features label Feb 20, 2026
jamOne- marked this pull request as ready for review February 20, 2026 12:22
jamOne- added this pull request to the merge queue Feb 25, 2026
Merged via the queue into AI-Hypercomputer:main with commit afa1810 Feb 25, 2026
21 checks passed
jamOne- deleted the dynamic-reservation-capacity branch February 25, 2026 14:41
