Add Kueue default config for Dynamic Slicing #5693

Open
FIoannides wants to merge 4 commits into GoogleCloudPlatform:develop from FIoannides:feature/dynamic-slicing-kueue

@FIoannides FIoannides commented May 20, 2026

Introduce enable_dynamic_slicing_for_tpus, adding:

  • Two more default Kueue configs,
  • Controller overrides defaults,
  • Validations.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the GKE TPU workload management by integrating support for Kueue dynamic slicing. It updates the infrastructure configuration to allow users to enable dynamic slicing via new variables, introduces dedicated configuration templates, and adds robust validation logic to ensure that the required cluster settings are correctly configured for this feature.

Highlights

  • Dynamic Slicing Support: Added support for GKE TPU dynamic slicing in Kueue configurations, allowing for more flexible resource allocation.
  • Configuration Templates: Introduced new Terraform templates for Kueue configuration to support dynamic slicing, both independently and in combination with Pathways.
  • Validation Logic: Implemented Terraform preconditions to ensure correct configuration of accelerator topology mode, slice controller, and machine types when dynamic slicing is enabled.
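
As an illustration of the validation logic described in the last highlight, a precondition of this kind might look like the minimal sketch below. Only `enable_dynamic_slicing_for_tpus` is named in this PR; the variable `enable_slice_controller` and the resource name are hypothetical stand-ins, and `terraform_data` assumes Terraform 1.4 or later. The PR's actual checks may be structured differently.

```hcl
variable "enable_dynamic_slicing_for_tpus" {
  description = "Toggle named in this PR; the default here is an assumption."
  type        = bool
  default     = false
}

variable "enable_slice_controller" {
  description = "Hypothetical name for the slice controller setting."
  type        = bool
  default     = false
}

# A no-op resource used only to carry the check (hypothetical name).
resource "terraform_data" "dynamic_slicing_checks" {
  lifecycle {
    precondition {
      # Passes when dynamic slicing is off, or when the controller is on.
      condition     = !var.enable_dynamic_slicing_for_tpus || var.enable_slice_controller
      error_message = "enable_slice_controller must be true when enable_dynamic_slicing_for_tpus is true."
    }
  }
}
```

With both variables left at their defaults the condition holds; setting only `enable_dynamic_slicing_for_tpus = true` would fail the plan with the error message above.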

@github-actions github-actions Bot added the external label (PR from external contributor) May 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for TPU dynamic slicing in GKE by adding new Kueue configuration templates and updating the kubectl-apply module with selection logic and validation rules. Review feedback identifies a need to broaden the machine_type validation regex to include ct types, suggests refactoring nested ternary expressions into a map lookup for improved maintainability, and recommends removing hardcoded TPU partition labels in the templates to support varied topologies.
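
The map-lookup refactor suggested in this review could be sketched as follows. The variable `enable_pathways` and all template file names are hypothetical, chosen only to show the shape of the change; the real selection logic lives in modules/management/kubectl-apply/main.tf and may differ.

```hcl
variable "enable_dynamic_slicing_for_tpus" {
  type    = bool
  default = false
}

variable "enable_pathways" {
  # Hypothetical name for the Pathways toggle.
  type    = bool
  default = false
}

locals {
  # Nested ternary form: harder to read and extend.
  kueue_template_nested = (
    var.enable_dynamic_slicing_for_tpus
    ? (var.enable_pathways ? "dynamic-slicing-pathways.yaml.tftpl" : "dynamic-slicing.yaml.tftpl")
    : (var.enable_pathways ? "pathways.yaml.tftpl" : "default.yaml.tftpl")
  )

  # Map lookup form: one entry per combination of the two flags.
  # Booleans are converted to "true"/"false" when interpolated into a string.
  kueue_template_by_mode = {
    "true-true"   = "dynamic-slicing-pathways.yaml.tftpl"
    "true-false"  = "dynamic-slicing.yaml.tftpl"
    "false-true"  = "pathways.yaml.tftpl"
    "false-false" = "default.yaml.tftpl"
  }

  kueue_template_key = "${var.enable_dynamic_slicing_for_tpus}-${var.enable_pathways}"
  kueue_template     = local.kueue_template_by_mode[local.kueue_template_key]
}
```

Both locals resolve to the same file name for every flag combination; the map form makes adding a further mode a one-line change.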

Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/main.tf Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for TPU Dynamic Slicing across GKE blueprints and the kubectl-apply management module, adding new Kueue configuration templates and validation logic. Feedback identifies that some blueprint settings reference non-existent module outputs, which should be resolved by defining them in the source modules to support automatic wiring. Additionally, the regex for TPU machine type validation is too restrictive and needs expansion, while hardcoded node labels in the templates should be parameterized for reusability. A redundant try() function in the Terraform configuration was also noted for removal.
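
A broadened machine type check of the kind requested here might look like the sketch below. The variable name, default value, and regular expression are assumptions for illustration, not the pattern used in the PR; the expression accepts names that start with either `ct` or `tpu` followed by a generation number.

```hcl
variable "machine_type" {
  description = "Hypothetical TPU machine type input."
  type        = string
  default     = "tpu7x-standard-4t"

  validation {
    # can() turns a regex() mismatch error into false instead of failing evaluation.
    # Matches prefixes such as "ct6e-" and "tpu7x-".
    condition     = can(regex("^(ct|tpu)[0-9]+[a-z]*-", var.machine_type))
    error_message = "machine_type must be a TPU machine type starting with \"ct\" or \"tpu\"."
  }
}
```

Under this assumed pattern, values such as `ct6e-standard-4t` and `tpu7x-standard-4t` pass, while a non-TPU type such as `n2-standard-8` is rejected.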

Comment thread examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml Outdated
Comment thread examples/gke-tpu-v6e/gke-tpu-v6e-advanced.yaml Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/main.tf Outdated
@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 8237060 to 48a7548 Compare May 20, 2026 13:34
Comment thread examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread modules/management/kubectl-apply/main.tf Outdated
Comment thread modules/management/kubectl-apply/main.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread examples/gke-tpu-v6e/gke-tpu-v6e-advanced.yaml
@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch 6 times, most recently from 1bbae7d to 7a8e804 Compare May 20, 2026 15:50
@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 7a8e804 to b646e94 Compare May 20, 2026 15:55
Comment thread modules/management/kubectl-apply/variables.tf
@FIoannides FIoannides marked this pull request as ready for review May 21, 2026 10:17
@FIoannides FIoannides requested a review from a team as a code owner May 21, 2026 10:17
@Neelabh94 Neelabh94 added the release-improvements label (Added to release notes under the "Improvements" heading.) May 26, 2026
Comment thread examples/gke-tpu-7x/gke-tpu-7x.yaml
Comment thread modules/management/kubectl-apply/variables.tf
@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 36a45df to 091f67c Compare May 29, 2026 11:05
@FIoannides FIoannides requested a review from Neelabh94 May 29, 2026 11:08

jamOne- commented Jun 1, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for Dynamic Slicing for TPUs in GKE clusters by adding configuration variables, updating the Kueue installation module, and introducing new Kueue configuration templates. The review feedback highlights a few critical issues: first, the use of double dollar signs ($$) in the dynamic slicing template escapes interpolation and must be corrected to single dollar signs ($); second, the TPU partition topology label is hardcoded to 4x4x4 in both new templates, which should be made dynamic to support other topologies; and finally, the hardcoded validation check for tpu7x machine types in variables.tf should be removed to avoid unnecessary maintenance overhead.
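
The first point, the difference between `$${` and `${`, can be shown with a minimal sketch. The local names and the label text are hypothetical; the same escaping rule applies to quoted strings and to files rendered with `templatefile`.

```hcl
locals {
  # Hypothetical value; in the PR the topology would come from module inputs.
  tpu_topology = "4x4x4"

  # "$${" is an escape sequence: the result is the literal text
  # "topology: ${local.tpu_topology}", with no substitution.
  escaped_label = "topology: $${local.tpu_topology}"

  # "${" starts an interpolation: the result is "topology: 4x4x4".
  interpolated_label = "topology: ${local.tpu_topology}"
}

output "escaped_label" {
  value = local.escaped_label
}

output "interpolated_label" {
  value = local.interpolated_label
}
```

Passing the topology in as a value, rather than writing `4x4x4` into the template, also addresses the second point about hardcoded labels.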

Comment thread modules/management/kubectl-apply/variables.tf Outdated
@Neelabh94

/gcbrun

1 similar comment
@Neelabh94

/gcbrun

@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 511ec7e to 8ea7784 Compare June 2, 2026 07:37
@Neelabh94

/gcbrun

Neelabh94 previously approved these changes Jun 2, 2026

@Neelabh94 Neelabh94 left a comment

nit comments, otherwise LGTM

Comment thread examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml Outdated
Comment thread examples/gke-tpu-7x/gke-tpu-7x.yaml Outdated
@Neelabh94

/gcbrun

@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from f3a4765 to 8c0c91d Compare June 2, 2026 13:33
@Neelabh94

/gcbrun

Neelabh94 previously approved these changes Jun 2, 2026

@Neelabh94 Neelabh94 left a comment

LGTM

Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread examples/gke-tpu-7x/gke-tpu-7x.yaml Outdated

Neelabh94 commented Jun 3, 2026

The PR test for TPU7x also failed with the following error:

```
Step #1 - "gke-tpu-7x": Error: module "workload-manager-install" uses module "gke-tpu-7x-pool", but matching setting and outputs were not found. This may be because the value is set explicitly or set by a prior used module
Step #1 - "gke-tpu-7x": 168:     use: [gke-tpu-7x-cluster, gke-tpu-7x-pool]
Step #1 - "gke-tpu-7x":                                    ^
Step #1 - "gke-tpu-7x": 2026-06-03T08:36:33Z: One or more blueprint validators has failed. See messages above for suggested
Step #1 - "gke-tpu-7x": actions. General troubleshooting guidance and instructions for configuring validators are shown below.
Step #1 - "gke-tpu-7x": 
Step #1 - "gke-tpu-7x": - https://goo.gle/hpc-toolkit-troubleshooting
Step #1 - "gke-tpu-7x": - https://goo.gle/hpc-toolkit-validation
Step #1 - "gke-tpu-7x": 
Step #1 - "gke-tpu-7x": Validators can be silenced or treated as warnings or errors:
Step #1 - "gke-tpu-7x": 
Step #1 - "gke-tpu-7x": - https://goo.gle/hpc-toolkit-validation-levels
Step #1 - "gke-tpu-7x": 2026-06-03T08:36:33Z: Validation failed due to the issues listed above
```


Link for complete logs: https://pantheon.corp.google.com/cloud-build/builds/46acf185-373e-42bd-837d-eeac3c60ca90?project=hpc-toolkit-dev&e=-13802955

Please resolve this as well.

@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch 3 times, most recently from 060e5d0 to 4046060 Compare June 3, 2026 13:52
@FIoannides

Should be fixed now

@Neelabh94

/gcbrun

@FIoannides FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 4046060 to 76d9ade Compare June 3, 2026 18:25
@SwarnaBharathiMantena

/gcbrun
