Add Kueue default config for Dynamic Slicing #5693
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances GKE TPU workload management by adding support for Kueue dynamic slicing. It updates the infrastructure configuration so users can enable dynamic slicing via new variables, introduces dedicated configuration templates, and adds validation logic to check that the required cluster settings are configured for this feature.
Code Review
This pull request introduces support for TPU dynamic slicing in GKE by adding new Kueue configuration templates and updating the kubectl-apply module with selection logic and validation rules. Review feedback identifies a need to broaden the machine_type validation regex to include ct types, suggests refactoring nested ternary expressions into a map lookup for improved maintainability, and recommends removing hardcoded TPU partition labels in the templates to support varied topologies.
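The map-lookup refactor suggested in this review could look roughly like the sketch below. It is an illustration under stated assumptions, not the module's actual code: `kueue_mode`, the mode names, and the template file names are all hypothetical and do not come from the PR.

```hcl
# Sketch only: every name below is a placeholder, not taken from the kubectl-apply module.
variable "kueue_mode" {
  description = "Hypothetical selector for which default Kueue config to render."
  type        = string
  default     = "default"
}

locals {
  # Nested ternary form: each new mode adds another level of nesting.
  kueue_template_nested = (
    var.kueue_mode == "tpu_dynamic_slicing" ? "kueue-dynamic-slicing.yaml.tftpl" :
    var.kueue_mode == "tpu" ? "kueue-tpu.yaml.tftpl" :
    "kueue-default.yaml.tftpl"
  )

  # Map form: adding a mode is a one-line change to the table.
  kueue_templates = {
    default             = "kueue-default.yaml.tftpl"
    tpu                 = "kueue-tpu.yaml.tftpl"
    tpu_dynamic_slicing = "kueue-dynamic-slicing.yaml.tftpl"
  }

  # lookup() returns the third argument when the key is missing from the map.
  kueue_template = lookup(local.kueue_templates, var.kueue_mode, local.kueue_templates["default"])
}

output "kueue_template" {
  value = local.kueue_template
}
```

Both locals select the same file for the three modes shown; the map form keeps the selection logic in one table, which is the maintainability gain the review points at.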
Code Review
This pull request introduces support for TPU Dynamic Slicing across GKE blueprints and the kubectl-apply management module, adding new Kueue configuration templates and validation logic. Feedback identifies that some blueprint settings reference non-existent module outputs, which should be resolved by defining them in the source modules to support automatic wiring. Additionally, the regex for TPU machine type validation is too restrictive and needs expansion, while hardcoded node labels in the templates should be parameterized for reusability. A redundant try() function in the Terraform configuration was also noted for removal.
8237060 to 48a7548
1bbae7d to 7a8e804
7a8e804 to b646e94
36a45df to 091f67c
/gemini review
Code Review
This pull request introduces support for Dynamic Slicing for TPUs in GKE clusters by adding configuration variables, updating the Kueue installation module, and introducing new Kueue configuration templates. The review feedback highlights a few critical issues: first, the use of double dollar signs ($$) in the dynamic slicing template escapes interpolation and must be corrected to single dollar signs ($); second, the TPU partition topology label is hardcoded to 4x4x4 in both new templates, which should be made dynamic to support other topologies; and finally, the hardcoded validation check for tpu7x machine types in variables.tf should be removed to avoid unnecessary maintenance overhead.
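The `$$` issue raised here follows from Terraform's string template syntax, where `${...}` is interpolated and `$${...}` is an escape that emits a literal `${...}`. The sketch below shows the difference and, at the same time, one way a hardcoded `4x4x4` could become a parameter. The variable name `tpu_topology` and the label key `example.com/tpu-topology` are placeholders, not the names used in the PR's templates.

```hcl
# Sketch only: the variable name and label key are assumptions for illustration.
variable "tpu_topology" {
  description = "Hypothetical parameter replacing a hardcoded topology value."
  type        = string
  default     = "4x4x4"
}

locals {
  # "${...}" is interpolated, so this renders as:
  #   example.com/tpu-topology: 4x4x4
  label_interpolated = "example.com/tpu-topology: ${var.tpu_topology}"

  # "$${...}" is an escape sequence, so this renders the literal text:
  #   example.com/tpu-topology: ${var.tpu_topology}
  label_escaped = "example.com/tpu-topology: $${var.tpu_topology}"
}

output "label_interpolated" { value = local.label_interpolated }
output "label_escaped" { value = local.label_escaped }
```

The same escaping rule applies to files rendered with `templatefile()`, so a `$$` left in a template leaves the placeholder unexpanded in the generated manifest.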
/gcbrun
/gcbrun
511ec7e to 8ea7784
/gcbrun
Neelabh94 left a comment
Nit comments, otherwise LGTM.
/gcbrun
f3a4765 to 8c0c91d
/gcbrun
The PR test for TPU7x also failed with the following error
060e5d0 to 4046060
Should be fixed now
/gcbrun
4046060 to 76d9ade
/gcbrun
Introduce enable_dynamic_slicing_for_tpus, adding:
two more default Kueue configs,
controller override defaults,
validations.