chore(recipes): bump kai-scheduler v0.14.1 and kubeflow-trainer 2.2.0#720
Conversation
5dcd08b to
1faade0
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughUpdated Helm chart sources and versions across recipes, overlays, examples, and tests: Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
1faade0 to
df0b237
Compare
df0b237 to
c86e0ad
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/registry.yaml`:
- Around line 362-364: Update the Helm test fixtures in helm_test.go so they
match the registry defaults: replace any occurrences of the old version string
"v0.13.0" with "v0.14.1" and replace the old repository
"oci://ghcr.io/nvidia/kai-scheduler" with
"oci://ghcr.io/kai-scheduler/kai-scheduler"; ensure the test expectations
(fixture YAML/strings that reference defaultRepository and defaultVersion) and
any assertions in pkg/bundler/deployer/helm/helm_test.go reflect these new
values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3364be62-5311-40fb-8877-4651ee2b90a4
📒 Files selected for processing (7)
docs/contributor/component.mddocs/contributor/data.mddocs/user/component-catalog.mdexamples/recipes/aks-training.yamlrecipes/overlays/base.yamlrecipes/registry.yamlvalidators/performance/trainer_lifecycle.go
c86e0ad to
7b0951e
Compare
7b0951e to
dc1e70f
Compare
dc1e70f to
b6ec050
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@demos/cuj1-gke.md`:
- Around line 115-125: The toleration values in the TrainJob YAML snippet are
inconsistent with the rest of the CUJ: under the tolerations block (keys
"tolerations", "key", "operator", "value", "effect" alongside "nodeSelector:
nodeGroup: gpu-worker") replace both occurrences of value: worker-workload with
value: gpu-workload so the taint used by the TrainJob matches the gpu-workload
taint referenced earlier in the guide (snapshot/bundle/validate commands).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: aa94ea30-c0ef-4bd8-af1d-b0cc1dd6c4c0
📒 Files selected for processing (12)
demos/cuj1-eks.mddemos/cuj1-gke.mddocs/contributor/component.mddocs/contributor/data.mddocs/user/component-catalog.mdexamples/recipes/aks-training.yamlpkg/bundler/deployer/helm/helm_test.gopkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.envpkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.mdrecipes/overlays/base.yamlrecipes/registry.yamlvalidators/performance/trainer_lifecycle.go
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships two breaking API changes that touch AICR: 1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow). 2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime. Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
b6ec050 to
2a22f17
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
The `nvsentinel` registry entry declared:
defaultRepository: https://helm.ngc.nvidia.com/nvidia
defaultChart: nvidia/nvsentinel
But the chart isn't published to the HTTPS NGC index — only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.
Update the defaults to match what every overlay already uses:
defaultRepository: oci://ghcr.io/nvidia
defaultChart: nvsentinel
Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:
$ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
Pulled.
$ aicr bundle -r recipe.yaml -o /tmp/bundle
... generates upstream.env with
CHART='oci://ghcr.io/nvidia/nvsentinel'
REPO=''
VERSION='v1.3.0'
Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged — those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.
Refs: NVIDIA#698 (Phase 1 follow-up NVIDIA#2)
The `nvsentinel` registry entry declared:
defaultRepository: https://helm.ngc.nvidia.com/nvidia
defaultChart: nvidia/nvsentinel
But the chart isn't published to the HTTPS NGC index — only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.
Update the defaults to match what every overlay already uses:
defaultRepository: oci://ghcr.io/nvidia
defaultChart: nvsentinel
Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:
$ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
Pulled.
$ aicr bundle -r recipe.yaml -o /tmp/bundle
... generates upstream.env with
CHART='oci://ghcr.io/nvidia/nvsentinel'
REPO=''
VERSION='v1.3.0'
Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged — those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.
Refs: NVIDIA#698 (Phase 1 follow-up NVIDIA#2)
Summary
Two Phase-2 follow-ups from #698, batched into one PR because both are small chart-pin changes coupled to a single non-pin tweak each. Also includes minor docs cleanup picked up alongside.
kai-schedulerv0.13.0v0.14.1oci://ghcr.io/nvidia/kai-scheduler→oci://ghcr.io/kai-scheduler/kai-scheduler)kubeflow-trainer2.1.02.2.0validators/performance/trainer_lifecycle.goMotivation / Context
Both bumps were excluded from the Phase-1 PR (#715) for clean reasons:
NVIDIA/tokai-scheduler/org, chart publishing moved with it. The oldghcr.io/nvidia/kai-schedulernamespace is frozen atv0.13.0; the full release stream lives atghcr.io/kai-scheduler/kai-scheduler. AICR's recipe was still pointing at the frozen old path. This is an OCI source migration plus a version bump — coupled changes that belong together.recipes/registry.yamlis coupled with the hardcoded fallback archive URL invalidators/performance/trainer_lifecycle.go. The validator's no-CRD install path downloads a hardcodedv2.1.0GitHub archive; if we bump the chart pin without bumping the URL, the fallback installsv2.1.0manifests while the chart deploysv2.2.0. To keep chore(recipes): bump 6 components to upstream latest (phase 1) #715 pure config / no Go changes, this was deferred.kai-scheduler — verified clean
v0.13.0values.yamladdition is an opt-invpa:block (enabled: falsedefault)global.tolerations,admission.gpuPodRuntimeClassName,postCleanup.enabled) all still apply unchangedhelm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1succeedskubeflow-trainer — verified clean
manifests/overlays/managerkustomize, sametrainjobs.trainer.kubeflow.org/v1alpha1CRD identity, samekubeflow-systemnamespaceoci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0verified pullableCompanion fixes
examples/recipes/aks-training.yaml— refresh the kai-scheduler example pin (source URL + version) to track the new registry default. Matches the example-pin convention from chore(recipes): bump 6 components to upstream latest (phase 1) #715 (chore(recipes): bump kube-prometheus-stack, prometheus-adapter, kai-scheduler, nvsentinel #283, chore: update skyhook to latest version #336, feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling #450) — only this one example references kai-scheduler.docs/user/component-catalog.md— update the KAI Scheduler upstream link fromgithub.com/NVIDIA/KAI-Schedulerto the newgithub.com/kai-scheduler/KAI-Scheduler(the upstream GitHub repo migrated alongside the OCI registry).docs/contributor/data.mdanddocs/contributor/component.md— update illustrative cert-manager YAML/JSON snippets fromv1.17.2→v1.20.2to match the post-chore(recipes): bump 6 components to upstream latest (phase 1) #715 registry default. These were tagged as cosmetic-drift in the chore(recipes): bump 6 components to upstream latest (phase 1) #715 cross-review and deferred; rolling them in here.Fixes: N/A
Related: #698 (follow-up items 3 and 5), follows-up #715 (Phase-1 PR)
Type of Change
Component(s) Affected
pkg/recipe) — registry default + base overlay pinspkg/validator) — fallback archive URL invalidators/performance/trainer_lifecycle.godocs/,examples/) — example pin refresh, KAI Scheduler upstream link, contributor doc snippet versionsImplementation Notes
PodTemplateOverridesis replaced byruntimePatches(kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the v2.2 controller no longer applies it; pods come out with no override fields. Thepytorch-mnistdemo TrainJob indemos/cuj1-eks.mdanddemos/cuj1-gke.mdis migrated to theruntimePatchesshape withmanager: aicr.nvidia.com/demoand explicit per-cluster scheduling (EKS demo:dedicated=worker-workload:NoSchedule|NoExecute; GKE demo:dedicated=gpu-workload:NoSchedule+nvidia.com/gpu=present:NoScheduleto match the rest of the GKE flow).mlPolicy.torch.numProcPerNodeis removed (kubeflow/trainer#3239) — Torch now infers parallelism from the container'snvidia.com/gpuresource limit. AICR'storch-distributedClusterTrainingRuntime inrecipes/components/kubeflow-trainer/manifests/is updated frommlPolicy.torch: { numProcPerNode: auto }tomlPolicy.torch: {}, matching the v2.2.0 reference runtime.mlPolicy.mpi.numProcPerNodeis unaffected upstream, so MPI test fixtures stay as-is.site/docs/mirror is gitignored (auto-generated fromdocs/); only the canonicaldocs/was edited.Known caveat 1 — kubeflow-trainer v2.1.0 -> v2.2.0 upgrades leave CRDs stale
Helm 3/4 does NOT upgrade CRDs on
helm upgrade— only on first install (CRDs ship in the chart'scrds/directory, which Helm treats as install-only). Clusters that previously deployed kubeflow-trainer v2.1.0 retain the v2.1.0 CRD afterhelm upgradeto v2.2.0. The v2.1.0 CRD haspodTemplateOverridesbut noruntimePatches— so the migrated demo TrainJob is rejected at admission withunknown field "spec.runtimePatches"until CRDs are explicitly upgraded:Fresh deploys (via
aicr bundle+deploy.sh) are unaffected — Helm installs the v2.2.0 CRDs at first install.A follow-up improvement is to have the bundler emit CRD upgrade commands in
install.shfor charts that shipcrds/, but that's out of scope for this version-bump PR.Known caveat 2 — kai-scheduler default queue requires explicit priorityClass for Deployment-style workloads
The kai-scheduler chart (both v0.13.0 and v0.14.1) ships
default-queueanddefault-parent-queuewithgpu.quota: 0andlimit: -1. Combined with kai's pod-grouper priority-class auto-assignment, this means:priorityClassName: train(priority 50, preemptible) → can go overquota: 0becauselimit: -1is unlimited. Examples: thegang-scheduling-test.yamlinpkg/evidence/scripts/manifests/and TrainJob-driven workloads. These work out of the box.priorityClassName: inference(priority 125, non-preemptible) → blocked byquota: 0withNonPreemptibleOverQuota. These fail until either the workload sets an explicit preemptible priorityClass or the queue's quota is raised.Verified empirically on a deployed cluster (kai v0.14.1):
gang-scheduling-test.yamlapplied as-is on a fresh cluster → both pods scheduled, completed in ~14s (auto-trainpriority).schedulerName: kai-schedulerand DRAResourceClaimTemplaterequestinggpu.nvidia.comdevices → blocked withNonPreemptibleOverQuota(auto-inferencepriority).priorityClassName: trainadded to the pod template → both replicas scheduled, each with its own H100 via DRA. Quota stayed at 0; no chart values change required.Workaround for Deployment-style kai workloads: set
priorityClassName: train(orbuild/build-preemptible, depending on workload semantics) on the pod template. The chart ships these priorityClasses intemplates/priorityclasses/.This is pre-existing kai/AICR behavior — the v0.13.0 chart shipped the same defaults; it's surfaced now because PR #720 is the first time the kai-scheduler bump is being explicitly tested with a
schedulerName: kai-schedulerDeployment-style workload. A follow-up improvement could either (a) document this prominently indocs/user/...or (b) overridedefaultQueue.parentResources.gpu.quotainrecipes/components/kai-scheduler/values.yamlso inference-style workloads work without per-workload priorityClass tagging. Out of scope for this version-bump PR.Testing
Risk Assessment
Rollout notes: No migration steps. Bundles regenerated post-merge will pull from the new kai-scheduler OCI namespace and the new kubeflow-trainer chart version. Existing installations are unaffected until re-bundled. The validator fallback path will install the v2.2.0 trainer if invoked on a cluster without the chart pre-installed.
Checklist
make testwith-race)make lint)git commit -S)