chore(recipes): bump kai-scheduler v0.14.1 and kubeflow-trainer 2.2.0 by yuanchen8911 · Pull Request #720 · NVIDIA/aicr

yuanchen8911 · 2026-04-30T17:04:12Z

Summary

Two Phase-2 follow-ups from #698, batched into one PR because both are small chart-pin changes coupled to a single non-pin tweak each. Also includes minor docs cleanup picked up alongside.

| Component | Current | New | Coupled change |
|---|---|---|---|
| `kai-scheduler` | `v0.13.0` | `v0.14.1` | OCI registry namespace migration (`oci://ghcr.io/nvidia/kai-scheduler` → `oci://ghcr.io/kai-scheduler/kai-scheduler`) |
| `kubeflow-trainer` | `2.1.0` | `2.2.0` | Validator fallback archive URL bump in `validators/performance/trainer_lifecycle.go` |

Motivation / Context

Both bumps were excluded from the Phase-1 PR (#715) for clean reasons:

kai-scheduler: when the upstream repo was transferred from the NVIDIA org to the kai-scheduler org, chart publishing moved with it. The old ghcr.io/nvidia/kai-scheduler namespace is frozen at v0.13.0; the full release stream lives at ghcr.io/kai-scheduler/kai-scheduler. AICR's recipe was still pointing at the frozen path. This is an OCI source migration plus a version bump, coupled changes that belong together.
kubeflow-trainer: the chart pin in recipes/registry.yaml is coupled with the hardcoded fallback archive URL in validators/performance/trainer_lifecycle.go. The validator's no-CRD install path downloads a hardcoded v2.1.0 GitHub archive; if we bump the chart pin without bumping the URL, the fallback installs v2.1.0 manifests while the chart deploys v2.2.0. To keep #715 (chore(recipes): bump 6 components to upstream latest (phase 1)) pure config with no Go changes, this bump was deferred.
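
The coupling described above can be made mechanical. As a minimal sketch, assuming the chart pin is a bare version such as 2.2.0 and the archive tag is the same version with a `v` prefix, the fallback URL can be derived from the pin and compared against the constant in the validator. The function names `trainer_archive_url` and `url_matches_pin` are hypothetical and are not part of AICR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the GitHub archive URL that the validator
# fallback would need for a given kubeflow-trainer chart version.
# Assumption: the release tag is the chart version with a "v" prefix.
trainer_archive_url() {
  local chart_version="$1"   # e.g. 2.2.0 (chart pin, no "v" prefix)
  printf 'https://github.com/kubeflow/trainer/archive/refs/tags/v%s.tar.gz\n' "$chart_version"
}

# Hypothetical check: succeeds only if the URL is the one derived from the pin.
url_matches_pin() {
  local url="$1" chart_version="$2"
  [ "$url" = "$(trainer_archive_url "$chart_version")" ]
}

# Print the URL that a 2.2.0 chart pin implies.
trainer_archive_url 2.2.0
```

A check like `url_matches_pin` could run in CI with the URL taken from the Go source and the version taken from recipes/registry.yaml, so a pin bump without a URL bump fails early; how those two values are extracted is left open here.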

kai-scheduler — verified clean

41/41 rendered templates and identical kinds/counts vs v0.13.0
Only values.yaml addition is an opt-in vpa: block (enabled: false default)
Our customizations (global.tolerations, admission.gpuPodRuntimeClassName, postCleanup.enabled) all still apply unchanged
New OCI namespace verified pullable: helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 succeeds

kubeflow-trainer — verified clean

v2.2.0 archive layout unchanged from v2.1.0: same manifests/overlays/manager kustomize, same trainjobs.trainer.kubeflow.org/v1alpha1 CRD identity, same kubeflow-system namespace
Only difference upstream is the controller-manager image tag
Chart oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 verified pullable

Companion fixes

examples/recipes/aks-training.yaml: refresh the kai-scheduler example pin (source URL + version) to track the new registry default. Matches the example-pin convention from #715 (#283, #336, #450); only this one example references kai-scheduler.
docs/user/component-catalog.md: update the KAI Scheduler upstream link from github.com/NVIDIA/KAI-Scheduler to the new github.com/kai-scheduler/KAI-Scheduler (the upstream GitHub repo migrated alongside the OCI registry).
docs/contributor/data.md and docs/contributor/component.md: update illustrative cert-manager YAML/JSON snippets from v1.17.2 to v1.20.2 to match the post-#715 registry default. These were tagged as cosmetic drift in the #715 cross-review and deferred; they are rolled in here.

Fixes: N/A
Related: #698 (follow-up items 3 and 5); follows up #715 (Phase-1 PR)

Type of Change

Build/CI/tooling
Documentation update

Component(s) Affected

Recipe engine / data (pkg/recipe) — registry default + base overlay pins
Validator (pkg/validator) — fallback archive URL in validators/performance/trainer_lifecycle.go
Docs/examples (docs/, examples/) — example pin refresh, KAI Scheduler upstream link, contributor doc snippet versions

Implementation Notes

The kai-scheduler change is two coupled lines per file (registry default + version, then overlay source + version) — the OCI source path and version pin must move together.
The kubeflow-trainer change is one registry version line plus three lines in the Go validator (URL constant + 2 doc comments referencing the version). The validator's behavior is otherwise unchanged: same kustomize overlay path, same CRD identity check.
Two v2.2.0 breaking-change consequences addressed in this PR:
- PodTemplateOverrides is replaced by runtimePatches (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the v2.2 controller no longer applies it; pods come out with no override fields. The pytorch-mnist demo TrainJob in demos/cuj1-eks.md and demos/cuj1-gke.md is migrated to the runtimePatches shape with manager: aicr.nvidia.com/demo and explicit per-cluster scheduling (EKS demo: dedicated=worker-workload:NoSchedule|NoExecute; GKE demo: dedicated=gpu-workload:NoSchedule + nvidia.com/gpu=present:NoSchedule to match the rest of the GKE flow).
- mlPolicy.torch.numProcPerNode is removed (kubeflow/trainer#3239) — Torch now infers parallelism from the container's nvidia.com/gpu resource limit. AICR's torch-distributed ClusterTrainingRuntime in recipes/components/kubeflow-trainer/manifests/ is updated from mlPolicy.torch: { numProcPerNode: auto } to mlPolicy.torch: {}, matching the v2.2.0 reference runtime. mlPolicy.mpi.numProcPerNode is unaffected upstream, so MPI test fixtures stay as-is.
Net diff: 13 files, +75/-34 lines. The site/docs/ mirror is gitignored (auto-generated from docs/); only the canonical docs/ was edited.
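
To find manifests that still use the two v2.1-era shapes listed above, a plain text scan is often enough. The sketch below is a hypothetical helper, not part of AICR. It only catches the inline `torch: { numProcPerNode: ... }` form and would miss a block-style mapping, so a clean result is a hint rather than proof.

```shell
#!/usr/bin/env bash

# Hypothetical scanner: reads YAML on stdin, prints one finding per line,
# and returns status 1 if any v2.1-era field is found (0 otherwise).
scan_stale_trainer_fields() {
  awk '
    /podTemplateOverrides/   { print "line " NR ": podTemplateOverrides (use runtimePatches)"; found = 1 }
    /torch:.*numProcPerNode/ { print "line " NR ": mlPolicy.torch.numProcPerNode (removed)"; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}

# Demo input: two fragments in the pre-migration shape.
scan_stale_trainer_fields <<'EOF' || echo "stale fields found"
# TrainJob fragment
spec:
  podTemplateOverrides: []
---
# ClusterTrainingRuntime fragment
spec:
  mlPolicy:
    torch: { numProcPerNode: auto }
EOF
```

The pattern is anchored on `torch:` so that `mlPolicy.mpi.numProcPerNode`, which is unaffected upstream, is not reported.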

Known caveat 1 — kubeflow-trainer v2.1.0 -> v2.2.0 upgrades leave CRDs stale

Helm 3/4 does NOT upgrade CRDs on helm upgrade — only on first install (CRDs ship in the chart's crds/ directory, which Helm treats as install-only). Clusters that previously deployed kubeflow-trainer v2.1.0 retain the v2.1.0 CRD after helm upgrade to v2.2.0. The v2.1.0 CRD has podTemplateOverrides but no runtimePatches — so the migrated demo TrainJob is rejected at admission with unknown field "spec.runtimePatches" until CRDs are explicitly upgraded:

helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 --untar -d /tmp/
kubectl apply -f /tmp/kubeflow-trainer/crds/ --server-side --force-conflicts

Fresh deploys (via aicr bundle + deploy.sh) are unaffected — Helm installs the v2.2.0 CRDs at first install.

A follow-up improvement is to have the bundler emit CRD upgrade commands in install.sh for charts that ship crds/, but that's out of scope for this version-bump PR.
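
To tell whether a cluster is in the stale-CRD state described above before applying the migrated demo, one option is to look for the new field name in the installed CRD. The sketch below is an assumption-laden text match: `crd_has_field` is a hypothetical helper, and a substring search can be fooled by a description string that merely mentions the field.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the CRD manifest on stdin mentions the field.
crd_has_field() {
  grep -q -- "$1"
}

# Against a live cluster the input would come from kubectl, for example:
#   kubectl get crd trainjobs.trainer.kubeflow.org -o yaml | crd_has_field runtimePatches

# Offline demo with a v2.1.0-style schema fragment (illustrative, not a real CRD).
if printf 'properties:\n  podTemplateOverrides: {}\n' | crd_has_field runtimePatches; then
  echo "CRD already has runtimePatches"
else
  echo "CRD is stale: apply the v2.2.0 crds/ directory first"
fi
```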

Known caveat 2 — kai-scheduler default queue requires explicit priorityClass for Deployment-style workloads

The kai-scheduler chart (both v0.13.0 and v0.14.1) ships default-queue and default-parent-queue with gpu.quota: 0 and limit: -1. Combined with kai's pod-grouper priority-class auto-assignment, this means:

Bare Pods / Job-style workloads → pod-grouper assigns priorityClassName: train (priority 50, preemptible) → can go over quota: 0 because limit: -1 is unlimited. Examples: the gang-scheduling-test.yaml in pkg/evidence/scripts/manifests/ and TrainJob-driven workloads. These work out of the box.
Deployment / ReplicaSet workloads → pod-grouper's Deployment Grouper auto-assigns priorityClassName: inference (priority 125, non-preemptible) → blocked by quota: 0 with NonPreemptibleOverQuota. These fail until either the workload sets an explicit preemptible priorityClass or the queue's quota is raised.
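
The two cases above reduce to a small rule. The following is a toy model of that rule only, written to make the quota/limit interaction concrete. It is not kai-scheduler's actual algorithm, it ignores current usage and parent queues, and the function name `would_schedule` is hypothetical.

```shell
#!/usr/bin/env bash

# Toy model (assumption: simplified from the behavior described above).
# Args: preemptible(true|false) requested_gpus quota limit   (limit -1 = unlimited)
would_schedule() {
  local preemptible="$1" requested="$2" quota="$3" limit="$4"
  if [ "$preemptible" = "true" ]; then
    # Preemptible work may exceed quota and is bounded only by the limit.
    [ "$limit" -eq -1 ] || [ "$requested" -le "$limit" ]
  else
    # Non-preemptible work must fit inside the queue's quota.
    [ "$requested" -le "$quota" ]
  fi
}

# Chart defaults: quota 0, limit -1.
would_schedule true 1 0 -1  && echo "train (preemptible): scheduled"
would_schedule false 1 0 -1 || echo "inference (non-preemptible): NonPreemptibleOverQuota"
```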

Verified empirically on a deployed cluster (kai v0.14.1):

gang-scheduling-test.yaml applied as-is on a fresh cluster → both pods scheduled, completed in ~14s (auto-train priority).
2-replica Deployment with schedulerName: kai-scheduler and DRA ResourceClaimTemplate requesting gpu.nvidia.com devices → blocked with NonPreemptibleOverQuota (auto-inference priority).
Same Deployment with explicit priorityClassName: train added to the pod template → both replicas scheduled, each with its own H100 via DRA. Quota stayed at 0; no chart values change required.

Workaround for Deployment-style kai workloads: set priorityClassName: train (or build / build-preemptible, depending on workload semantics) on the pod template. The chart ships these priorityClasses in templates/priorityclasses/.
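
As a minimal sketch of the workaround, the script below writes a Deployment whose pod template sets `priorityClassName: train`. The Deployment name, labels, and image are placeholders, and the GPU request (the DRA ResourceClaimTemplate used in the test above) is omitted to keep the example short, so this is not the manifest that was actually tested.

```shell
#!/usr/bin/env bash

manifest="$(mktemp)"

# Write a Deployment that opts into a preemptible priorityClass.
cat > "$manifest" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kai-priority-demo            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: {app: kai-priority-demo}
  template:
    metadata:
      labels: {app: kai-priority-demo}
    spec:
      schedulerName: kai-scheduler
      priorityClassName: train       # explicit preemptible class instead of auto-assigned inference
      containers:
        - name: main
          image: registry.k8s.io/pause:3.9   # placeholder image
EOF

echo "wrote $manifest"
# To try it on a cluster with kai-scheduler installed:
#   kubectl apply -f "$manifest"
```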

This is pre-existing kai/AICR behavior — the v0.13.0 chart shipped the same defaults; it's surfaced now because PR #720 is the first time the kai-scheduler bump is being explicitly tested with a schedulerName: kai-scheduler Deployment-style workload. A follow-up improvement could either (a) document this prominently in docs/user/... or (b) override defaultQueue.parentResources.gpu.quota in recipes/components/kai-scheduler/values.yaml so inference-style workloads work without per-workload priorityClass tagging. Out of scope for this version-bump PR.

Testing

make tidy                                   # no-op (clean)
make lint                                   # 0 issues
go test -count=1 ./pkg/recipe/...           # ok
go test -count=1 ./validators/performance/... # ok

# End-to-end: bundle a kubeflow-using EKS training recipe
$ aicr recipe --service eks --accelerator h100 --intent training \
              --os ubuntu --platform kubeflow -o recipe.yaml
$ aicr bundle -r recipe.yaml -o /tmp/bundle
... succeeds; kai-scheduler and kubeflow-trainer per-component
    install.sh artifacts produced cleanly.

Risk Assessment

Low: both bumps are minor-version bumps within the same major. The values surface is unchanged (kai-scheduler) or cosmetic-only (kubeflow-trainer), and the two kubeflow-trainer v2.2.0 API changes that touch AICR are handled in this PR (see Implementation Notes). The OCI namespace migration is a registry-side path change verified end-to-end. Doc changes are illustrative or link updates only.

Rollout notes: No migration steps for fresh deploys. Clusters upgrading kubeflow-trainer from v2.1.0 keep the v2.1.0 CRDs after helm upgrade and need the manual CRD apply described in Known caveat 1. Bundles regenerated post-merge will pull from the new kai-scheduler OCI namespace and the new kubeflow-trainer chart version. Existing installations are unaffected until re-bundled. The validator fallback path will install the v2.2.0 trainer if invoked on a cluster without the chart pre-installed.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (N/A — version bumps + URL constant)
I updated docs if user-facing behavior changed (KAI link refresh + cert-manager doc snippet versions)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-04-30T17:05:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Updated Helm chart sources and versions across recipes, overlays, examples, and tests: kai-scheduler OCI source changed to oci://ghcr.io/kai-scheduler/kai-scheduler and chart bumped v0.13.0 → v0.14.1; kubeflow-trainer bumped 2.1.0 → 2.2.0 and its installer archive tag updated. Documentation examples (cert-manager bumped to v1.20.2) and the component catalog link were adjusted. Demo docs replaced podTemplateOverrides with runtimePatches. A ClusterTrainingRuntime manifest removed mlPolicy.torch.numProcPerNode in favor of an empty torch: {} object. Test golden and testdata updated accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title accurately summarizes the main changes: bumping kai-scheduler to v0.14.1 and kubeflow-trainer to 2.2.0, with clear component names and versions. |
| Description check | ✅ Passed | Description thoroughly documents the PR changes, motivation, implementation notes, testing, and risk assessment, directly relating to the changeset. |
| Linked Issues check | ✅ Passed | PR fulfills Phase-2 follow-ups from `#715`: kai-scheduler v0.13.0→v0.14.1 with OCI migration, kubeflow-trainer 2.1.0→2.2.0 with validator URL bump, plus companion doc/example fixes. |
| Out of Scope Changes check | ✅ Passed | All changes align with stated objectives: chart bumps, OCI migration, validator URL, example/doc updates, and demo migration to runtimePatches. No unrelated scope creep. |

yuanchen8911 · 2026-04-30T17:07:48Z

@coderabbitai review

coderabbitai · 2026-04-30T17:07:54Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/registry.yaml`:
- Around line 362-364: Update the Helm test fixtures in helm_test.go so they
match the registry defaults: replace any occurrences of the old version string
"v0.13.0" with "v0.14.1" and replace the old repository
"oci://ghcr.io/nvidia/kai-scheduler" with
"oci://ghcr.io/kai-scheduler/kai-scheduler"; ensure the test expectations
(fixture YAML/strings that reference defaultRepository and defaultVersion) and
any assertions in pkg/bundler/deployer/helm/helm_test.go reflect these new
values.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3364be62-5311-40fb-8877-4651ee2b90a4

📥 Commits

Reviewing files that changed from the base of the PR and between 1faade0 and c86e0ad.

📒 Files selected for processing (7)

docs/contributor/component.md
docs/contributor/data.md
docs/user/component-catalog.md
examples/recipes/aks-training.yaml
recipes/overlays/base.yaml
recipes/registry.yaml
validators/performance/trainer_lifecycle.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@demos/cuj1-gke.md`:
- Around line 115-125: The toleration values in the TrainJob YAML snippet are
inconsistent with the rest of the CUJ: under the tolerations block (keys
"tolerations", "key", "operator", "value", "effect" alongside "nodeSelector:
nodeGroup: gpu-worker") replace both occurrences of value: worker-workload with
value: gpu-workload so the taint used by the TrainJob matches the gpu-workload
taint referenced earlier in the guide (snapshot/bundle/validate commands).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: aa94ea30-c0ef-4bd8-af1d-b0cc1dd6c4c0

📥 Commits

Reviewing files that changed from the base of the PR and between 7b0951e and b6ec050.

📒 Files selected for processing (12)

demos/cuj1-eks.md
demos/cuj1-gke.md
docs/contributor/component.md
docs/contributor/data.md
docs/user/component-catalog.md
examples/recipes/aks-training.yaml
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
recipes/overlays/base.yaml
recipes/registry.yaml
validators/performance/trainer_lifecycle.go

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:
  kai-scheduler v0.13.0 -> v0.14.1
  kubeflow-trainer 2.1.0 -> 2.2.0

kai-scheduler: chart bump and OCI registry namespace migration (NVIDIA#698 follow-up item 3)

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream.

v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer: chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up item 5)

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one.

v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow).

2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...

yuanchen8911 · 2026-04-30T21:30:10Z

@coderabbitai review

coderabbitai · 2026-04-30T21:30:18Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

The `nvsentinel` registry entry declared:

  defaultRepository: https://helm.ngc.nvidia.com/nvidia
  defaultChart: nvidia/nvsentinel

But the chart isn't published to the HTTPS NGC index, only to the OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are silently ignored today: every nvsentinel-using overlay sets its own `source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken HTTPS default never resolves. But anyone relying on the registry defaults (e.g. via `aicr bundle` without explicit overlay overrides on this entry) would hit the dead path.

Update the defaults to match what every overlay already uses:

  defaultRepository: oci://ghcr.io/nvidia
  defaultChart: nvsentinel

Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path in `defaultRepository`, bare chart name in `defaultChart`).

Verified locally:

  $ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
  Pulled.
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... generates upstream.env with
      CHART='oci://ghcr.io/nvidia/nvsentinel'
      REPO=''
      VERSION='v1.3.0'

Note: other NGC HTTPS entries in the registry (gpu-operator, network-operator, nodewright-operator, nvidia-dra-driver-gpu) are unchanged; those charts are genuinely served by the HTTPS NGC index. nvsentinel is special because it ships only via OCI.

Refs: NVIDIA#698 (Phase 1 follow-up item 2)

yuanchen8911 requested review from a team as code owners April 30, 2026 17:04

yuanchen8911 added the enhancement, area/recipes, and area/validator labels Apr 30, 2026

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 5dcd08b to 1faade0 Compare April 30, 2026 17:04

github-actions Bot added size/S and removed area/validator labels Apr 30, 2026

yuanchen8911 mentioned this pull request Apr 30, 2026

chore(recipes): check and update runtime component versions across all recipes #698

Closed

7 tasks

yuanchen8911 marked this pull request as draft April 30, 2026 17:06

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 1faade0 to df0b237 Compare April 30, 2026 17:31

github-actions Bot added the area/docs label Apr 30, 2026

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from df0b237 to c86e0ad Compare April 30, 2026 17:34

coderabbitai Bot reviewed Apr 30, 2026

Comment thread recipes/registry.yaml

mchmarny previously approved these changes Apr 30, 2026

mchmarny assigned yuanchen8911 Apr 30, 2026

yuanchen8911 dismissed mchmarny’s stale review via 7b0951e April 30, 2026 19:45

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from c86e0ad to 7b0951e Compare April 30, 2026 19:45

github-actions Bot added the area/bundler label Apr 30, 2026

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 7b0951e to dc1e70f Compare April 30, 2026 20:59

github-actions Bot added size/M and removed size/S labels Apr 30, 2026

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from dc1e70f to b6ec050 Compare April 30, 2026 21:00

coderabbitai Bot reviewed Apr 30, 2026

Comment thread demos/cuj1-gke.md Outdated

yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from b6ec050 to 2a22f17 Compare April 30, 2026 21:25

yuanchen8911 marked this pull request as ready for review April 30, 2026 21:45

yuanchen8911 requested a review from mchmarny April 30, 2026 21:45

yuanchen8911 enabled auto-merge (squash) April 30, 2026 21:55

yuanchen8911 requested a review from lockwobr April 30, 2026 21:59

yuanchen8911 mentioned this pull request Apr 30, 2026

fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724

Merged

10 tasks

lockwobr approved these changes Apr 30, 2026

yuanchen8911 merged commit 14ff3fa into NVIDIA:main Apr 30, 2026
89 checks passed

yuanchen8911 mentioned this pull request Apr 30, 2026

fix(recipes): correct nvsentinel registry default to OCI source #725

Merged

10 tasks

yuanchen8911 mentioned this pull request May 13, 2026

chore(recipes): migrate kgateway -> agentgateway for v2.2 inference routing #871

Merged

15 tasks

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:chore/recipes-bump-kai-kubeflow
