
chore(recipes): bump kai-scheduler v0.14.1 and kubeflow-trainer 2.2.0 #720

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:chore/recipes-bump-kai-kubeflow
Apr 30, 2026


@yuanchen8911 yuanchen8911 commented Apr 30, 2026

Summary

Two Phase-2 follow-ups from #698, batched into one PR because both are small chart-pin changes coupled to a single non-pin tweak each. Also includes minor docs cleanup picked up alongside.

| Component | Current | New | Coupled change |
| --- | --- | --- | --- |
| kai-scheduler | v0.13.0 | v0.14.1 | OCI registry namespace migration (oci://ghcr.io/nvidia/kai-scheduler → oci://ghcr.io/kai-scheduler/kai-scheduler) |
| kubeflow-trainer | 2.1.0 | 2.2.0 | Validator fallback archive URL bump in validators/performance/trainer_lifecycle.go |

Motivation / Context

Both bumps were excluded from the Phase-1 PR (#715) for clean reasons:

  • kai-scheduler: when the upstream repo was transferred from the NVIDIA/ org to the kai-scheduler/ org, chart publishing moved with it. The old ghcr.io/nvidia/kai-scheduler namespace is frozen at v0.13.0; the full release stream lives at ghcr.io/kai-scheduler/kai-scheduler. AICR's recipe was still pointing at the frozen old path. This is an OCI source migration plus a version bump, two coupled changes that belong together.
  • kubeflow-trainer: the chart pin in recipes/registry.yaml is coupled with the hardcoded fallback archive URL in validators/performance/trainer_lifecycle.go. The validator's no-CRD install path downloads a hardcoded v2.1.0 GitHub archive; if we bump the chart pin without bumping the URL, the fallback installs v2.1.0 manifests while the chart deploys v2.2.0. This bump was deferred to keep #715 (chore(recipes): bump 6 components to upstream latest (phase 1)) pure config with no Go changes.
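
The pin/URL coupling in the kubeflow-trainer bullet could in principle be avoided by deriving the fallback URL from the pinned version. The following is a minimal Go sketch of that idea, not the code in this PR: `trainerChartVersion` and `fallbackArchiveURL` are hypothetical names, and it assumes the upstream Git tag is the chart version with a `v` prefix (chart 2.2.0, tag v2.2.0), as the versions quoted in this description suggest.

```go
package main

import (
	"fmt"
	"strings"
)

// trainerChartVersion is a hypothetical single source of truth for the pin.
// In the actual repo the pin lives in recipes/registry.yaml and the URL is a
// separate hardcoded constant in trainer_lifecycle.go.
const trainerChartVersion = "2.2.0"

// fallbackArchiveURL builds the GitHub archive URL for a chart version.
// Assumption: the release tag equals the chart version prefixed with "v".
func fallbackArchiveURL(chartVersion string) string {
	tag := "v" + strings.TrimPrefix(chartVersion, "v")
	return fmt.Sprintf("https://github.com/kubeflow/trainer/archive/refs/tags/%s.tar.gz", tag)
}

func main() {
	fmt.Println(fallbackArchiveURL(trainerChartVersion))
}
```

With a helper of this shape, a unit test could assert that the URL used by the validator matches the registry pin, so a future bump cannot move one without the other.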

kai-scheduler — verified clean

  • 41/41 rendered templates and identical kinds/counts vs v0.13.0
  • Only values.yaml addition is an opt-in vpa: block (enabled: false default)
  • Our customizations (global.tolerations, admission.gpuPodRuntimeClassName, postCleanup.enabled) all still apply unchanged
  • New OCI namespace verified pullable: helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 succeeds

kubeflow-trainer — verified clean

  • v2.2.0 archive layout unchanged from v2.1.0: same manifests/overlays/manager kustomize, same trainjobs.trainer.kubeflow.org/v1alpha1 CRD identity, same kubeflow-system namespace
  • Only difference upstream is the controller-manager image tag
  • Chart oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 verified pullable

Companion fixes

Fixes: N/A
Related: #698 (follow-up items 3 and 5); follows up #715 (Phase-1 PR)

Type of Change

  • Build/CI/tooling
  • Documentation update

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — registry default + base overlay pins
  • Validator (pkg/validator) — fallback archive URL in validators/performance/trainer_lifecycle.go
  • Docs/examples (docs/, examples/) — example pin refresh, KAI Scheduler upstream link, contributor doc snippet versions

Implementation Notes

  • The kai-scheduler change is two coupled lines per file (registry default + version, then overlay source + version) — the OCI source path and version pin must move together.
  • The kubeflow-trainer change is one registry version line plus three lines in the Go validator (URL constant + 2 doc comments referencing the version). The validator's behavior is otherwise unchanged: same kustomize overlay path, same CRD identity check.
  • Two v2.2.0 breaking-change consequences addressed in this PR:
    • PodTemplateOverrides is replaced by runtimePatches (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the v2.2 controller no longer applies it; pods come out with no override fields. The pytorch-mnist demo TrainJob in demos/cuj1-eks.md and demos/cuj1-gke.md is migrated to the runtimePatches shape with manager: aicr.nvidia.com/demo and explicit per-cluster scheduling (EKS demo: dedicated=worker-workload:NoSchedule|NoExecute; GKE demo: dedicated=gpu-workload:NoSchedule + nvidia.com/gpu=present:NoSchedule to match the rest of the GKE flow).
    • mlPolicy.torch.numProcPerNode is removed (kubeflow/trainer#3239) — Torch now infers parallelism from the container's nvidia.com/gpu resource limit. AICR's torch-distributed ClusterTrainingRuntime in recipes/components/kubeflow-trainer/manifests/ is updated from mlPolicy.torch: { numProcPerNode: auto } to mlPolicy.torch: {}, matching the v2.2.0 reference runtime. mlPolicy.mpi.numProcPerNode is unaffected upstream, so MPI test fixtures stay as-is.
  • Net diff: 13 files, +75/-34 lines. The site/docs/ mirror is gitignored (auto-generated from docs/); only the canonical docs/ was edited.

Known caveat 1 — kubeflow-trainer v2.1.0 -> v2.2.0 upgrades leave CRDs stale

Helm 3/4 does NOT upgrade CRDs on helm upgrade — only on first install (CRDs ship in the chart's crds/ directory, which Helm treats as install-only). Clusters that previously deployed kubeflow-trainer v2.1.0 retain the v2.1.0 CRD after helm upgrade to v2.2.0. The v2.1.0 CRD has podTemplateOverrides but no runtimePatches — so the migrated demo TrainJob is rejected at admission with unknown field "spec.runtimePatches" until CRDs are explicitly upgraded:

helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 --untar -d /tmp/
kubectl apply -f /tmp/kubeflow-trainer/crds/ --server-side --force-conflicts

Fresh deploys (via aicr bundle + deploy.sh) are unaffected — Helm installs the v2.2.0 CRDs at first install.

A follow-up improvement is to have the bundler emit CRD upgrade commands in install.sh for charts that ship crds/, but that's out of scope for this version-bump PR.
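
To illustrate what that follow-up might look like, here is a rough Go sketch of a helper that renders the two CRD upgrade commands for a given chart. `crdUpgradeCommands` is a hypothetical name, not an existing bundler function. The sketch assumes the chart untars into a directory named after the last path segment of the chart reference and ships its CRDs under crds/.

```go
package main

import (
	"fmt"
	"path"
)

// crdUpgradeCommands returns the shell commands that pull a chart and
// server-side apply its crds/ directory. Hypothetical helper for illustration.
func crdUpgradeCommands(chartRef, version, destDir string) []string {
	// Assumption: "helm pull --untar" extracts into <destDir>/<chart name>.
	crdDir := path.Join(destDir, path.Base(chartRef), "crds") + "/"
	return []string{
		fmt.Sprintf("helm pull %s --version %s --untar -d %s", chartRef, version, destDir),
		fmt.Sprintf("kubectl apply -f %s --server-side --force-conflicts", crdDir),
	}
}

func main() {
	cmds := crdUpgradeCommands("oci://ghcr.io/kubeflow/charts/kubeflow-trainer", "2.2.0", "/tmp/")
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

For the kubeflow-trainer chart and /tmp/ as the destination, this renders the same two commands listed in this caveat.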

Known caveat 2 — kai-scheduler default queue requires explicit priorityClass for Deployment-style workloads

The kai-scheduler chart (both v0.13.0 and v0.14.1) ships default-queue and default-parent-queue with gpu.quota: 0 and limit: -1. Combined with kai's pod-grouper priority-class auto-assignment, this means:

  • Bare Pods / Job-style workloads → pod-grouper assigns priorityClassName: train (priority 50, preemptible) → can go over quota: 0 because limit: -1 is unlimited. Examples: the gang-scheduling-test.yaml in pkg/evidence/scripts/manifests/ and TrainJob-driven workloads. These work out of the box.
  • Deployment / ReplicaSet workloads → pod-grouper's Deployment Grouper auto-assigns priorityClassName: inference (priority 125, non-preemptible) → blocked by quota: 0 with NonPreemptibleOverQuota. These fail until either the workload sets an explicit preemptible priorityClass or the queue's quota is raised.
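
The two cases above can be summarized with a small Go model. This is a simplified sketch of the behavior as described in this caveat, not kai-scheduler's actual algorithm: the `Queue` type, the `admit` function, and the `OverLimit` reason string are all hypothetical, while `NonPreemptibleOverQuota` is the reason observed on the cluster.

```go
package main

import "fmt"

// Queue holds the two GPU settings discussed above.
// A Limit of -1 means unlimited, matching the chart defaults.
type Queue struct {
	Quota int
	Limit int
}

// admit is a simplified model: any workload is capped by the limit, and only
// non-preemptible workloads are additionally capped by the quota.
func admit(q Queue, allocated, requested int, preemptible bool) (bool, string) {
	total := allocated + requested
	if q.Limit >= 0 && total > q.Limit {
		return false, "OverLimit" // hypothetical reason name
	}
	if !preemptible && total > q.Quota {
		return false, "NonPreemptibleOverQuota"
	}
	return true, ""
}

func main() {
	defaultQueue := Queue{Quota: 0, Limit: -1}

	// priorityClassName: train (preemptible)
	ok, reason := admit(defaultQueue, 0, 1, true)
	fmt.Println("train:", ok, reason)

	// priorityClassName: inference (non-preemptible)
	ok, reason = admit(defaultQueue, 0, 1, false)
	fmt.Println("inference:", ok, reason)
}
```

In this model, raising the quota or switching the workload to a preemptible priority class both change the outcome for the inference case, which corresponds to the two remedies named above.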

Verified empirically on a deployed cluster (kai v0.14.1):

  • gang-scheduling-test.yaml applied as-is on a fresh cluster → both pods scheduled, completed in ~14s (auto-train priority).
  • 2-replica Deployment with schedulerName: kai-scheduler and DRA ResourceClaimTemplate requesting gpu.nvidia.com devices → blocked with NonPreemptibleOverQuota (auto-inference priority).
  • Same Deployment with explicit priorityClassName: train added to the pod template → both replicas scheduled, each with its own H100 via DRA. Quota stayed at 0; no chart values change required.

Workaround for Deployment-style kai workloads: set priorityClassName: train (or build / build-preemptible, depending on workload semantics) on the pod template. The chart ships these priorityClasses in templates/priorityclasses/.

This is pre-existing kai/AICR behavior — the v0.13.0 chart shipped the same defaults; it's surfaced now because PR #720 is the first time the kai-scheduler bump is being explicitly tested with a schedulerName: kai-scheduler Deployment-style workload. A follow-up improvement could either (a) document this prominently in docs/user/... or (b) override defaultQueue.parentResources.gpu.quota in recipes/components/kai-scheduler/values.yaml so inference-style workloads work without per-workload priorityClass tagging. Out of scope for this version-bump PR.

Testing

make tidy                                   # no-op (clean)
make lint                                   # 0 issues
go test -count=1 ./pkg/recipe/...           # ok
go test -count=1 ./validators/performance/... # ok

# End-to-end: bundle a kubeflow-using EKS training recipe
$ aicr recipe --service eks --accelerator h100 --intent training \
              --os ubuntu --platform kubeflow -o recipe.yaml
$ aicr bundle -r recipe.yaml -o /tmp/bundle
... succeeds; kai-scheduler and kubeflow-trainer per-component
    install.sh artifacts produced cleanly.

Risk Assessment

  • Low — Both bumps are minor within the same major; values surfaces are unchanged (kai-scheduler) or cosmetic-only (kubeflow-trainer); the OCI namespace migration is a registry-side path change verified end-to-end. Doc changes are illustrative / link updates only.

Rollout notes: No migration steps. Bundles regenerated post-merge will pull from the new kai-scheduler OCI namespace and the new kubeflow-trainer chart version. Existing installations are unaffected until re-bundled. The validator fallback path will install the v2.2.0 trainer if invoked on a cluster without the chart pre-installed.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — version bumps + URL constant)
  • I updated docs if user-facing behavior changed (KAI link refresh + cert-manager doc snippet versions)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested review from a team as code owners April 30, 2026 17:04
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 5dcd08b to 1faade0 Compare April 30, 2026 17:04

coderabbitai Bot commented Apr 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Updated Helm chart sources and versions across recipes, overlays, examples, and tests: kai-scheduler OCI source changed to oci://ghcr.io/kai-scheduler/kai-scheduler and chart bumped v0.13.0 → v0.14.1; kubeflow-trainer bumped 2.1.0 → 2.2.0 and its installer archive tag updated. Documentation examples (cert-manager bumped to v1.20.2) and the component catalog link were adjusted. Demo docs replaced podTemplateOverrides with runtimePatches. A ClusterTrainingRuntime manifest removed mlPolicy.torch.numProcPerNode in favor of an empty torch: {} object. Test golden and testdata updated accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title accurately summarizes the main changes: bumping kai-scheduler to v0.14.1 and kubeflow-trainer to 2.2.0, with clear component names and versions. |
| Description check | ✅ Passed | Description thoroughly documents the PR changes, motivation, implementation notes, testing, and risk assessment, directly relating to the changeset. |
| Linked Issues check | ✅ Passed | PR fulfills Phase-2 follow-ups from #715: kai-scheduler v0.13.0→v0.14.1 with OCI migration, kubeflow-trainer 2.1.0→2.2.0 with validator URL bump, plus companion doc/example fixes. |
| Out of Scope Changes check | ✅ Passed | All changes align with stated objectives: chart bumps, OCI migration, validator URL, example/doc updates, and demo migration to runtimePatches; no unrelated scope creep. |


@yuanchen8911
Contributor Author

@coderabbitai review


coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 1faade0 to df0b237 Compare April 30, 2026 17:31
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from df0b237 to c86e0ad Compare April 30, 2026 17:34

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/registry.yaml`:
- Around line 362-364: Update the Helm test fixtures in helm_test.go so they
match the registry defaults: replace any occurrences of the old version string
"v0.13.0" with "v0.14.1" and replace the old repository
"oci://ghcr.io/nvidia/kai-scheduler" with
"oci://ghcr.io/kai-scheduler/kai-scheduler"; ensure the test expectations
(fixture YAML/strings that reference defaultRepository and defaultVersion) and
any assertions in pkg/bundler/deployer/helm/helm_test.go reflect these new
values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3364be62-5311-40fb-8877-4651ee2b90a4

📥 Commits

Reviewing files that changed from the base of the PR and between 1faade0 and c86e0ad.

📒 Files selected for processing (7)
  • docs/contributor/component.md
  • docs/contributor/data.md
  • docs/user/component-catalog.md
  • examples/recipes/aks-training.yaml
  • recipes/overlays/base.yaml
  • recipes/registry.yaml
  • validators/performance/trainer_lifecycle.go

Comment thread recipes/registry.yaml
mchmarny
mchmarny previously approved these changes Apr 30, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from c86e0ad to 7b0951e Compare April 30, 2026 19:45
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from 7b0951e to dc1e70f Compare April 30, 2026 20:59
@github-actions github-actions Bot added size/M and removed size/S labels Apr 30, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from dc1e70f to b6ec050 Compare April 30, 2026 21:00

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@demos/cuj1-gke.md`:
- Around line 115-125: The toleration values in the TrainJob YAML snippet are
inconsistent with the rest of the CUJ: under the tolerations block (keys
"tolerations", "key", "operator", "value", "effect" alongside "nodeSelector:
nodeGroup: gpu-worker") replace both occurrences of value: worker-workload with
value: gpu-workload so the taint used by the TrainJob matches the gpu-workload
taint referenced earlier in the guide (snapshot/bundle/validate commands).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: aa94ea30-c0ef-4bd8-af1d-b0cc1dd6c4c0

📥 Commits

Reviewing files that changed from the base of the PR and between 7b0951e and b6ec050.

📒 Files selected for processing (12)
  • demos/cuj1-eks.md
  • demos/cuj1-gke.md
  • docs/contributor/component.md
  • docs/contributor/data.md
  • docs/user/component-catalog.md
  • examples/recipes/aks-training.yaml
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
  • recipes/overlays/base.yaml
  • recipes/registry.yaml
  • validators/performance/trainer_lifecycle.go

Comment thread demos/cuj1-gke.md Outdated
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up #3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, demo
migration to RuntimePatches, and ClusterTrainingRuntime alignment
(NVIDIA#698 follow-up #5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

  1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309).
     The CRD still admits the old field for compat but the v2.2
     controller no longer applies it. The pytorch-mnist demo TrainJob
     in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to
     the `runtimePatches` shape with `manager: aicr.nvidia.com/demo`
     and explicit per-cluster scheduling (the EKS demo carries the
     AICR-standard `dedicated=worker-workload` tolerations + NoExecute
     effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule`
     and `nvidia.com/gpu=present:NoSchedule` to match the rest of the
     GKE flow).

  2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239).
     Upstream removed the field from the Torch policy because it now
     infers parallelism from the container's `nvidia.com/gpu` limit.
     `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI
     test fixtures stay as-is. AICR's `torch-distributed`
     ClusterTrainingRuntime is updated from
     `mlPolicy.torch: { numProcPerNode: auto }` to
     `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
demo TrainJob admitted, pod scheduled with the migrated runtimePatches,
training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade
baseline). 2-replica Deployment with `schedulerName: kai-scheduler` +
DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also
scheduled cleanly with `priorityClassName: train` (each replica got
its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-kai-kubeflow branch from b6ec050 to 2a22f17 Compare April 30, 2026 21:25
@yuanchen8911
Contributor Author

@coderabbitai review


coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 30, 2026 21:45
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 30, 2026 21:45
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 30, 2026 21:55
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 30, 2026 21:59
@yuanchen8911 yuanchen8911 merged commit 14ff3fa into NVIDIA:main Apr 30, 2026
89 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
The `nvsentinel` registry entry declared:

    defaultRepository: https://helm.ngc.nvidia.com/nvidia
    defaultChart: nvidia/nvsentinel

But the chart isn't published to the HTTPS NGC index — only to the
OCI registry at `oci://ghcr.io/nvidia/nvsentinel`. The defaults are
silently ignored today: every nvsentinel-using overlay sets its own
`source: oci://ghcr.io/nvidia` + chart `nvsentinel`, so the broken
HTTPS default never resolves. But anyone relying on the registry
defaults (e.g. via `aicr bundle` without explicit overlay overrides
on this entry) would hit the dead path.

Update the defaults to match what every overlay already uses:

    defaultRepository: oci://ghcr.io/nvidia
    defaultChart: nvsentinel

Same shape as the kai-scheduler entry post-NVIDIA#720 (OCI registry path
in `defaultRepository`, bare chart name in `defaultChart`). Verified
locally:

  $ helm pull oci://ghcr.io/nvidia/nvsentinel --version v1.3.0
  Pulled.
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... generates upstream.env with
      CHART='oci://ghcr.io/nvidia/nvsentinel'
      REPO=''
      VERSION='v1.3.0'

Note: other NGC HTTPS entries in the registry (gpu-operator,
network-operator, nodewright-operator, nvidia-dra-driver-gpu) are
unchanged — those charts are genuinely served by the HTTPS NGC
index. nvsentinel is special because it ships only via OCI.

Refs: NVIDIA#698 (Phase 1 follow-up #2)
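
The mapping from registry defaults to the upstream.env values quoted in this commit message can be sketched in Go. This is an illustrative guess at the rule, not the bundler's actual code: `chartAndRepo` is a hypothetical helper, and the non-OCI branch (bare chart name plus a separate repository URL) is an assumption. Only the OCI result is taken from the output shown in the message.

```go
package main

import (
	"fmt"
	"strings"
)

// chartAndRepo turns registry defaults into CHART and REPO values.
// For OCI sources the chart name is appended to the registry path and REPO is
// left empty. The HTTPS branch is an assumption made for illustration.
func chartAndRepo(defaultRepository, defaultChart string) (chart, repo string) {
	if strings.HasPrefix(defaultRepository, "oci://") {
		return strings.TrimSuffix(defaultRepository, "/") + "/" + defaultChart, ""
	}
	return defaultChart, defaultRepository
}

func main() {
	chart, repo := chartAndRepo("oci://ghcr.io/nvidia", "nvsentinel")
	fmt.Printf("CHART='%s'\nREPO='%s'\n", chart, repo)
}
```

The same rule applied to the kai-scheduler defaults yields oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler, the reference used in the helm pull check for that chart.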
3 participants