9 changes: 9 additions & 0 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -143,6 +143,15 @@ release. Look for failed installs, unexpected values, missing namespace, wrong
image tag, TLS settings that do not match the registered endpoint, and
scheduling failures.

When no external credential driver is enabled, the Helm chart uses the
gateway's default encrypted database credential storage. The chart creates a
retained Kubernetes Secret for the shared KEK, injects it into gateway pods, and
stores encrypted credential envelopes in the OpenShell database. For
`workload.kind=deployment` or multi-replica gateways, confirm
`server.externalDbSecret` points at a shared database. A render/install error
mentioning `server.credentialDrivers` means the values selected multiple
external credential backends.

For HA or PostgreSQL-backed installs, also check the external database Secret
referenced by `server.externalDbSecret` and the PostgreSQL workload if the test
or operator deployed one in-cluster:
47 changes: 45 additions & 2 deletions .github/workflows/branch-e2e.yml
@@ -24,6 +24,7 @@ jobs:
run_core_e2e: ${{ steps.labels.outputs.run_core_e2e }}
run_gpu_e2e: ${{ steps.labels.outputs.run_gpu_e2e }}
run_kubernetes_ha_e2e: ${{ steps.labels.outputs.run_kubernetes_ha_e2e }}
run_kubernetes_credential_drivers_e2e: ${{ steps.labels.outputs.run_kubernetes_credential_drivers_e2e }}
run_any_e2e: ${{ steps.labels.outputs.run_any_e2e }}
steps:
- uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
@@ -41,12 +42,14 @@ jobs:
run_core_e2e=true
run_gpu_e2e=true
run_kubernetes_ha_e2e=true
run_kubernetes_credential_drivers_e2e=true
else
run_core_e2e="$(jq -r 'index("test:e2e") != null' <<< "$LABELS_JSON")"
run_gpu_e2e="$(jq -r 'index("test:e2e-gpu") != null' <<< "$LABELS_JSON")"
run_kubernetes_ha_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
run_kubernetes_credential_drivers_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
fi
if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ]; then
if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ] || [ "$run_kubernetes_credential_drivers_e2e" = "true" ]; then
run_any_e2e=true
else
run_any_e2e=false
@@ -55,12 +58,13 @@ jobs:
echo "run_core_e2e=$run_core_e2e"
echo "run_gpu_e2e=$run_gpu_e2e"
echo "run_kubernetes_ha_e2e=$run_kubernetes_ha_e2e"
echo "run_kubernetes_credential_drivers_e2e=$run_kubernetes_credential_drivers_e2e"
echo "run_any_e2e=$run_any_e2e"
} >> "$GITHUB_OUTPUT"

build-gateway:
needs: [pr_metadata]
if: needs.pr_metadata.outputs.should_run == 'true' && (needs.pr_metadata.outputs.run_core_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true')
if: needs.pr_metadata.outputs.should_run == 'true' && (needs.pr_metadata.outputs.run_core_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true')
permissions:
contents: read
packages: write
@@ -135,6 +139,18 @@ jobs:
extra-helm-values: deploy/helm/openshell/ci/values-high-availability.yaml
external-postgres-secret: openshell-ha-pg

kubernetes-credential-drivers-e2e:
needs: [pr_metadata, build-gateway, build-supervisor]
if: needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true'
permissions:
contents: read
packages: read
uses: ./.github/workflows/e2e-kubernetes-test.yml
with:
image-tag: ${{ github.sha }}
job-name: Kubernetes Credential Drivers E2E
e2e-task: e2e:kubernetes:credential-drivers

core-e2e-result:
name: Core E2E result
needs: [pr_metadata, build-gateway, build-supervisor, e2e, kubernetes-e2e]
@@ -215,3 +231,30 @@ jobs:
fi
done
exit "$failed"

kubernetes-credential-drivers-e2e-result:
name: Kubernetes Credential Drivers E2E result
needs: [pr_metadata, build-gateway, build-supervisor, kubernetes-credential-drivers-e2e]
if: always() && needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true'
runs-on: ubuntu-latest
steps:
- name: Verify Kubernetes credential drivers E2E jobs
env:
BUILD_GATEWAY_RESULT: ${{ needs.build-gateway.result }}
BUILD_SUPERVISOR_RESULT: ${{ needs.build-supervisor.result }}
KUBERNETES_CREDENTIAL_DRIVERS_E2E_RESULT: ${{ needs.kubernetes-credential-drivers-e2e.result }}
run: |
set -euo pipefail
failed=0
for item in \
"build-gateway:$BUILD_GATEWAY_RESULT" \
"build-supervisor:$BUILD_SUPERVISOR_RESULT" \
"kubernetes-credential-drivers-e2e:$KUBERNETES_CREDENTIAL_DRIVERS_E2E_RESULT"; do
name="${item%%:*}"
result="${item#*:}"
if [ "$result" != "success" ]; then
echo "::error::$name concluded $result"
failed=1
fi
done
exit "$failed"
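
The result job above packs each dependency as a `name:result` pair and splits it with shell parameter expansion. A minimal standalone sketch of that pattern follows; `check_results` is a hypothetical helper name introduced here for illustration and is not part of the workflow, which runs the loop inline and ends with `exit "$failed"`.

```shell
# Hypothetical helper (not in the workflow): accepts "name:result" pairs and
# returns 1 if any result is something other than "success".
check_results() {
  local failed=0 item name result
  for item in "$@"; do
    name="${item%%:*}"    # strip the longest ":*" suffix, leaving the job name
    result="${item#*:}"   # strip the shortest "*:" prefix, leaving the result
    if [ "$result" != "success" ]; then
      # Same annotation format the workflow emits for a non-successful job.
      echo "::error::$name concluded $result"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_results "build-gateway:success" "kubernetes-credential-drivers-e2e:skipped"` reports the skipped job and returns a non-zero status, so a `skipped` or `cancelled` dependency is treated the same as a failure.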
10 changes: 8 additions & 2 deletions .github/workflows/e2e-kubernetes-test.yml
@@ -32,6 +32,11 @@ on:
required: false
type: string
default: ""
e2e-task:
description: "mise task to run for the Kubernetes e2e job"
required: false
type: string
default: "e2e:kubernetes"
mise-version:
description: "mise version to install on the bare Kubernetes e2e runner"
required: false
@@ -112,11 +117,12 @@ jobs:
kind load image-archive "$archive" --name "$KIND_CLUSTER_NAME"
done

- name: Run Kubernetes E2E (Rust smoke)
- name: Run Kubernetes E2E
env:
OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
OPENSHELL_E2E_KUBE_EXTRA_VALUES: ${{ inputs.extra-helm-values }}
OPENSHELL_E2E_KUBE_EXTERNAL_POSTGRES_SECRET: ${{ inputs.external-postgres-secret }}
IMAGE_TAG: ${{ inputs.image-tag }}
OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
run: mise run --no-deps --skip-deps e2e:kubernetes
E2E_TASK: ${{ inputs.e2e-task }}
run: mise run --no-deps --skip-deps "$E2E_TASK"
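
The step above passes the `e2e-task` input through the `E2E_TASK` environment variable and quotes it at the call site, so the task name reaches the shell as data rather than being interpolated into the script text. A minimal dry-run sketch of the same idea follows. `run_e2e_task` is a hypothetical helper, and the in-shell fallback is an assumption made for illustration: in the workflow the default comes from the input declaration, not from the shell.

```shell
# Hypothetical dry-run helper (not in the workflow): prints the command that
# would run instead of invoking mise.
run_e2e_task() {
  # Assumption: fall back to the workflow's documented default when unset.
  local task="${E2E_TASK:-e2e:kubernetes}"
  printf 'mise run --no-deps --skip-deps %s\n' "$task"
}
```

With `E2E_TASK=e2e:kubernetes:credential-drivers` set, the helper prints the command for the credential-driver task. With the variable unset, it prints the command for `e2e:kubernetes`.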
2 changes: 1 addition & 1 deletion .github/workflows/e2e-label-help.yml
@@ -51,7 +51,7 @@ jobs:
status_summary="The matching required CI gate status on this PR will flip green automatically once the run finishes."
;;
test:e2e-kubernetes)
suite_summary="Kubernetes HA E2E"
suite_summary="Kubernetes HA and credential-driver E2E"
build_summary="gateway and supervisor images"
status_summary="This is an optional proof-of-life suite; failures are visible in the workflow run but do not publish a required CI gate status."
;;
2 changes: 2 additions & 0 deletions AGENTS.md
@@ -39,6 +39,8 @@ These pipelines connect skills into end-to-end workflows. Individual skill files
| `crates/openshell-core/` | Shared core | Common types, configuration, error handling |
| `crates/openshell-providers/` | Provider management | Credential provider backends |
| `crates/openshell-tui/` | Terminal UI | Ratatui-based dashboard for monitoring |
| `crates/openshell-driver-kubernetes-secrets/` | Kubernetes Secrets credential driver | In-process `CredentialDriver` backend for OpenShell-managed K8s Secret storage |
| `crates/openshell-driver-vault/` | Vault credential driver | In-process `CredentialDriver` backend for Vault-compatible KV storage |
| `crates/openshell-driver-kubernetes/` | Kubernetes compute driver | In-process `ComputeDriver` backend for K8s sandbox pods |
| `crates/openshell-driver-docker/` | Docker compute driver | In-process `ComputeDriver` backend for local Docker sandbox containers |
| `crates/openshell-driver-vm/` | VM compute driver | Standalone libkrun-backed `ComputeDriver` subprocess (embeds its own rootfs + runtime) |
7 changes: 4 additions & 3 deletions CI.md
@@ -15,10 +15,11 @@ Three opt-in labels enable the long-running E2E suites:
- `test:e2e` runs the standard E2E suite in `Branch E2E Checks`
- `test:e2e-gpu` runs GPU E2E in `Branch E2E Checks`
- `test:e2e-kubernetes` runs Kubernetes E2E with the HA Helm overlay
(`replicaCount: 2` and bundled PostgreSQL) in `Branch E2E Checks`
(`replicaCount: 2` and bundled PostgreSQL) and the credential-driver suite
(Kubernetes Secrets plus Vault) in `Branch E2E Checks`

When multiple labels are present, `Branch E2E Checks` builds the shared gateway and supervisor images once and fans out all enabled suites in parallel.
The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow. `test:e2e-kubernetes` is optional while HA behavior is under active iteration: failures are visible in the workflow run but do not publish a required CI gate status.
The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow. `test:e2e-kubernetes` is optional while Kubernetes HA and credential-driver behavior are under active iteration: failures are visible in the workflow run but do not publish a required CI gate status.

The GitHub ruleset should require the `OpenShell / ...` statuses published by `Required CI Gates`, not the push-triggered workflow jobs directly.

@@ -110,7 +111,7 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma
| File | Role |
|---|---|
| `.github/workflows/branch-checks.yml` | Required non-E2E PR checks. Triggers on `push: pull-request/[0-9]+`. |
| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, and Kubernetes HA E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, Kubernetes HA, and Kubernetes credential-driver E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
| `.github/workflows/helm-lint.yml` | Helm chart validation. Triggers on `push: pull-request/[0-9]+` and skips lint jobs unless Helm inputs changed. |
| `.github/actions/pr-gate/action.yml` | Composite action that resolves PR metadata and verifies the required label is set. |
| `.github/actions/pr-merge-base/action.yml` | Composite action that resolves and fetches the merge-base commit for `pull-request/<N>` push workflows. |
62 changes: 62 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -88,6 +88,7 @@ sha2 = "0.10"
rand = "0.9"
jsonwebtoken = "9"
getrandom = "0.3"
ring = "0.17"
spiffe = { version = "0.15", default-features = false, features = ["workload-api-jwt", "tracing"] }

# Filesystem embedding
10 changes: 10 additions & 0 deletions TESTING.md
@@ -150,6 +150,7 @@ Suites:
- Common suite (`--features e2e`) - driver-neutral CLI behavior, sandbox lifecycle, sync, port forwarding, policy, and provider tests.
- Docker suite (`--features e2e-docker`) - common suite plus Docker-only coverage such as Dockerfile image builds, Docker preflight checks, and managed Docker gateway resume.
- Docker GPU suite (`--features e2e-docker-gpu`) - Docker suite plus GPU sandbox smoke coverage.
- Kubernetes credential-driver suite (`--features e2e-kubernetes-credential-drivers`) - targeted Kubernetes Secrets and Vault provider credential storage coverage.

GPU device-selection tests compare OpenShell sandboxes against a plain Docker or
Podman container that requests `--device nvidia.com/gpu=all`. The probe image
@@ -173,6 +174,14 @@ Run the Podman-backed Rust CLI e2e suite:
mise run e2e:podman
```

Run the targeted Kubernetes credential-driver e2e suite. This deploys an
OpenBao fixture for the Vault-compatible driver path and validates Kubernetes
Secrets and Vault storage backends one at a time:

```shell
mise run e2e:kubernetes:credential-drivers
```

Run a single test directly with cargo:

```shell
@@ -203,3 +212,4 @@ The harness (`e2e/rust/src/harness/`) provides:
| `OPENSHELL_GATEWAY` | Override active gateway name for E2E tests |
| `OPENSHELL_GATEWAY_ENDPOINT` | Run E2E tests against an existing plaintext HTTP gateway endpoint |
| `OPENSHELL_E2E_DRIVER` | Driver name exported by the e2e gateway wrapper (`docker`, `podman`, or `vm`) |
| `OPENSHELL_E2E_CREDENTIAL_DRIVERS` | Enables the Kubernetes credential-driver fixture path in `e2e/with-kube-gateway.sh` |
11 changes: 8 additions & 3 deletions architecture/gateway.md
@@ -159,9 +159,14 @@ default WAL journal mode), which mirror the same sensitive contents.
Persisted state includes sandboxes, providers, provider credential refresh
state, SSH sessions, policy revisions, settings, inference configuration, and
deployment records. Provider refresh material is stored as a separate object
scoped to the provider instance through `objects.scope`; the provider record
keeps only the current injectable credential values and optional per-credential
expiry timestamps.
scoped to the provider instance through `objects.scope`. Provider records keep
inline credential values only for legacy records created before credential
driver storage. New provider writes keep driver-owned credential handles and
optional per-credential expiry timestamps. When no external credential driver
is configured, gateways use server-owned encrypted database credential storage
for defense in depth. Multi-replica deployments can use that default with a
shared database and shared key-encryption key, or opt into an external backend such as Vault
or Kubernetes Secrets.

### Optimistic Concurrency (CAS)
