6 changes: 5 additions & 1 deletion .github/workflows/e2e-gpu-test.yaml
Original file line number Diff line number Diff line change
@@ -48,11 +48,12 @@ jobs:
OPENSHELL_REGISTRY_NAMESPACE: nvidia/openshell
OPENSHELL_REGISTRY_USERNAME: ${{ github.actor }}
OPENSHELL_REGISTRY_PASSWORD: ${{ secrets.GITHUB_TOKEN }}
CONTAINER_ENGINE: docker
OPENSHELL_E2E_DOCKER_GPU: "1"
# NVIDIA-managed Ubuntu base used as the GPU probe target: it has the
# filesystem layout CDI injection expects (ldconfig, populated /usr/bin)
# which the distroless gateway runtime lacks. Consumed by the prereq
# probe below and by the e2e tests in e2e/rust/tests/gpu_device_selection.rs.
# probe below and by the e2e tests in e2e/rust/tests/gpu/device_selection.rs.
OPENSHELL_E2E_GPU_PROBE_IMAGE: "nvcr.io/nvidia/base/ubuntu:noble-20251013"
steps:
- uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
@@ -65,5 +66,8 @@ jobs:
docker info --format '{{json .CDISpecDirs}}'
docker run --rm --device nvidia.com/gpu=all "${OPENSHELL_E2E_GPU_PROBE_IMAGE}" nvidia-smi -L

- name: Build GPU workload images
run: mise run --no-deps --skip-deps e2e:workloads:build

- name: Run tests
run: mise run --no-deps --skip-deps e2e:docker:gpu
77 changes: 59 additions & 18 deletions e2e/gpu/README.md
@@ -3,7 +3,8 @@

# GPU workload images

This directory defines workload test images for OpenShell GPU validation.
This directory defines workload test images currently used by the OpenShell GPU
e2e suite.

## Contract

@@ -22,11 +23,10 @@ Each workload image must:
command explicitly.

OpenShell sandbox creation replaces the image entrypoint with the supervisor and
does not run the OCI image `CMD`. When these images are used through OpenShell,
the workload command from each manifest entry must be passed explicitly.
does not run the OCI image `CMD`. E2e tests that use these images through
OpenShell run the command from each manifest entry explicitly.

The image build task writes a local workload manifest. Each workload entry
carries:
The test harness is manifest-driven. Each workload entry carries:

- `name`
- `image`
@@ -61,18 +61,17 @@ The build task uses `tasks/scripts/container-engine.sh`. Set
`CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to choose an engine
explicitly. When unset, the helper uses its existing auto-detection behavior.

Local tags use the current commit short SHA plus a short fingerprint of the
external build inputs. Dirty local trees append `-dirty`. Set
`OPENSHELL_GPU_WORKLOAD_IMAGE_TAG=<tag>` to override the tag.
Local tags use a short SHA-256 fingerprint of the selected workload contexts
and external build inputs. Set `OPENSHELL_GPU_WORKLOAD_IMAGE_TAG=<tag>` to
override the tag.

The task writes the latest build refs to:

```text
e2e/gpu/images/.build/latest.env
```

The task also writes a local workload manifest for downstream tooling and
future workload-runner integration:
The task also writes the local workload manifest used by the Rust e2e runner:

```text
e2e/gpu/images/.build/workloads.yaml
@@ -90,8 +89,7 @@ source e2e/gpu/images/.build/latest.env
```

That env file exports `OPENSHELL_E2E_WORKLOAD_MANIFEST` pointing at the local
manifest. The current checked-in Rust GPU e2e target does not consume this
manifest yet. The per-image refs remain available as a convenience for direct
manifest. The per-image refs remain available as a convenience for direct
container-engine validation.

## Direct Validation
@@ -124,14 +122,57 @@ where Podman CDI is configured.
Direct container-engine validation catches image, CDI, CUDA, and host GPU setup
issues before OpenShell sandbox behavior is involved.

## OpenShell GPU E2E
## Manifest-Driven Validation

The current Rust GPU validation target is:
The Rust GPU validation target is:

```shell
mise run e2e:gpu
cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-docker-gpu --test gpu -- --nocapture
```

That target runs `gpu_device_selection`. It validates GPU request and device
selection behavior against a Docker-backed gateway. It does not run the
workload manifest generated by `mise run e2e:workloads:build`.
The workload validation path reads:

```text
OPENSHELL_E2E_WORKLOAD_MANIFEST
```

When that variable is unset, the runner uses the default local manifest path:

```text
e2e/gpu/images/.build/workloads.yaml
```

If neither path exists, the workload validation test prints a clear skip
message telling you to run:

```shell
mise run e2e:workloads:build
```

or to set `OPENSHELL_E2E_WORKLOAD_MANIFEST` to an external manifest.
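
The lookup order described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `resolve_manifest` and `resolve_manifest_from_env` are hypothetical helper names, and treating an empty variable as unset is an assumption borrowed from the `-z` check in `e2e/rust/e2e-docker.sh`. The sketch also assumes that an explicit override is never silently replaced by the default path.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Default manifest location, relative to the repository root.
const DEFAULT_MANIFEST: &str = "e2e/gpu/images/.build/workloads.yaml";

/// Hypothetical helper: choose the manifest path, preferring an explicit
/// override. An empty override is treated as unset (assumption). Returns
/// `None` when the chosen path is not a file, so the caller can log a skip.
fn resolve_manifest(override_path: Option<&str>, repo_root: &Path) -> Option<PathBuf> {
    let candidate = match override_path {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => repo_root.join(DEFAULT_MANIFEST),
    };
    candidate.is_file().then_some(candidate)
}

/// Hypothetical wrapper that reads the override from the environment.
fn resolve_manifest_from_env(repo_root: &Path) -> Option<PathBuf> {
    let override_path = env::var("OPENSHELL_E2E_WORKLOAD_MANIFEST").ok();
    resolve_manifest(override_path.as_deref(), repo_root)
}
```

A caller would print the skip message and return early when the result is `None`, rather than failing the test.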

Each manifest entry supplies the sandbox image and command. OpenShell runs that
command through `openshell sandbox create --gpu --from <image> -- <command>`.
The test runner iterates all GPU-tagged workload entries and enforces each
entry's declared expectation:

- `expect: pass` requires `OPENSHELL_GPU_WORKLOAD_SUCCESS`
- `expect: fail` requires `OPENSHELL_GPU_WORKLOAD_FAILURE`
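
As a rough sketch of the two rules above, the per-entry check might look like the following. The `Workload` struct, the `sandbox_create_args` and `check_expectation` helpers, and the choice to match the marker as a substring of captured output are all assumptions made for illustration; the real runner may also inspect exit status or parse the manifest differently.

```rust
/// Declared outcome for a workload entry (`expect: pass` or `expect: fail`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Expect {
    Pass,
    Fail,
}

/// Hypothetical in-memory form of one manifest entry (only the fields used here).
struct Workload {
    name: String,
    image: String,
    command: Vec<String>,
    expect: Expect,
}

const SUCCESS_MARKER: &str = "OPENSHELL_GPU_WORKLOAD_SUCCESS";
const FAILURE_MARKER: &str = "OPENSHELL_GPU_WORKLOAD_FAILURE";

/// Build the argument list for
/// `openshell sandbox create --gpu --from <image> -- <command>`.
fn sandbox_create_args(w: &Workload) -> Vec<String> {
    let mut args: Vec<String> = ["sandbox", "create", "--gpu", "--from"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    args.push(w.image.clone());
    args.push("--".to_string());
    args.extend(w.command.iter().cloned());
    args
}

/// Require the marker that matches the entry's declared expectation.
/// Assumption: the marker appears somewhere in the captured output.
fn check_expectation(w: &Workload, output: &str) -> Result<(), String> {
    let required = match w.expect {
        Expect::Pass => SUCCESS_MARKER,
        Expect::Fail => FAILURE_MARKER,
    };
    if output.contains(required) {
        Ok(())
    } else {
        Err(format!("workload {}: marker {} not found in output", w.name, required))
    }
}
```

Keeping the expectation in the manifest entry lets one loop cover both passing and deliberately failing workloads without per-workload test functions.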

The current local manifest includes three workloads:

- `smoke-pass` expected to pass
- `smoke-fail` expected to fail
- `cuda-basic` expected to pass

## External Manifests

External workload catalogs can use the same schema. Point the runner at one
with:

```shell
export OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml
```

That lets alternate workload manifests use the same test runner without
introducing per-workload env vars.
58 changes: 58 additions & 0 deletions e2e/rust/Cargo.lock


9 changes: 7 additions & 2 deletions e2e/rust/Cargo.toml
@@ -98,8 +98,8 @@ path = "tests/forward_proxy_graphql_l7.rs"
required-features = ["e2e-host-gateway"]

[[test]]
name = "gpu_device_selection"
path = "tests/gpu_device_selection.rs"
name = "gpu"
path = "tests/gpu.rs"
required-features = ["e2e-gpu"]

[dependencies]
@@ -117,7 +117,12 @@ sha1 = "0.10"
sha2 = "0.10"
hex = "0.4"
rand = "0.9"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
serde_yaml = "0.9"

[dev-dependencies]
serial_test = "3"

[lints.rust]
unsafe_code = "warn"
5 changes: 5 additions & 0 deletions e2e/rust/e2e-docker.sh
@@ -11,9 +11,14 @@ set -euo pipefail
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
E2E_TEST="${OPENSHELL_E2E_DOCKER_TEST:-smoke}"
E2E_FEATURES="${OPENSHELL_E2E_DOCKER_FEATURES:-e2e,e2e-docker}"
DEFAULT_WORKLOAD_MANIFEST="${ROOT}/e2e/gpu/images/.build/workloads.yaml"

cargo build -p openshell-cli

if [ "${E2E_TEST}" = "gpu" ] && [ -z "${OPENSHELL_E2E_WORKLOAD_MANIFEST:-}" ] && [ ! -f "${DEFAULT_WORKLOAD_MANIFEST}" ]; then
echo "note: running GPU e2e without a workload manifest; workload validation will log an explicit skip. Build one with 'mise run e2e:workloads:build' or set OPENSHELL_E2E_WORKLOAD_MANIFEST."
fi

exec "${ROOT}/e2e/with-docker-gateway.sh" \
cargo test --manifest-path "${ROOT}/e2e/rust/Cargo.toml" \
--features "${E2E_FEATURES}" \
12 changes: 12 additions & 0 deletions e2e/rust/tests/gpu.rs
@@ -0,0 +1,12 @@
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

#![cfg(feature = "e2e-gpu")]

// GPU-consuming e2e tests use #[serial(gpu)] because common single-GPU hosts
// cannot reliably provision multiple GPU sandboxes at the same time.

#[path = "gpu/device_selection.rs"]
mod device_selection;
#[path = "gpu/workloads.rs"]
mod workloads;
Original file line number Diff line number Diff line change
@@ -1,8 +1,6 @@
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

#![cfg(feature = "e2e-gpu")]

//! GPU device selection e2e tests.
//!
//! Requires a GPU-backed gateway and a sandbox image containing `nvidia-smi`.
@@ -15,6 +13,7 @@ use openshell_e2e::harness::container::{ContainerEngine, e2e_driver};
use openshell_e2e::harness::output::strip_ansi;
use openshell_e2e::harness::sandbox::SandboxGuard;
use serde_json::{Map, Value};
use serial_test::serial;
use tokio::time::timeout;

const SANDBOX_CREATE_TIMEOUT: Duration = Duration::from_secs(600);
@@ -340,6 +339,7 @@ async fn sandbox_create_output(args: &[&str]) -> String {
}

#[tokio::test]
#[serial(gpu)]
async fn gpu_request_without_device_matches_plain_default_gpu_container() {
let device_ids = discovered_cdi_gpu_device_ids();
let Some(default_gpu_device) =
@@ -359,6 +359,7 @@ async fn gpu_request_without_device_matches_plain_default_gpu_container() {
}

#[tokio::test]
#[serial(gpu)]
async fn gpu_request_for_each_discovered_device_matches_plain_container() {
let device_ids: Vec<_> = discovered_cdi_gpu_device_ids()
.into_iter()
@@ -383,6 +384,7 @@ async fn gpu_request_for_each_discovered_device_matches_plain_container() {
}

#[tokio::test]
#[serial(gpu)]
async fn gpu_all_device_request_matches_plain_all_gpu_container() {
if !has_cdi_gpu_device(CDI_GPU_DEVICE_ALL) {
eprintln!(
@@ -401,6 +403,7 @@ async fn gpu_all_device_request_matches_plain_all_gpu_container() {
}

#[tokio::test]
#[serial(gpu)]
async fn gpu_invalid_device_request_fails() {
let driver_config_json = cdi_devices_driver_config_json(&["nvidia.com/gpu=invalid"]);
let args = vec![