Skip to content

fix(e2e): stabilize local Docker smoke test#1935

Open
elezar wants to merge 1 commit into
mainfrom
fix-local-e2e
Open

fix(e2e): stabilize local Docker smoke test#1935
elezar wants to merge 1 commit into
mainfrom
fix-local-e2e

Conversation

@elezar

@elezar elezar commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Fix local Docker e2e by honoring the configured supervisor image and isolating host OpenSSH from Nix/devenv linker variables while preserving the proxy child environment.

Related Issue

N/A

Changes

  • Honor the configured Docker supervisor image when running local Docker e2e.
  • Scrub host linker environment variables before launching external ssh.
  • Restore the original linker environment for the openshell ssh-proxy child command.
  • Add focused CLI tests for the split SSH/proxy environment behavior.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Additional validation run:

  • /home/elezar/.cargo/bin/cargo fmt --check
  • /home/elezar/.cargo/bin/cargo test -p openshell-cli linker_environment
  • /home/elezar/.cargo/bin/cargo clippy -p openshell-cli --all-targets -- -D warnings
  • mise run e2e:docker

mise run pre-commit was attempted locally but did not complete: helm lint reported missing chart dependency metadata for postgresql, and three openshell-supervisor-process tests failed because this shell could not resolve the current user/group. The clippy failure from this change was fixed and rechecked.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@elezar elezar requested review from a team, derekwaynecarr and mrunalp as code owners June 16, 2026 19:15
@github-actions

Copy link
Copy Markdown

@elezar elezar added test:e2e Requires end-to-end coverage test:e2e-gpu Requires GPU end-to-end coverage labels Jun 16, 2026
@github-actions

Copy link
Copy Markdown

Label test:e2e applied for 5b97b26. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@github-actions

Copy link
Copy Markdown

Label test:e2e-gpu applied for 5b97b26. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@elezar

elezar commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid Docker e2e and SSH environment stabilization work. It keeps local Docker e2e aligned with the configured supervisor image and prevents host linker variables from breaking the external OpenSSH process while preserving the proxy child environment.
Head SHA: 5b97b26a7af0618d2b86569e01a9a94fcccd6406

Review findings:

  • No blocking findings remain.
  • Nonblocking suggestion from the independent review: the internal DockerComputeConfig.supervisor_image comment is stale because configured supervisor_image now takes precedence over sibling/local builds when supervisor_bin is unset.

Docs: relevant Docker supervisor image behavior is reflected in the Docker driver README and docs/reference/gateway-config.mdx; no docs navigation change appears necessary.
E2E: test:e2e and test:e2e-gpu are already applied because this affects Docker e2e/supervisor behavior.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, and DCO are passing. OpenShell / E2E and OpenShell / GPU E2E are pending for this head.

Next state: gator:watch-pipeline

@elezar elezar added the gator:watch-pipeline Gator is monitoring PR CI/CD status label Jun 16, 2026
@elezar

elezar commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 5b97b26a7af0618d2b86569e01a9a94fcccd6406 after the required E2E gates completed.

Disposition: not resolved.

Remaining items:

  • Still blocking: OpenShell / E2E failed. The e2e / E2E (rust-docker) and e2e / E2E (rust-podman-rootless) jobs both failed tests/forward_proxy_graphql_l7.rs::graphql_l7_enforces_allow_and_deny_rules_on_forward_and_connect_paths.
  • Docker observed forward_query_allowed = 403 where the test expected 200.
  • Rootless Podman observed forward_get_query_allowed = 403 where the test expected 200.
  • OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for this head.

Next state: gator:in-review

@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 16, 2026
@elezar elezar requested a review from maxamillion as a code owner June 17, 2026 07:05
@elezar elezar enabled auto-merge (squash) June 17, 2026 12:19
pimlock
pimlock previously approved these changes Jun 17, 2026
Comment on lines +2964 to +2969
// Tier 2: explicit supervisor_image in [openshell.drivers.docker].
// A configured image should be the source of truth even when a local
// developer build is present under target/.
if let Some(image) = docker_config.supervisor_image.clone() {
return extract_supervisor_bin_from_image(docker, &image).await;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a comment here:

/// Optional override for the image the gateway pulls to extract the
/// Linux `openshell-sandbox` binary when no explicit binary path or
/// local build is available. Defaults to
/// `ghcr.io/nvidia/openshell/supervisor:<gateway-image-tag>`.
pub supervisor_image: Option<String>,
that should be updated after changing order in which the supervisor bin is resolved.

Could be something like:

/// Optional image used to extract the Linux `openshell-sandbox` binary.
/// Ignored when `supervisor_bin` is set. See `resolve_supervisor_bin` for
/// the full resolution order.
pub supervisor_image: Option<String>,

It could make sense to validate the config and only allow either supervisor_bin or supervisor_image, so both cannot be set. But that would be a breaking change.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good to have a test for this new resolution order.

@elezar elezar Jun 23, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could make sense to validate the config and only allow either supervisor_bin or supervisor_image, so both cannot be set. But that would be a breaking change.

If we think that will be clearer, then breaking now probably makes sense. Having these settings be exclusive does reduce the questions around resolution precedence.

I don't want to do that in this PR though. If we think that's a direction we want to go, then let's make that change in a separate PR so that the possibly breaking change is more clearly surfaced.

Comment thread crates/openshell-cli/src/ssh.rs Outdated
}

#[test]
#[allow(unsafe_code)] // Test-only: env vars require unsafe in Rust 2024.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use temp-env here (cli crate already has a dep on it) to remove the unsafe and resetting the env var.

I think it could also remove the need for the lock, based on their docs.

Avoid interference when running concurrently

@elezar elezar added gator:blocked Gator is blocked by process or repository gates and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jun 23, 2026
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts with the base branch. GitHub reports mergeable_state=dirty for head 242ace258f3cb2eda8dd8d4150db6ba600da10e7.

Next action: @elezar, please rebase or merge the base branch and resolve the conflicts, then push an updated head so gator can re-check review state and CI.

@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head af7a19c85f5fdfb770f4fe851e3fd5a3f8bae5e2 after the branch update since the 2026-06-23 10:10 UTC merge-conflict blocker.

Disposition: partially resolved.

Remaining items:

  • Resolved: GitHub now reports mergeable=true and mergeable_state=blocked, so the previous dirty merge-conflict blocker is cleared.
  • Still blocking: OpenShell / Branch Checks failed for this head. Both Rust lint jobs fail on crates/openshell-driver-docker/src/lib.rs:2979 with clippy::collapsible_if.
  • OpenShell / E2E and OpenShell / GPU E2E are still pending for this head; gator will re-check after the lint failure is fixed and CI completes.

Next state: gator:in-review

@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback and removed gator:blocked Gator is blocked by process or repository gates labels Jun 23, 2026
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: maintainer-authored, project-valid local Docker e2e and SSH environment stabilization work.
Review: no blocking review findings remain. The earlier CI blocker on head af7a19c85f5fdfb770f4fe851e3fd5a3f8bae5e2 is resolved.
Docs: relevant Docker supervisor image behavior is covered in the Docker driver README and docs/reference/gateway-config.mdx; no docs navigation change appears necessary.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and required CI gate statuses are passing for head af7a19c85f5fdfb770f4fe851e3fd5a3f8bae5e2.
E2E: test:e2e and test:e2e-gpu are applied; OpenShell / E2E and OpenShell / GPU E2E are passing.

Human maintainer approval or merge decision is now required.

@elezar elezar removed the gator:in-review Gator is reviewing or awaiting PR review feedback label Jun 23, 2026
@elezar elezar added the gator:approval-needed Gator completed review; maintainer approval needed label Jun 23, 2026
@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback and removed gator:approval-needed Gator completed review; maintainer approval needed labels Jun 23, 2026
@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 3ed9a9796ac83be84c772e8e071da15d41a3b3c5 after the head changed from the prior approval handoff, and ran a bounded independent re-review of the current diff.

Disposition: partially resolved, but not ready to leave review.

Remaining items:

  • Still pending: OpenShell / Branch Checks, OpenShell / E2E, and OpenShell / GPU E2E are queued for this head, so the prior approval handoff for af7a19c85f5fdfb770f4fe851e3fd5a3f8bae5e2 is stale.
  • Review warning: e2e/with-docker-gateway.sh now builds the default local openshell/supervisor:dev image without pinning the build platform to the Docker daemon architecture. If an ambient DOCKER_PLATFORM is set to a different platform, local Docker e2e can build/extract the wrong-architecture supervisor and fail later with an exec-format error. Please derive the daemon architecture and pass DOCKER_PLATFORM=linux/<daemon-arch> into the local supervisor image build, or explain why the inherited platform cannot be mismatched in this path.
  • Review warning: the new crates/openshell-cli/src/ssh.rs tests that call temp_env::with_vars should take TEST_ENV_LOCK, matching the module's existing process-environment tests, to avoid parallel test flakes.

Docs: relevant Docker supervisor image behavior remains covered in the Docker driver README and docs/reference/gateway-config.mdx; no docs navigation change appears necessary.

Next state: gator:in-review

@elezar

elezar commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After CI Update

I re-evaluated latest head 3ed9a9796ac83be84c772e8e071da15d41a3b3c5 after the required gates completed.

Disposition: partially resolved, but not ready to leave review.

Remaining items:

  • Resolved: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, OpenShell / E2E, and OpenShell / GPU E2E are now passing for this head.
  • Still unresolved: the review warning in e2e/with-docker-gateway.sh about pinning the local supervisor image build platform to the Docker daemon architecture, or explaining why inherited DOCKER_PLATFORM cannot be mismatched in this path.
  • Still unresolved: the crates/openshell-cli/src/ssh.rs tests that call temp_env::with_vars should take TEST_ENV_LOCK, matching the module's existing process-environment tests, to avoid parallel test flakes.

Next state: gator:in-review

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e Requires end-to-end coverage test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants