feat: add first-class sandbox-to-sandbox network policy primitive #1993

Description

@ajtran

Problem Statement

Multi-agent systems that run each agent component (reasoning, orchestration, action execution) in its own isolated sandbox need to communicate directly between sandboxes. Today, OpenShell's default-deny network policy blocks all inter-sandbox egress, and there is no first-class policy primitive to express sandbox-to-sandbox intent. Operators must manually determine ephemeral pod IPs and write raw CIDR rules — an impractical requirement given that pod IPs change on rescheduling.

Technical Context

The SSRF enforcement layer in the proxy already supports RFC1918 destinations via allowed_ips on a NetworkEndpointDef. The gap is not in enforcement — it is in the policy authoring surface and the gateway's failure to automatically resolve and inject a peer sandbox's current pod IP at policy-load time. An operator who manually writes the correct allowed_ips CIDR today can already reach another sandbox pod through the proxy. What does not exist is the abstraction that makes this declarative and durable across pod rescheduling.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Policy engine | crates/openshell-policy/src/lib.rs | NetworkEndpointDef schema, where the target_sandbox_id field would be added |
| Proxy SSRF enforcement | crates/openshell-supervisor-network/src/proxy.rs | Evaluates OPA decision + resolves IPs; already handles allowed_ips path |
| Policy composition | crates/openshell-policy/src/compose.rs | Where gateway resolves sandbox pod IP and injects into policy before sending to supervisor |
| OPA input construction | crates/openshell-supervisor-network/src/opa.rs | Builds the input document for per-request evaluation; allowed_ips is set here |
| nftables namespace rules | crates/openshell-supervisor-process/src/netns/nft_ruleset.rs | Applies inside sandbox network namespace; does not need to change |

Technical Investigation

Architecture Overview

Sandbox egress is enforced at two independent layers:

Layer 1 — OPA + SSRF (primary, per-request, in the proxy process):
handle_tcp_connection in proxy.rs (line ~700) evaluates each CONNECT request through OPA, then resolves the destination through one of four SSRF paths depending on what the matching policy rule declares. The relevant path for RFC1918 is resolve_and_check_allowed_ips (line 2574): if the policy rule has allowed_ips populated, the resolved IP must fall within those CIDRs. Without allowed_ips, the fallback resolve_and_reject_internal (line 2556) calls is_internal_ip, which blocks all RFC1918 space — this is where sandbox-to-sandbox traffic dies today.
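
The split between the two SSRF paths described above can be sketched roughly as follows. This is a simplified model, not the code in proxy.rs: the Cidr type, the SsrfVerdict enum, and check_resolved_ip are hypothetical names introduced for illustration, only IPv4 is handled, and is_internal_ip here is a stand-in that may be narrower or wider than the real function in crates/openshell-core/src/net.rs.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4 CIDR (e.g. 10.42.0.0/16). Not the real OpenShell type.
#[derive(Debug, Clone, Copy)]
pub struct Cidr {
    network: Ipv4Addr,
    prefix_len: u8,
}

impl Cidr {
    /// Parses "a.b.c.d/len"; a bare address is treated as a /32.
    pub fn parse(s: &str) -> Option<Cidr> {
        let (addr, len) = match s.split_once('/') {
            Some((a, l)) => (a, l.parse::<u8>().ok()?),
            None => (s, 32),
        };
        if len > 32 {
            return None;
        }
        Some(Cidr { network: addr.parse().ok()?, prefix_len: len })
    }

    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        // /0 is handled separately because shifting a u32 by 32 bits overflows.
        let mask = if self.prefix_len == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(self.prefix_len))
        };
        (u32::from(ip) & mask) == (u32::from(self.network) & mask)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SsrfVerdict {
    Allow,
    RejectInternal,
    RejectNotInAllowedIps,
}

/// Simplified stand-in for is_internal_ip: RFC 1918, loopback, link-local.
pub fn is_internal_ip(ip: Ipv4Addr) -> bool {
    ip.is_private() || ip.is_loopback() || ip.is_link_local()
}

/// Models the choice between the two paths for an already-resolved IP.
pub fn check_resolved_ip(ip: Ipv4Addr, allowed_ips: &[Cidr]) -> SsrfVerdict {
    if allowed_ips.is_empty() {
        // Default path (cf. resolve_and_reject_internal): internal space is rejected.
        if is_internal_ip(ip) {
            SsrfVerdict::RejectInternal
        } else {
            SsrfVerdict::Allow
        }
    } else if allowed_ips.iter().any(|cidr| cidr.contains(ip)) {
        // allowed_ips path (cf. resolve_and_check_allowed_ips): IP must fall in a declared CIDR.
        SsrfVerdict::Allow
    } else {
        SsrfVerdict::RejectNotInAllowedIps
    }
}
```

Under this model, a pod IP such as 10.42.3.17 is rejected when the rule carries no allowed_ips and accepted once 10.42.3.17/32 is declared, which is the behavior the proposal relies on.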

Layer 2 — nftables (defence-in-depth, namespace-level):
nft_ruleset.rs installs an output chain inside the sandbox's private network namespace that accepts only traffic destined for the proxy's veth IP and rejects everything else. This means sandbox processes cannot bypass the proxy to dial other pods directly. This layer does not need to change — inter-sandbox traffic still flows through the proxy, which then dials the remote sandbox pod from the host network namespace.

Why nftables does not block the fix: The proxy runs in the host network namespace and is not subject to the sandbox's nftables rules. Once OPA+SSRF clears the connection, the proxy dials the destination directly and has full access to cluster routing.

Code References

| Location | Description |
| --- | --- |
| crates/openshell-policy/src/lib.rs:89 | NetworkEndpointDef struct; the target_sandbox_id field would be added here |
| crates/openshell-supervisor-network/src/proxy.rs:700 | handle_tcp_connection: main CONNECT decision tree |
| crates/openshell-supervisor-network/src/proxy.rs:2556 | resolve_and_reject_internal: rejects RFC1918 on the default path |
| crates/openshell-supervisor-network/src/proxy.rs:2574 | resolve_and_check_allowed_ips: permits RFC1918 when allowed_ips is declared |
| crates/openshell-supervisor-network/src/opa.rs:1060 | OPA input construction, where ep["allowed_ips"] is set |
| crates/openshell-supervisor-process/src/netns/nft_ruleset.rs | Namespace-level nftables rules (no change needed) |
| crates/openshell-core/src/net.rs | is_always_blocked_ip, is_internal_ip, is_always_blocked_net |

Current Behavior

When sandbox A issues a CONNECT to sandbox B's pod IP (an RFC1918 address), OPA may allow the host, but the SSRF fallback path resolve_and_reject_internal classifies the resolved IP as internal and returns a 403. The operator has no policy YAML primitive to express "allow traffic to sandbox B": they must know the current pod IP/CIDR and write a raw allowed_ips rule, which becomes stale when sandbox B is rescheduled.

What Would Need to Change

  1. Policy schema (crates/openshell-policy/src/lib.rs:89): Add optional target_sandbox_id: Option<String> to NetworkEndpointDef. This is the new declarative primitive.

  2. Proto (proto/sandbox_policy.proto): Add target_sandbox_id field to NetworkEndpoint. Wire format change — backwards compatible (optional field).

  3. Gateway policy composition (crates/openshell-policy/src/compose.rs): When composing policy to send to a supervisor, resolve the current pod IP of any target_sandbox_id endpoint via the compute driver / K8s API and inject it as allowed_ips. The supervisor's existing config polling loop already re-fetches policy on changes, so rescheduled sandboxes will get updated IPs within one poll interval.

  4. OPA input construction (crates/openshell-supervisor-network/src/opa.rs:1060): No change needed — allowed_ips already flows through this path.

  5. CLI/SDK: Add UX for authoring target_sandbox_id rules (policy YAML authoring, generate-sandbox-policy skill support).
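
Steps 1 and 3 above could look roughly like the sketch below. Everything here is an assumption made for illustration: the struct is a trimmed-down mirror of NetworkEndpointDef (the real one has more fields), PodIpResolver is a hypothetical stand-in for the compute driver / K8s API lookup, and inject_peer_ips is not an existing function in compose.rs. The sketch also assumes composition should fail closed when a target sandbox cannot be resolved; the issue does not specify that behavior.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down mirror of NetworkEndpointDef.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NetworkEndpointDef {
    pub host: String,
    pub port: u16,
    pub allowed_ips: Vec<String>,
    /// The proposed declarative primitive (step 1).
    pub target_sandbox_id: Option<String>,
}

/// Assumed abstraction over the compute driver / K8s API lookup.
pub trait PodIpResolver {
    fn pod_ip(&self, sandbox_id: &str) -> Option<String>;
}

/// In-memory resolver, convenient for tests.
impl PodIpResolver for HashMap<String, String> {
    fn pod_ip(&self, sandbox_id: &str) -> Option<String> {
        self.get(sandbox_id).cloned()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ComposeError {
    UnknownSandbox(String),
}

/// Gateway-side enrichment (step 3): resolve each target_sandbox_id to the
/// peer's current pod IP and inject it as a /32 entry in allowed_ips.
pub fn inject_peer_ips<R: PodIpResolver>(
    endpoints: &[NetworkEndpointDef],
    resolver: &R,
) -> Result<Vec<NetworkEndpointDef>, ComposeError> {
    let mut composed = Vec::with_capacity(endpoints.len());
    for ep in endpoints {
        let mut out = ep.clone();
        if let Some(id) = &ep.target_sandbox_id {
            // Assumption: an unresolvable peer is an error rather than a silent skip.
            let ip = resolver
                .pod_ip(id)
                .ok_or_else(|| ComposeError::UnknownSandbox(id.clone()))?;
            let cidr = format!("{}/32", ip);
            if !out.allowed_ips.contains(&cidr) {
                out.allowed_ips.push(cidr);
            }
        }
        composed.push(out);
    }
    Ok(composed)
}
```

Because the output is an ordinary allowed_ips list, the supervisor and proxy would see nothing new, which matches the claim in step 4 that OPA input construction needs no change.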

Alternative Approaches Considered

  • Manual allowed_ips today: Already works. Unblocks single-pod setups immediately but is not operationally durable across rescheduling. Document it as a workaround while this is built.
  • DNS-based resolution (no pod IP injection): Use the sandbox's Kubernetes Service DNS name as the host and rely on the proxy's DNS resolution. This avoids pod IP tracking but requires each sandbox to have a stable Service. Viable for long-lived sandboxes; less useful for ephemeral ones.

Patterns to Follow

  • NetworkEndpointDef field additions follow the existing optional-field pattern (allowed_ips, protocols).
  • Policy composition already has a pattern for gateway-side enrichment before sending to supervisor — follow the same pipeline.
  • The BLOCKED_CONTROL_PLANE_PORTS list (line 2608 in proxy.rs) must remain enforced even on the target_sandbox_id path.

Proposed Approach

Add target_sandbox_id as an optional field on NetworkEndpointDef in policy YAML and proto. At policy composition time, the gateway resolves the target sandbox's current pod IP via the compute driver and injects it as allowed_ips. The proxy's existing resolve_and_check_allowed_ips SSRF path handles enforcement with no changes. The supervisor's config polling loop ensures pod IP updates propagate when a sandbox is rescheduled.

Scope Assessment

  • Complexity: Medium
  • Confidence: High — the enforcement layer already works; this is plumbing and schema work
  • Estimated files to change: ~6–8
  • Issue type: feat

Risks & Open Questions

  • Pod IP staleness window: Between a sandbox rescheduling and the next config poll, the injected allowed_ips is stale. The poll interval determines the outage window. Should the gateway proactively push a policy update when it detects a sandbox pod IP change?
  • Circular dependency: If sandbox A's policy depends on sandbox B's IP, and sandbox B depends on sandbox A's IP, both need to be resolved before either can start. Is this a real scenario and does it need a resolution order?
  • is_always_blocked_ip enforcement: Loopback and link-local remain blocked even via allowed_ips. Confirm this is correct for inter-sandbox traffic (it should be — those addresses are never a sandbox pod IP).
  • BLOCKED_CONTROL_PLANE_PORTS: The control-plane port blocklist must remain enforced on the target_sandbox_id path. Confirm no sandbox-to-sandbox use case requires etcd/kubelet ports.
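
The last two points amount to checks that must win over any allowed_ips match. A minimal sketch of that ordering follows, with loud caveats: the port values are placeholders (the conventional etcd and kubelet defaults), not the contents of the real BLOCKED_CONTROL_PLANE_PORTS list, and is_always_blocked_ip is reduced to the loopback and link-local cases named above. Precheck and precheck are hypothetical names.

```rust
use std::net::Ipv4Addr;

/// Placeholder values only. The real list lives in proxy.rs and is not
/// reproduced in this issue. 2379/2380 are the etcd defaults and 10250 is
/// the kubelet default.
pub const BLOCKED_CONTROL_PLANE_PORTS: &[u16] = &[2379, 2380, 10250];

/// Simplified stand-in: loopback and link-local are never valid destinations.
pub fn is_always_blocked_ip(ip: Ipv4Addr) -> bool {
    ip.is_loopback() || ip.is_link_local()
}

#[derive(Debug, PartialEq, Eq)]
pub enum Precheck {
    Pass,
    BlockedIp,
    BlockedPort,
}

/// Runs before any allowed_ips evaluation, so an injected peer IP can never
/// unlock an always-blocked address or a control-plane port.
pub fn precheck(ip: Ipv4Addr, port: u16) -> Precheck {
    if is_always_blocked_ip(ip) {
        return Precheck::BlockedIp;
    }
    if BLOCKED_CONTROL_PLANE_PORTS.contains(&port) {
        return Precheck::BlockedPort;
    }
    Precheck::Pass
}
```

With this ordering, a peer pod IP on an ordinary service port passes the precheck and proceeds to the allowed_ips containment test, while the same IP on a kubelet port is refused regardless of policy.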

Test Considerations

  • Unit tests for the new target_sandbox_id → allowed_ips injection in policy composition
  • Unit tests in proxy.rs mirroring the existing resolve_and_check_allowed_ips coverage (lines 3471–3600) for the sandbox-targeted path
  • E2e test: sandbox A reaches a service on sandbox B after policy with target_sandbox_id is applied (requires test:e2e coverage)
  • Test the staleness scenario: verify that after sandbox B is rescheduled, the next config poll restores connectivity
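
The staleness scenario in the last bullet can be modeled without a cluster. The sketch below is a toy: SupervisorState, poll, and permits are hypothetical names, and an exact string match stands in for the proxy's CIDR containment check. It only illustrates the sequence a real test would assert: connectivity holds, breaks on rescheduling, and returns after one poll.

```rust
/// Hypothetical model of a supervisor holding the last policy it fetched.
#[derive(Debug)]
pub struct SupervisorState {
    allowed_ips: Vec<String>,
}

impl SupervisorState {
    pub fn new(allowed_ips: Vec<String>) -> Self {
        Self { allowed_ips }
    }

    /// Models one config poll: adopt the gateway's freshly composed
    /// allowed_ips. Returns true when the policy actually changed.
    pub fn poll(&mut self, composed: &[String]) -> bool {
        if self.allowed_ips.as_slice() == composed {
            return false;
        }
        self.allowed_ips = composed.to_vec();
        true
    }

    /// Exact-match check, standing in for CIDR containment in the proxy.
    pub fn permits(&self, pod_ip: &str) -> bool {
        self.allowed_ips.iter().any(|ip| ip == pod_ip)
    }
}
```

A test built on this model would start with sandbox B at one address, move it to another, assert that permits returns false for the new address until poll is called with the recomposed list, and then assert that it returns true.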

Created by spike investigation. Use build-from-issue to plan and implement.

Labels: state:triage-needed (opened without agent diagnostics and needs triage)