rfc-0009: supervisor middleware#1738
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Add a Contract versioning subsection (openshell.middleware.v1, ProxyMiddleware service, GetCapabilities version handshake) and registration reload/removal semantics. Rename the gRPC service to ProxyMiddleware, defer max_body_bytes and timeout_ms to implementation, and record the v1alpha1-vs-v1 open question.
4810be8 to
9730e6f
Compare
| message Outcome { | ||
| Decision decision = 1; // ALLOW or DENY | ||
| string deny_reason = 2; // safe, machine-readable | ||
| map<string, string> set_headers = 3; // subject to an OpenShell allow-list |
There was a problem hiding this comment.
Would this be the existing allow-list? Do we need to worry about making sure credential placeholders (which might deprecate eventually anyway) can't be re-written?
There was a problem hiding this comment.
Good question. Maybe for the initial version it would make sense to have these as append only and disallow any special headers?
We could also drop this, I don't have a clear use-case for this yet, other than this is part of the request. We could drop it and wait for a feature request.
If we were to deprecate placeholders, we'd replace it with injecting creds based on config? E.g. we'd modify the request and add Authorization: key header (if that's how auth works for that provider).
| string deny_reason = 2; // safe, machine-readable | ||
| map<string, string> set_headers = 3; // subject to an OpenShell allow-list | ||
| map<string, string> metadata = 4; // namespaced, no raw values | ||
| repeated Finding finding = 5; // labels, counts, confidence |
There was a problem hiding this comment.
I don't see a definition for this so just curious what that might look like.
There was a problem hiding this comment.
Good catch. The idea here was to have something that could generate finding OCSF events.
We already emit it in a few places.
For the middleware contract, it could be something like this:
{
"type": "pii.email",
"label": "Email address",
"count": 3,
"confidence": "...",
"severity": "..."
}
And based on that, we'd emit event from the supervisor.
|
|
||
| The interface is gRPC. The hot-path RPC is declared as a bidirectional stream, but v1 exchanges exactly one `ProcessRequest` and one `ProcessResponse` over it: the supervisor buffers the bounded body and the middleware replies once. Declaring it as a stream now is deliberate, because gRPC method cardinality cannot change compatibly. It lets a later version chunk large payloads without altering the method signature. Possible extensions (chunked streaming, additional hooks, semantic context) are collected in the [protocol-extensions appendix](appendices/protocol-extensions.md), including what streaming does and does not buy. The baseline middleware ships in the supervisor and is served in-process over the same gRPC contract, with no network hop. The exact field set is settled during implementation; the sketch above is the contract shape this RFC asks reviewers to evaluate. | ||
|
|
||
| The `actor` process is the same identity OpenShell already resolves on the egress path - the binary, pid, and ancestor chain it uses for binary-scoped network policy and OCSF audit. It is resolved when the connection is established, so it is per-connection rather than strictly per-request: over a reused or pooled connection it identifies the process that opened the connection, which a middleware should not over-trust for per-request attribution. |
There was a problem hiding this comment.
Given the direction we're going on explicitly setting modes/modules for the supervisor, should we be able to communicate that "no actor data" is available if we're in a proxy-only situation?
There was a problem hiding this comment.
Good point. I included the actor process data for complete context about the request (e.g. could be helpful for logging), but we could also drop it at this point, or ensure it's documented as optional.
Maybe drop for now?
|
|
||
| Implementation-wise, the hook is a new supervisor-side Rust enforcement stage selected by policy data, not a Rego rule. Existing Rego evaluation remains the metadata gate: L4 policy admits the connection and, where the endpoint declares a `protocol`, L7 policy admits the parsed request; the supervisor then buffers the bounded body, calls the configured middleware chain, and applies the returned decision or transformation. The hook does not depend on L7 Rego running - on a parsed request to an endpoint with no `protocol`, the chain still runs after L4 admits the connection. Request bodies do not become Rego input in v1. A later design may add a declarative pass over middleware findings, but v1 applies middleware outcomes directly. | ||
|
|
||
| ```yaml |
There was a problem hiding this comment.
This config makes sense in the "single policy document" world. How do we reference network policies that are defined in provider profiles? I think it would make sense as best practice to "define your middlewares in the sandbox policy" but it's likely the bulk of your network policies are coming from providers. Similarly to how we support global assignments via the requests object could we also assigned middleware to a provider type?
There was a problem hiding this comment.
This is a great point.
Yes, we could reference providers profiles in the middleware include mechanism (requests/endpoints). We would operate on the provider profile, not provider instance here, right? If the profile gets removed, we just configure? But we validate when the policy is created to ensure the profile exists.
Something like:
network_middlewares:
- name: openshell-secrets
middleware: openshell-secrets
config:
secrets: redact
endpoints: ["foo.com"]
provider_profiles: ["github"]
I would have to look closer where would it would need to be dereferenced.
Alt 1.
We could hoist the middleware config into a separate entity, similar to provider profiles. That would define endpoints/provider profiles and the config. This could actually be a future extension, the benefit here is that you configure the middleware once and you don't need to repeat the config in each sandbox's policy.
Alt 2.
We could leave it as-is for now and in the initial version you couldn't configure the middleware for a provider. We could wait and see how this is used, it's possible that we won't need provider profile to control granularity for attaching middleware? Maybe that inclusion would be configured on based on different dimensions, e.g. "inference".
I think for now I'd lean toward alt 2. CC @kirit93 I'm curious what your take is.
|
|
||
| When more than one middleware applies to a request, the order is well-defined: | ||
|
|
||
| - Middleware configs are defined once in a top-level `network_middlewares` list and attached to traffic from network policies or individual endpoints through an explicit `middleware: [...]` list. Policy-level lists apply to all endpoints in the policy; endpoint-level lists apply only to that endpoint and append after the policy-level list. These lists determine the relative order of the configs they name. |
There was a problem hiding this comment.
Should we say middleware configs must be defined in the policy and only the policy? Meaning they are not supported in provider profiles?
There was a problem hiding this comment.
One use case where I can see adding middlewares in the provider profile (pp) is for things like sigv4, if I'm creating an AWS provider, I'd like to configure sigv4 middleware. That middleware would be built-in, so it's always available and is not relying on gateway configuration having that middleware added. We could limit it to built-in ones and configuration in the endpoint and not global network_middlewares: ... section.
Another place where middleware in pp is useful is reusing the same config in multiple sandboxes. Right now it's in the policy, so I need to include it in each sandbox and if I want to update it, I need to do it manually in all sandboxes. But that doesn't feel like something providers should deal with, if we were to go this route to allow reuse/reference, it may be better to go with alt 1 from the other comment.
|
|
||
| ### The middleware contract | ||
|
|
||
| The contract has two parts: a configuration-time handshake and a request-time hook. The request-time hook runs on the *hot path* - the synchronous, per-request path through the supervisor proxy, as opposed to the control-plane path used to fetch config. Middleware only sits on this path for sandboxes whose policy configures it: a sandbox with no middleware in its policy is unaffected and pays no per-request cost. Middleware is therefore an explicit opt-in, and this change is transparent to existing usage. |
There was a problem hiding this comment.
Should we consider a health check? Or just rely on the on_error setting?
There was a problem hiding this comment.
I think relying on the on_error setting would be enough for now. The health check could be configurable and could then be used to alert on availability early, i.e. before a request is made.
If the health check failed, the next request that comes in would assume "error" and fail fast, rather than try to call the endpoint and timeout, right? So it would be an optimization/availability improvement, so that requests going out of the sandbox fail fast (or succeed if it's on_error: allow), instead of potentially waiting for timeouts, etc.
WDYT?
This comment was marked as resolved.
This comment was marked as resolved.
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
|
There is a new RFC 0010 (gateway interceptors) that is similar in some ways: #1927. We synced up with @drew and came up with a list of items to align on between this RFC 0010: #1927 (comment) |
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Note
The RFC is now open for feedback. The contract is framed as a research preview (provisional, may change incompatibly), and a handful of design Open questions remain - they are listed in the RFC.
Summary
Draft of RFC 0009 - Supervisor Middleware. Proposes hooks in the supervisor that can inspect, transform, allow/deny, and annotate supervisor-managed operations. The v1 hook is HTTP request middleware in the supervisor proxy, before OpenShell injects upstream credentials. Privacy Guard remains the motivating use case, but the naming is intentionally broader than egress so the design can grow to other supervisor hooks later.
The main
README.mdholds the proposal end to end; two appendices carry supporting detail (deployment options, protocol extensions).Tip
See here for more visual representation of this RFC. Inspired by @mrunalp (source)
Approach summary
This RFC has been developed together with AI helpers. The final RFC is distilled from lots of source material that is not included to keep it somewhat short.
Approach details
Here's how my rough approach looked like:
TRACKER.mddoc, something high level tracking the status of the distillation which allowed going through the doc step-by-step, while keeping a memory of what's in-flight, place to drop random thoughts in. (It has since been removed from the RFC now that the draft is complete.)Why I'm writing this down -> future reference and also to get feedback on this part. I'm curious how others are approaching writing docs like this, so please share if you have thoughts!
Related Issue
Changes
rfc/0009-supervisor-middleware/README.md- the full draft: Summary, Motivation, Privacy Guard use case, Non-goals, Terminology, Proposal (architecture, hook placement, contract + proto sketch, contract versioning, registration/delivery + reload semantics, policy integration, ordering, metadata, OCSF audit/logging), relationship to RFC 0010, Prior art, Implementation plan, Risks, Alternatives, and Open questions.rfc/0009-supervisor-middleware/appendices/deployment-options.md- deployment-mode decision and future options.rfc/0009-supervisor-middleware/appendices/protocol-extensions.md- future streaming operations, additional hooks, semantic context, content preview, portable capabilities, header rules, and middleware authentication.Notes
RFC numbering conflict
This RFC was renumbered from
0005to0009after the draft opened, avoiding a collision with other in-progress RFCs.Follow-up process note: reserve RFC numbers and explicitly allow non-continuous numbering. Gaps in the sequence are fine; what matters is that a number is uniquely claimed once an RFC is in progress. Reserving numbers makes it easier to talk about in-flight RFCs - e.g. "someone proposed X in RFC 4", or "RFC 4 and RFC 6 are both in progress and overlap on X" - without two documents fighting over the same identifier.
Supervisor middleware rename
The feature name moved from "sandbox egress middleware" to "supervisor middleware" so the API can remain extensible for future supervisor hook types beyond egress HTTP requests. The v1 hook is still
HttpRequest/pre_credentials.Appendices
I kept only two appendices (deployment-options, protocol-extensions). Four other detailed appendices were planned (request/response contract, policy integration, pipeline placement, failure-and-audit) but dropped: their load-bearing content was inlined into the README so the RFC stays self-contained, and the rest was implementation-spec detail that doesn't belong in a design doc.
The drafting tracker (
TRACKER.md) has been removed from the RFC now that the draft is complete.Testing
N/A - documentation only.
Checklist
Changelog
protocol(onlytls: skip/opaque TCP bypasses it); clarify the Route selection step; add the v1 in/out-of-scope list (WebSocket upgrade in, frames out); make over-cap =on_error(fail-closed) explicit; document middleware chain composability; add the built-in-onlyhttp.request.post_credentialshook; note registration ergonomics as deferred.rfc/0009-sandbox-egress-middleware; address review feedback around reserved middleware namespaces, hook naming, route-selection scope, trusted middleware endpoints, optional actor data,Findingshape, append-only headers,endpointsselectors, provider-profile scope, multitenancy, health checks, future deployment timelines, and metadata-only response-completion notifications.GetCapabilitiestoDescribe; model service-owned middleware bindings; keep v1 evaluation unary withEvaluateHttpRequest; narrowon_errortofail_open/fail_closed; update the sample evaluation/result proto shape aroundbinding_id,HttpRequestTarget,originating_process, flattened result fields, andhas_body.