Feat: Add new identity api to DPU agent metadata service#1324
Conversation
|
🌿 Preview your docs: https://nvidia-preview-pull-request-1324.docs.buildwithfern.com/infra-controller |
| } | ||
|
|
||
| #[inline] | ||
| pub fn requests_per_second(&self) -> u8 { |
There was a problem hiding this comment.
Nit: None of these methods need to exist... You can just make these fields pub.
| impl MachineIdentityParams { | ||
| /// Defaults match [`forge_dpu_agent_utils::machine_identity::defaults`] (agent | ||
| /// `MachineIdentityConfig` serde and this crate). | ||
| pub fn fmds_default() -> Self { |
There was a problem hiding this comment.
Nit: This should just be impl Default for MachineIdentityParams.
| } | ||
|
|
||
| /// [`TryFrom::try_from`] — preferred for protobuf (idiomatic “parse” entry point). | ||
| pub fn try_from_fmds_proto(p: &FmdsMachineIdentityConfig) -> Result<Self, String> { |
There was a problem hiding this comment.
Nit: This doesn't need to exist, callers can just use try_from directly.
| Self::try_from(p) | ||
| } | ||
|
|
||
| pub fn to_fmds_proto(&self) -> FmdsMachineIdentityConfig { |
There was a problem hiding this comment.
Nit: This should probably just be impl From<MachineIdentityParams> for FmdsMachineIdentityConfig
| burst: u8, | ||
| wait_timeout_secs: u8, | ||
| sign_timeout_secs: u8, | ||
| sign_proxy_url: Option<String>, |
There was a problem hiding this comment.
This should probably just be a url::Url instead of a String so that we don't need all this logic for trim/is_empty checks.
| /// Computes key_id as hex(sha256(public_key)). | ||
| /// Works with any public key representation (PEM, DER, etc.). | ||
| /// | ||
| /// API domain code should prefer `KeyId::from_public_key_material` in `carbide-api-model`, which |
There was a problem hiding this comment.
We should probably just delete this function and inline it into KeyId::from_public_material then... it looks like only that function is calling it, plus some tests below.
| #[async_trait] | ||
| pub trait MetaDataIdentitySigner: Send + Sync { | ||
| /// Rate-limit permit (governor) before signing or proxying. | ||
| async fn wait_identity_permit(&self) -> Result<(), tonic::Status>; |
There was a problem hiding this comment.
It seems like making this a separate function leaves us open for callers to forget to call it at some point. Could we make this an implementation detail of sign_machine_identity instead?
| /// the response. Returns [`None`] when the Forge (`SignMachineIdentity`) path should be used | ||
| /// instead. Implementations should hold any `Arc`/lock guard across the `.await` so borrows into | ||
| /// sign-proxy config remain valid (see [`forward_sign_proxy_if_ready`] / [`forward_sign_proxy_http`]). | ||
| async fn forward_sign_proxy_if_configured( |
There was a problem hiding this comment.
Similarly, it seems strange that this is its own method rather than logic that goes in the definition of sign_machine_identity... making callers remember to call this first (and not call sign_machine_identity if it succeeds) feels backwards for an abstraction like this.
| pub async fn wait_identity_rate_limit_permit( | ||
| governor: &Arc<IdentityRateLimiter>, | ||
| wait_timeout: Duration, | ||
| ) -> Result<(), tonic::Status> { |
There was a problem hiding this comment.
I don't think this should be returning a tonic::Status? It's not doing anything tonic related at all. Maybe just return Result<(), tokio::time::error::Elapsed>, and this becomes:
pub async fn wait_identity_rate_limit_permit(
governor: &Arc<IdentityRateLimiter>,
wait_timeout: Duration,
) -> Result<(), tokio::time::error::Elapsed> {
let lim = Arc::clone(governor);
tokio::time::timeout(wait_timeout, lim.until_ready()).await
}
Conversion to a tonic::Status can happen via whoever's calling this (although it looks like you're just converting it to an HTTP status anyway, so you can just move the "timed out waiting ..." string to the response there and return an error 429.)
| }) | ||
| } | ||
|
|
||
| pub fn map_grpc_status_to_http(status: &tonic::Status) -> StatusCode { |
There was a problem hiding this comment.
To be honest I'm not really sure we should be blindly mapping the carbide-api error to our own... Like for instance if we get PermissionDenied from carbide-api, that's carbide denying the DPU access to the signing request, which is not the same as giving a PermissionDenied to the caller here. Likewise Unauthenticated, etc.
To me it seems more natural for the GRPC call to be a total implementation detail, and if the backend signing request fails, just return a 503 Service Unavailable. With the exception of maybe InvalidArgument which could map to a BAD_REQUEST. (And even that is only the case if the aud field is invalid... there's lots of other InvalidArgument code paths that are actually the DPU's fault.)
kensimon
left a comment
There was a problem hiding this comment.
(Forgot to fill in the toplevel review field before I hit submit, sorry...)
I guess I don't really have any important push-back here other than a lot of nitpicks... AI still has this habit of dumping one-off functions everywhere and not really organizing code in a logical way. The MetaDataIdentitySigner trait having three different functions that callers have to remember to call in-order is the biggest head-scratcher for me, and the thing I feel "strongest" about, but otherwise if you disagree feel free to merge.
2ac3257 to
92b8b20
Compare
…ate - dpu-fmds-shared
92b8b20 to
b856be2
Compare
## Description This PR changes is scoped to the feature commit series (tickets **NVIDIA#811**, **NVIDIA#812**, **NVIDIA#1177**). --- ## Purpose Expose **machine identity** over **EC2-style IMDS HTTP paths** (e.g. `…/meta-data/identity`) with: - **Rate limiting** and **timeouts** - Optional **HTTP sign-proxy** (forward to another IMDS origin) - **Shared logic** between **embedded metadata in `forge-dpu-agent`** and **standalone `carbide-fmds`** --- ## New shared crate: `carbide-dpu-fmds-shared` (`crates/dpu-fmds-shared/`) - Centralizes `GET …/meta-data/identity` behavior: - `Metadata: true` SSRF requirement; optional `Accept: text/plain` vs JSON body - `aud` query parsing - **Governor** rate limit (`wait_identity_rate_limit_permit`) - **Forge** path: `sign_machine_identity_with_forge` - Optional **HTTP sign-proxy**: `forward_sign_proxy_if_ready` / `forward_sign_proxy_http` (forwards `Metadata` / `Accept`, returns upstream status/body) - `map_grpc_status_to_http` for error mapping - Defines trait `MetaDataIdentitySigner` (permit → optional proxy → Forge sign). --- ## `forge-dpu-agent` (`crates/agent/`) - `InstanceMetadataRouterState` / `InstanceMetadataRouterStateImpl`: implements `MetaDataIdentitySigner`; `serve_meta_data_identity` delegates to shared `machine_identity::serve_meta_data_identity`. - **Config**: new `[machine-identity]` TOML section (kebab-case, consistent with `[forge-system]`, `[period]`, etc.): - `requests-per-second`, `burst`, `wait-timeout-secs`, `sign-timeout-secs` - optional `sign-proxy-url` / `sign-proxy-tls-root-ca` - **Example** + `example_agent_config.toml` updated. - **Main loop / FMDS client**: wiring so identity behavior stays consistent with FMDS updates where applicable. - **Routing**: paths use standard IMDS shapes (e.g. `/latest/meta-data/identity`, `/2009-04-04/…`). --- ## `carbide-dpu-agent-utils` (`crates/dpu-agent-utils/`) - New `machine_identity`: **defaults** and **limits** (valid ranges) reused by host config and FMDS proto mapping. --- ## `carbide-host-support` (`crates/host-support/`) - **`MachineIdentityConfig`**: deserialize / validate `[machine-identity]`, with unit tests (e.g. TOML load). --- ## Standalone FMDS (`crates/fmds/`) - **`identity_signer.rs`**: `MetaDataIdentitySigner` for `FmdsState` (Forge + sign-proxy + governor). - **gRPC `UpdateConfig`**: optional `FmdsMachineIdentityConfig`. - **REST** metadata routes updated to use shared identity behavior where relevant. --- ## Protobuf (`crates/rpc/proto/fmds.proto`) - **`FmdsConfigUpdate`**: optional `machine_identity` (mirrors agent `[machine-identity]` for standalone FMDS). --- ## API / model / misc (review feedback from PR NVIDIA#972) - **`api-model`** / `api`: small adjustments around **tenant identity config** (handler + model). - **`secrets`**: minor `key_encryption` touch (shared types / compatibility). - **Docs**: `spiffe-svid-sdd.md` updates. --- ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) tickets **NVIDIA#811**, **NVIDIA#812**, **NVIDIA#1177** ## Breaking Changes - [ ] This PR contains breaking changes ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes [Epic: NVIDIA#261](NVIDIA#261)
Description
This PR changes is scoped to the feature commit series (tickets #811, #812, #1177).
Purpose
Expose machine identity over EC2-style IMDS HTTP paths (e.g.
…/meta-data/identity) with:forge-dpu-agentand standalonecarbide-fmdsNew shared crate:
carbide-dpu-fmds-shared(crates/dpu-fmds-shared/)GET …/meta-data/identitybehavior:Metadata: trueSSRF requirement; optionalAccept: text/plainvs JSON bodyaudquery parsingwait_identity_rate_limit_permit)sign_machine_identity_with_forgeforward_sign_proxy_if_ready/forward_sign_proxy_http(forwardsMetadata/Accept, returns upstream status/body)map_grpc_status_to_httpfor error mappingMetaDataIdentitySigner(permit → optional proxy → Forge sign).forge-dpu-agent(crates/agent/)InstanceMetadataRouterState/InstanceMetadataRouterStateImpl: implementsMetaDataIdentitySigner;serve_meta_data_identitydelegates to sharedmachine_identity::serve_meta_data_identity.[machine-identity]TOML section (kebab-case, consistent with[forge-system],[period], etc.):requests-per-second,burst,wait-timeout-secs,sign-timeout-secssign-proxy-url/sign-proxy-tls-root-caexample_agent_config.tomlupdated./latest/meta-data/identity,/2009-04-04/…).carbide-dpu-agent-utils(crates/dpu-agent-utils/)machine_identity: defaults and limits (valid ranges) reused by host config and FMDS proto mapping.carbide-host-support(crates/host-support/)MachineIdentityConfig: deserialize / validate[machine-identity], with unit tests (e.g. TOML load).Standalone FMDS (
crates/fmds/)identity_signer.rs:MetaDataIdentitySignerforFmdsState(Forge + sign-proxy + governor).UpdateConfig: optionalFmdsMachineIdentityConfig.Protobuf (
crates/rpc/proto/fmds.proto)FmdsConfigUpdate: optionalmachine_identity(mirrors agent[machine-identity]for standalone FMDS).API / model / misc (review feedback from PR #972)
api-model/api: small adjustments around tenant identity config (handler + model).secrets: minorkey_encryptiontouch (shared types / compatibility).spiffe-svid-sdd.mdupdates.Type of Change
Related Issues (Optional)
tickets #811, #812, #1177
Breaking Changes
Testing
Additional Notes
Epic: #261