Skip to content

Feat: Add new identity api to DPU agent metadata service#1324

Merged
prbinu-nvidia merged 8 commits into
NVIDIA:mainfrom
prbinu-nvidia:feat/spiffe-imds-api
May 5, 2026
Merged

Feat: Add new identity api to DPU agent metadata service#1324
prbinu-nvidia merged 8 commits into
NVIDIA:mainfrom
prbinu-nvidia:feat/spiffe-imds-api

Conversation

@prbinu-nvidia
Copy link
Copy Markdown
Contributor

Description

This PR changes is scoped to the feature commit series (tickets #811, #812, #1177).


Purpose

Expose machine identity over EC2-style IMDS HTTP paths (e.g. …/meta-data/identity) with:

  • Rate limiting and timeouts
  • Optional HTTP sign-proxy (forward to another IMDS origin)
  • Shared logic between embedded metadata in forge-dpu-agent and standalone carbide-fmds

New shared crate: carbide-dpu-fmds-shared (crates/dpu-fmds-shared/)

  • Centralizes GET …/meta-data/identity behavior:
    • Metadata: true SSRF requirement; optional Accept: text/plain vs JSON body
    • aud query parsing
    • Governor rate limit (wait_identity_rate_limit_permit)
    • Forge path: sign_machine_identity_with_forge
    • Optional HTTP sign-proxy: forward_sign_proxy_if_ready / forward_sign_proxy_http (forwards Metadata / Accept, returns upstream status/body)
    • map_grpc_status_to_http for error mapping
  • Defines trait MetaDataIdentitySigner (permit → optional proxy → Forge sign).

forge-dpu-agent (crates/agent/)

  • InstanceMetadataRouterState / InstanceMetadataRouterStateImpl: implements MetaDataIdentitySigner; serve_meta_data_identity delegates to shared machine_identity::serve_meta_data_identity.
  • Config: new [machine-identity] TOML section (kebab-case, consistent with [forge-system], [period], etc.):
    • requests-per-second, burst, wait-timeout-secs, sign-timeout-secs
    • optional sign-proxy-url / sign-proxy-tls-root-ca
  • Example + example_agent_config.toml updated.
  • Main loop / FMDS client: wiring so identity behavior stays consistent with FMDS updates where applicable.
  • Routing: paths use standard IMDS shapes (e.g. /latest/meta-data/identity, /2009-04-04/…).

carbide-dpu-agent-utils (crates/dpu-agent-utils/)

  • New machine_identity: defaults and limits (valid ranges) reused by host config and FMDS proto mapping.

carbide-host-support (crates/host-support/)

  • MachineIdentityConfig: deserialize / validate [machine-identity], with unit tests (e.g. TOML load).

Standalone FMDS (crates/fmds/)

  • identity_signer.rs: MetaDataIdentitySigner for FmdsState (Forge + sign-proxy + governor).
  • gRPC UpdateConfig: optional FmdsMachineIdentityConfig.
  • REST metadata routes updated to use shared identity behavior where relevant.

Protobuf (crates/rpc/proto/fmds.proto)

  • FmdsConfigUpdate: optional machine_identity (mirrors agent [machine-identity] for standalone FMDS).

API / model / misc (review feedback from PR #972)

  • api-model / api: small adjustments around tenant identity config (handler + model).
  • secrets: minor key_encryption touch (shared types / compatibility).
  • Docs: spiffe-svid-sdd.md updates.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

tickets #811, #812, #1177

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Epic: #261

@prbinu-nvidia prbinu-nvidia requested review from a team and Coco-Ben as code owners May 1, 2026 19:21
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

}

#[inline]
pub fn requests_per_second(&self) -> u8 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: None of these methods need to exist... You can just make these fields pub.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

impl MachineIdentityParams {
/// Defaults match [`forge_dpu_agent_utils::machine_identity::defaults`] (agent
/// `MachineIdentityConfig` serde and this crate).
pub fn fmds_default() -> Self {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This should just be impl Default for MachineIdentityParams.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

/// [`TryFrom::try_from`] — preferred for protobuf (idiomatic “parse” entry point).
pub fn try_from_fmds_proto(p: &FmdsMachineIdentityConfig) -> Result<Self, String> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This doesn't need to exist, callers can just use try_from directly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Self::try_from(p)
}

pub fn to_fmds_proto(&self) -> FmdsMachineIdentityConfig {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This should probably just be impl From<MachineIdentityParams> for FmdsMachineIdentityConfig

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

burst: u8,
wait_timeout_secs: u8,
sign_timeout_secs: u8,
sign_proxy_url: Option<String>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably just be a url::Url instead of a String so that we don't need all this logic for trim/is_empty checks.

/// Computes key_id as hex(sha256(public_key)).
/// Works with any public key representation (PEM, DER, etc.).
///
/// API domain code should prefer `KeyId::from_public_key_material` in `carbide-api-model`, which
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just delete this function and inline it into KeyId::from_public_material then... it looks like only that function is calling it, plus some tests below.

#[async_trait]
pub trait MetaDataIdentitySigner: Send + Sync {
/// Rate-limit permit (governor) before signing or proxying.
async fn wait_identity_permit(&self) -> Result<(), tonic::Status>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like making this a separate function leaves us open for callers to forget to call it at some point. Could we make this an implementation detail of sign_machine_identity instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/// the response. Returns [`None`] when the Forge (`SignMachineIdentity`) path should be used
/// instead. Implementations should hold any `Arc`/lock guard across the `.await` so borrows into
/// sign-proxy config remain valid (see [`forward_sign_proxy_if_ready`] / [`forward_sign_proxy_http`]).
async fn forward_sign_proxy_if_configured(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, it seems strange that this is its own method rather than logic that goes in the definition of sign_machine_identity... making callers remember to call this first (and not call sign_machine_identity if it succeeds) feels backwards for an abstraction like this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pub async fn wait_identity_rate_limit_permit(
governor: &Arc<IdentityRateLimiter>,
wait_timeout: Duration,
) -> Result<(), tonic::Status> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be returning a tonic::Status? It's not doing anything tonic related at all. Maybe just return Result<(), tokio::time::error::Elapsed>, and this becomes:

pub async fn wait_identity_rate_limit_permit(
    governor: &Arc<IdentityRateLimiter>,
    wait_timeout: Duration,
) -> Result<(), tokio::time::error::Elapsed> {
    let lim = Arc::clone(governor);
    tokio::time::timeout(wait_timeout, lim.until_ready()).await
}

Conversion to a tonic::Status can happen via whoever's calling this (although it looks like you're just converting it to an HTTP status anyway, so you can just move the "timed out waiting ..." string to the response there and return an error 429.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

})
}

pub fn map_grpc_status_to_http(status: &tonic::Status) -> StatusCode {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I'm not really sure we should be blindly mapping the carbide-api error to our own... Like for instance if we get PermissionDenied from carbide-api, that's carbide denying the DPU access to the signing request, which is not the same as giving a PermissionDenied to the caller here. Likewise Unauthenticated, etc.

To me it seems more natural for the GRPC call to be a total implementation detail, and if the backend signing request fails, just return a 503 Service Unavailable. With the exception of maybe InvalidArgument which could map to a BAD_REQUEST. (And even that is only the case if the aud field is invalid... there's lots of other InvalidArgument code paths that are actually the DPU's fault.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

@kensimon kensimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Forgot to fill in the toplevel review field before I hit submit, sorry...)

I guess I don't really have any important push-back here other than a lot of nitpicks... AI still has this habit of dumping one-off functions everywhere and not really organizing code in a logical way. The MetaDataIdentitySigner trait having three different functions that callers have to remember to call in-order is the biggest head-scratcher for me, and the thing I feel "strongest" about, but otherwise if you disagree feel free to merge.

@prbinu-nvidia prbinu-nvidia force-pushed the feat/spiffe-imds-api branch 2 times, most recently from 2ac3257 to 92b8b20 Compare May 5, 2026 15:02
@prbinu-nvidia prbinu-nvidia force-pushed the feat/spiffe-imds-api branch from 92b8b20 to b856be2 Compare May 5, 2026 17:52
@prbinu-nvidia prbinu-nvidia enabled auto-merge (squash) May 5, 2026 17:52
@prbinu-nvidia prbinu-nvidia merged commit 3c19ec9 into NVIDIA:main May 5, 2026
45 checks passed
inf0rmatiker pushed a commit to inf0rmatiker/infra-controller-core that referenced this pull request May 7, 2026
## Description

This PR changes is scoped to the feature commit series (tickets
**NVIDIA#811**, **NVIDIA#812**, **NVIDIA#1177**).

---
## Purpose
Expose **machine identity** over **EC2-style IMDS HTTP paths** (e.g.
`…/meta-data/identity`) with:
- **Rate limiting** and **timeouts**
- Optional **HTTP sign-proxy** (forward to another IMDS origin)
- **Shared logic** between **embedded metadata in `forge-dpu-agent`**
and **standalone `carbide-fmds`**
---
## New shared crate: `carbide-dpu-fmds-shared`
(`crates/dpu-fmds-shared/`)
- Centralizes `GET …/meta-data/identity` behavior:
- `Metadata: true` SSRF requirement; optional `Accept: text/plain` vs
JSON body
  - `aud` query parsing
  - **Governor** rate limit (`wait_identity_rate_limit_permit`)
  - **Forge** path: `sign_machine_identity_with_forge`
- Optional **HTTP sign-proxy**: `forward_sign_proxy_if_ready` /
`forward_sign_proxy_http` (forwards `Metadata` / `Accept`, returns
upstream status/body)
  - `map_grpc_status_to_http` for error mapping
- Defines trait `MetaDataIdentitySigner` (permit → optional proxy →
Forge sign).
---
## `forge-dpu-agent` (`crates/agent/`)
- `InstanceMetadataRouterState` / `InstanceMetadataRouterStateImpl`:
implements `MetaDataIdentitySigner`; `serve_meta_data_identity`
delegates to shared `machine_identity::serve_meta_data_identity`.
- **Config**: new `[machine-identity]` TOML section (kebab-case,
consistent with `[forge-system]`, `[period]`, etc.):
- `requests-per-second`, `burst`, `wait-timeout-secs`,
`sign-timeout-secs`
  - optional `sign-proxy-url` / `sign-proxy-tls-root-ca`
- **Example** + `example_agent_config.toml` updated.
- **Main loop / FMDS client**: wiring so identity behavior stays
consistent with FMDS updates where applicable.
- **Routing**: paths use standard IMDS shapes (e.g.
`/latest/meta-data/identity`, `/2009-04-04/…`).
---
## `carbide-dpu-agent-utils` (`crates/dpu-agent-utils/`)
- New `machine_identity`: **defaults** and **limits** (valid ranges)
reused by host config and FMDS proto mapping.
---
## `carbide-host-support` (`crates/host-support/`)
- **`MachineIdentityConfig`**: deserialize / validate
`[machine-identity]`, with unit tests (e.g. TOML load).
---
## Standalone FMDS (`crates/fmds/`)
- **`identity_signer.rs`**: `MetaDataIdentitySigner` for `FmdsState`
(Forge + sign-proxy + governor).
- **gRPC `UpdateConfig`**: optional `FmdsMachineIdentityConfig`.
- **REST** metadata routes updated to use shared identity behavior where
relevant.
---
## Protobuf (`crates/rpc/proto/fmds.proto`)
- **`FmdsConfigUpdate`**: optional `machine_identity` (mirrors agent
`[machine-identity]` for standalone FMDS).
---
## API / model / misc (review feedback from PR NVIDIA#972)
- **`api-model`** / `api`: small adjustments around **tenant identity
config** (handler + model).
- **`secrets`**: minor `key_encryption` touch (shared types /
compatibility).
- **Docs**: `spiffe-svid-sdd.md` updates.
---



## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality  
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
tickets **NVIDIA#811**, **NVIDIA#812**, **NVIDIA#1177**

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
[Epic: NVIDIA#261](NVIDIA#261)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants