Skip to content

feat: surface TLS error cause chain in forge_tls_client logs (refs #1…#1160

Merged
ajf merged 4 commits into
NVIDIA:mainfrom
rpowers-nv:feat/1088-mtls-error-messages
May 6, 2026
Merged

feat: surface TLS error cause chain in forge_tls_client logs (refs #1…#1160
ajf merged 4 commits into
NVIDIA:mainfrom
rpowers-nv:feat/1088-mtls-error-messages

Conversation

@rpowers-nv
Copy link
Copy Markdown
Contributor

@rpowers-nv rpowers-nv commented Apr 27, 2026

Description

mTLS connection failures from forge_tls_client were logged as
'Unknown error', "client error (Connect)" with no indication TLS was
involved. Root cause: tonic::Status wraps the underlying transport
error in its source() chain, but the standard library Display impl
for errors doesn't recurse — so logging with {} (or to_string())
drops the rustls/hyper detail underneath.

This change adds a small private format_error_chain helper that walks
std::error::Error::source() and joins each level with : , then uses
it at the four log/wrap sites in crates/rpc/src/forge_tls_client.rs
(per-attempt log + Connection(String) wrapping, in both retry_build
and retry_build_nmx_c).

To exercise the change locally, we created a CA mismatch as a
representative mTLS failure mode. Same client log line, before vs. after:

Before:
... will retry: status: 'Unknown error', self: "client error (Connect)"

After:
... will retry: status: 'Unknown error', self: "client error (Connect)":
client error (Connect): invalid peer certificate: UnknownIssuer

The same code path is hit by every mTLS client of carbide-api (DHCP,
machine-a-tron, etc.) — they all funnel through
ForgeTlsClient::retry_build, so this benefits all of them.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Closes #1088

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Two new unit tests verify format_error_chain walks a chain and handles
the no-source case. Manual end-to-end: reproduced the original failure
mode against machine-a-tron on a local kind cluster, captured before/
after log lines (shown above).

Additional Notes

The fix is generic — it surfaces whatever the underlying error chain
contains, not just cert errors. Other mTLS failures (handshake errors,
expired certs, hostname mismatches) will get the same treatment.

@rpowers-nv rpowers-nv requested a review from a team as a code owner April 27, 2026 19:06
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 30, 2026

/ok to test 2b583fd

@poroh poroh enabled auto-merge (squash) April 30, 2026 06:36
@poroh
Copy link
Copy Markdown
Contributor

poroh commented Apr 30, 2026

/ok to test 956b40b

@github-actions
Copy link
Copy Markdown

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented May 4, 2026

/ok to test e499a3f

auto-merge was automatically disabled May 4, 2026 14:36

Head branch was pushed to by a user without write access

@rpowers-nv rpowers-nv force-pushed the feat/1088-mtls-error-messages branch from e499a3f to 5566858 Compare May 4, 2026 14:36
@ajf
Copy link
Copy Markdown
Collaborator

ajf commented May 4, 2026

/ok to test 815a6d4

@NVIDIA NVIDIA deleted a comment from copy-pr-bot Bot May 4, 2026
@rpowers-nv rpowers-nv force-pushed the feat/1088-mtls-error-messages branch from db7c528 to ca91022 Compare May 5, 2026 16:09
@rpowers-nv rpowers-nv requested review from a team, Coco-Ben and shayan1995 as code owners May 5, 2026 16:09
rpowers-nv and others added 3 commits May 5, 2026 12:16
Signed-off-by: rpowers <rpowers@nvidia.com>
Signed-off-by: rpowers <rpowers@nvidia.com>
@rpowers-nv rpowers-nv force-pushed the feat/1088-mtls-error-messages branch from ca91022 to ff22665 Compare May 5, 2026 16:17
Copy link
Copy Markdown
Contributor

@akorobkov-nvda akorobkov-nvda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented May 5, 2026

/ok to test 35fe69e

@rpowers-nv
Copy link
Copy Markdown
Contributor Author

@ajf I am unable to merge myself so let me know if there is anything else I should do here.

@ajf ajf merged commit d9c9d68 into NVIDIA:main May 6, 2026
44 checks passed
inf0rmatiker pushed a commit to inf0rmatiker/infra-controller that referenced this pull request May 7, 2026
…IDIA#1… (NVIDIA#1160)

## Description

  mTLS connection failures from `forge_tls_client` were logged as
  `'Unknown error', "client error (Connect)"` with no indication TLS was
  involved. Root cause: `tonic::Status` wraps the underlying transport
  error in its `source()` chain, but the standard library `Display` impl
  for errors doesn't recurse — so logging with `{}` (or `to_string()`)
  drops the rustls/hyper detail underneath.

This change adds a small private `format_error_chain` helper that walks
`std::error::Error::source()` and joins each level with `: `, then uses
  it at the four log/wrap sites in `crates/rpc/src/forge_tls_client.rs`
(per-attempt log + `Connection(String)` wrapping, in both `retry_build`
  and `retry_build_nmx_c`).

  To exercise the change locally, we created a CA mismatch as a
representative mTLS failure mode. Same client log line, before vs.
after:

  Before:
... will retry: status: 'Unknown error', self: "client error (Connect)"

  After:
... will retry: status: 'Unknown error', self: "client error (Connect)":
    client error (Connect): invalid peer certificate: UnknownIssuer

  The same code path is hit by every mTLS client of carbide-api (DHCP,
  machine-a-tron, etc.) — they all funnel through
  `ForgeTlsClient::retry_build`, so this benefits all of them.

  ## Type of Change
  - [ ] **Add** - New feature or capability
  - [x] **Change** - Changes in existing functionality
  - [ ] **Fix** - Bug fixes
  - [ ] **Remove** - Removed features or deprecated functionality
  - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

  ## Related Issues (Optional)

  Closes NVIDIA#1088

  ## Breaking Changes
  - [ ] This PR contains breaking changes

  ## Testing
  - [x] Unit tests added/updated
  - [ ] Integration tests added/updated
  - [x] Manual testing performed
  - [ ] No testing required (docs, internal refactor, etc.)

Two new unit tests verify `format_error_chain` walks a chain and handles
  the no-source case. Manual end-to-end: reproduced the original failure
mode against `machine-a-tron` on a local kind cluster, captured before/
  after log lines (shown above).

  ## Additional Notes

  The fix is generic — it surfaces whatever the underlying error chain
  contains, not just cert errors. Other mTLS failures (handshake errors,
  expired certs, hostname mismatches) will get the same treatment.

---------

Signed-off-by: rpowers <rpowers@nvidia.com>
Co-authored-by: Alexander Korobkov <akorobkov@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: user friendly error message for mTLS error during connection

5 participants