feat: surface TLS error cause chain in forge_tls_client logs (refs #1…#1160
Merged
wminckler approved these changes on Apr 28, 2026
poroh approved these changes on Apr 30, 2026
Contributor
/ok to test 2b583fd
Contributor
/ok to test 956b40b
🌿 Preview your docs: https://nvidia-preview-pull-request-1160.docs.buildwithfern.com/infra-controller
Collaborator
/ok to test e499a3f
auto-merge was automatically disabled
May 4, 2026 14:36
Head branch was pushed to by a user without write access
e499a3f to 5566858
Collaborator
/ok to test 815a6d4
db7c528 to ca91022
…IDIA#1088) Signed-off-by: rpowers <rpowers@nvidia.com>
Signed-off-by: rpowers <rpowers@nvidia.com>
Signed-off-by: rpowers <rpowers@nvidia.com>
ca91022 to ff22665
Collaborator
/ok to test 35fe69e
Contributor, Author
@ajf I am unable to merge this myself, so let me know if there is anything else I should do here.
inf0rmatiker pushed a commit to inf0rmatiker/infra-controller that referenced this pull request on May 7, 2026
…IDIA#1… (NVIDIA#1160)
Signed-off-by: rpowers <rpowers@nvidia.com>
Co-authored-by: Alexander Korobkov <akorobkov@nvidia.com>
Description
mTLS connection failures from `forge_tls_client` were logged as `'Unknown error', "client error (Connect)"` with no indication TLS was involved. Root cause: `tonic::Status` wraps the underlying transport error in its `source()` chain, but the standard library `Display` impl for errors doesn't recurse, so logging with `{}` (or `to_string()`) drops the rustls/hyper detail underneath.
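
To illustrate the root cause, here is a minimal sketch. It does not use the real tonic, hyper, or rustls types: `OuterError` and `InnerError` are hypothetical stand-ins, and the messages are borrowed from the log lines in this description. It shows that a typical `Display` impl writes only its own level, so `to_string()` on the outer error omits whatever `source()` holds.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical inner error standing in for a rustls/hyper failure.
#[derive(Debug)]
struct InnerError;

impl fmt::Display for InnerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid peer certificate: UnknownIssuer")
    }
}

impl Error for InnerError {}

// Hypothetical outer error standing in for the transport-level wrapper.
#[derive(Debug)]
struct OuterError {
    source: InnerError,
}

impl fmt::Display for OuterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only this level's message is written; the source is not included.
        write!(f, "client error (Connect)")
    }
}

impl Error for OuterError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

// What `{}` / `to_string()` produces for the outer error.
fn outer_only() -> String {
    let err = OuterError { source: InnerError };
    err.to_string()
}

// The detail is still reachable, but only by asking for `source()` explicitly.
fn inner_via_source() -> Option<String> {
    let err = OuterError { source: InnerError };
    err.source().map(|cause| cause.to_string())
}
```

In this sketch `outer_only()` yields just the outer message, while the certificate detail only appears once `source()` is consulted.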
This change adds a small private `format_error_chain` helper that walks `std::error::Error::source()` and joins each level with `: `, then uses it at the four log/wrap sites in `crates/rpc/src/forge_tls_client.rs` (per-attempt log + `Connection(String)` wrapping, in both `retry_build` and `retry_build_nmx_c`).
To exercise the change locally, we created a CA mismatch as a representative mTLS failure mode. Same client log line, before vs. after:
Before:
... will retry: status: 'Unknown error', self: "client error (Connect)"
After:
... will retry: status: 'Unknown error', self: "client error (Connect)": client error (Connect): invalid peer certificate: UnknownIssuer
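
A helper along these lines could look like the sketch below. This is an illustration under assumptions, not the code merged in `crates/rpc/src/forge_tls_client.rs`: the real signature may differ, and `Layer`, `ClientError`, `sample_chain`, and `wrap_connect_error` are hypothetical names introduced for the example. Only the `Connection(String)` variant shape and the `: ` separator come from the description above.

```rust
use std::error::Error;
use std::fmt;

/// Walks `source()` from `err` downward and joins each level's message with ": ".
/// Assumed shape; the merged helper may differ in signature or details.
fn format_error_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}

// Hypothetical error type used only to build a chain for the example.
#[derive(Debug)]
struct Layer {
    msg: &'static str,
    source: Option<Box<dyn Error>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

// Hypothetical enum; only the `Connection(String)` variant shape is taken from the PR text.
#[derive(Debug, PartialEq)]
enum ClientError {
    Connection(String),
}

// Hypothetical wrap site: store the full chain instead of only the top-level message.
fn wrap_connect_error(err: &dyn Error) -> ClientError {
    ClientError::Connection(format_error_chain(err))
}

// Two-level chain built from the messages shown in the log lines above.
fn sample_chain() -> Layer {
    Layer {
        msg: "client error (Connect)",
        source: Some(Box::new(Layer {
            msg: "invalid peer certificate: UnknownIssuer",
            source: None,
        })),
    }
}
```

With this shape, a log site would pass `format_error_chain(&err)` to its logging call instead of formatting `err` with `{}`, and a wrap site would store the same string in `Connection(String)`.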
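The same code path is hit by every mTLS client of carbide-api (DHCP, machine-a-tron, etc.): they all funnel through `ForgeTlsClient::retry_build`, so this benefits all of them.
Type of Change
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)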
Related Issues (Optional)
Closes #1088
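Breaking Changes
- [ ] This PR contains breaking changes
Testing
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)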
Two new unit tests verify `format_error_chain` walks a chain and handles the no-source case. Manual end-to-end: reproduced the original failure mode against `machine-a-tron` on a local kind cluster, captured before/after log lines (shown above).
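
The two cases described above could be covered by tests shaped roughly like this. The helper body and the `Leaf` and `Wrapped` error types are assumptions for illustration; the PR's actual tests are not reproduced here.

```rust
use std::error::Error;
use std::fmt;

// Assumed shape of the helper; the real implementation may differ.
fn format_error_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}

// Hypothetical error with no source.
#[derive(Debug)]
struct Leaf(&'static str);

impl fmt::Display for Leaf {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.0)
    }
}

impl Error for Leaf {}

// Hypothetical error that wraps another error as its source.
#[derive(Debug)]
struct Wrapped {
    msg: &'static str,
    source: Box<dyn Error>,
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn walks_a_chain() {
        let err = Wrapped {
            msg: "outer",
            source: Box::new(Wrapped {
                msg: "middle",
                source: Box::new(Leaf("inner")),
            }),
        };
        assert_eq!(format_error_chain(&err), "outer: middle: inner");
    }

    #[test]
    fn handles_no_source() {
        // With no source, the output is just the error's own message.
        assert_eq!(format_error_chain(&Leaf("only")), "only");
    }
}
```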
Additional Notes
The fix is generic — it surfaces whatever the underlying error chain
contains, not just cert errors. Other mTLS failures (handshake errors,
expired certs, hostname mismatches) will get the same treatment.