[cosmos] Fix ClientTelemetry static-init NPE when IMDS access is disabled#48887
[cosmos] Fix ClientTelemetry static-init NPE when IMDS access is disabled#48887lfavreli-betclic wants to merge 3 commits intoAzure:mainfrom
Conversation
ClientTelemetry.<clinit> evaluates
CACHED_METADATA = fetchAzureVmMetadata().cache();
before declaring METADATA_NOT_AVAILABLE. When IMDS access is disabled
(e.g. via COSMOS_DISABLE_IMDS_ACCESS=true), fetchAzureVmMetadata() takes
the fast path:
return Mono.just(METADATA_NOT_AVAILABLE);
At that point METADATA_NOT_AVAILABLE is still null, so Mono.just throws
NullPointerException. The resulting ExceptionInInitializerError leaves
ClientTelemetry permanently un-loadable and every subsequent
CosmosClientBuilder.buildAsyncClient() call fails with:
NoClassDefFoundError: Could not initialize class
com.azure.cosmos.implementation.clienttelemetry.ClientTelemetry
Fix: move METADATA_NOT_AVAILABLE ahead of CACHED_METADATA so the sentinel
exists before fetchAzureVmMetadata() can reference it during class init.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thank you for your contribution @lfavreli-betclic! We will review the pull request and get back to you soon. |
There was a problem hiding this comment.
Pull request overview
Fixes a JVM class-initialization failure in the Cosmos SDK’s ClientTelemetry by ensuring the IMDS “metadata not available” sentinel is initialized before the cached IMDS Mono is created, preventing NoClassDefFoundError cascades when IMDS access is disabled.
Changes:
- Reordered
ClientTelemetrystatic field declarations and added a clarifying comment to prevent a static-init NPE when IMDS access is disabled. - Added a release note entry describing the fix in the Cosmos library changelog.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/clienttelemetry/ClientTelemetry.java |
Reorders static initialization to avoid Mono.just(null) during <clinit> when IMDS is disabled. |
sdk/cosmos/azure-cosmos/CHANGELOG.md |
Documents the bug fix in the unreleased version’s “Bugs Fixed” section. |
| // - The fetch executes at most once | ||
| // - All concurrent subscribers share the single result | ||
| // - The HTTP client is created and disposed within the fetch | ||
| private static final Mono<AzureVMMetadata> CACHED_METADATA = fetchAzureVmMetadata().cache(); |
There was a problem hiding this comment.
Consider adding a regression test to cover the class-initialization scenario this change fixes (IMDS access disabled causing ClientTelemetry to become unloadable). There is already child-JVM test infrastructure in azure-cosmos-tests (e.g., ImplementationBridgeHelpersTest uses ProcessBuilder) that could run a small main with -DCOSMOS.DISABLE_IMDS_ACCESS=true (or COSMOS_DISABLE_IMDS_ACCESS=true) before any Cosmos classes load, then assert that a CosmosClientBuilder.buildAsyncClient() call completes without ExceptionInInitializerError/NoClassDefFoundError. This would prevent future static-field reordering regressions.
@microsoft-github-policy-service agree company="Betclic Group" |
| // Must be declared before CACHED_METADATA so that fetchAzureVmMetadata() | ||
| // never reads a null value during class initialization (e.g. when IMDS | ||
| // access is disabled via COSMOS_DISABLE_IMDS_ACCESS). | ||
| private static final AzureVMMetadata METADATA_NOT_AVAILABLE = new AzureVMMetadata(); |
There was a problem hiding this comment.
Like you mentioned in the PR description - moving to static initialization block to also fix the fragile IMDS_DEFAULT_* fields - would be my preference. I can take the additional changes (adding test coverage and making this change form here if you don't wnat to spend more time on this - perfectly understandable). If you want to finish/merge the PR please let me know when these changes are added and I will quickly re-review and approve.
There was a problem hiding this comment.
Thanks a lot, @FabianMeiswinkel, for your quick review!
You have a much better overall understanding of the SDK, so I’m perfectly happy to let you take over and apply the remaining changes. I hope this initial work was helpful to you. 👌
There was a problem hiding this comment.
It was and thanks for providing such detailed repro-info - really appreciated!
There was a problem hiding this comment.
Thanks @lfavreli-betclic - i have published a PR #48888 for this and will get it merged asap.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Thanks @lfavreli-betclic - as discussed, i have published a PR #48888 for this and will get it merged asap. |
Description
Fixes a
NullPointerExceptionthrown duringClientTelemetry.<clinit>when IMDS access is disabled viaCOSMOS_DISABLE_IMDS_ACCESS=true(env var) or-DCOSMOS.DISABLE_IMDS_ACCESS=true(system property). The error makesClientTelemetrypermanently un-loadable for the lifetime of the JVM, so every subsequentCosmosClientBuilder.buildAsyncClient()fails withNoClassDefFoundError.First shipped in 4.79.0 when
CACHED_METADATAwas introduced as a statically-initialized field; 4.78.0 is unaffected.Root cause
In
ClientTelemetrythe fields are declared in this order:<clinit>evaluates them top-down. At (1),fetchAzureVmMetadata()runs and, when IMDS is disabled, eagerly returns:But
METADATA_NOT_AVAILABLEhasn't been assigned yet, soMono.just(null)throws NPE viaObjects.requireNonNull. This escapes asExceptionInInitializerError, and every later reference to the class surfaces asNoClassDefFoundError: Could not initialize class ... ClientTelemetry.Fix
Declare
METADATA_NOT_AVAILABLEbeforeCACHED_METADATAso the sentinel exists beforefetchAzureVmMetadata()can read it. A short comment documents the ordering invariant.Diff is 2 files, +7/-3 , reordering only, no behavioral change.
Why it matters
COSMOS_DISABLE_IMDS_ACCESSis the documented opt-out for non-Azure environments (local dev, CI, other clouds), where IMDS at169.254.169.254is not routable and the blind probe costs a multi-second timeout per cold start. Users who set it today on 4.79.x cannot build a Cosmos client at all.Reproduction
On 4.79.0 / 4.79.1 (before this PR):
With this PR: the client builds normally;
fetchAzureVmMetadata()returns aMono.just(METADATA_NOT_AVAILABLE)that resolves to the sentinel on subscription, exactly as intended.Note for reviewers (out of scope)
The same declaration-order pattern affects
IMDS_AZURE_VM_METADATAand the threeIMDS_DEFAULT_*timeout fields, they are also read insidefetchAzureVmMetadata()and declared afterCACHED_METADATA. Today that path only runs when IMDS is enabled and happens to toleratenull/0values at config build time, so it doesn't NPE; but the invariant is fragile. I kept this PR strictly minimal (single-field move + comment). Happy to move those fields too, or replace both declarations with astatic { … }block, in a follow-up if you prefer.All SDK Contribution checklist
General Guidelines and Best Practices
Testing Guidelines
Pull request includes test coverage for the included changes.
Rationale: the bug lives in class-static-initializer ordering. A test that reliably exercises it would need to fork a JVM with
COSMOS.DISABLE_IMDS_ACCESS=truebeforeClientTelemetryis first referenced, then observe a successfulbuildAsyncClient(). Happy to add such a harness (e.g. a@TempJvmProperty+ separate JUnit test process) if maintainers think it's worth the complexity for a one-line ordering fix, otherwise the existing integration tests running in an environment whereCOSMOS.DISABLE_IMDS_ACCESS=trueis set will cover it implicitly.