[cosmos] Fix ClientTelemetry static-init NPE when IMDS access is disabled#48888
Conversation
|
This issue was initially found by @lfavreli-betclic who also contributed the product code fix in #48887 |
There was a problem hiding this comment.
Pull request overview
This PR fixes a Cosmos SDK startup failure caused by ClientTelemetry’s class static-initializer (<clinit>) referencing a sentinel (METADATA_NOT_AVAILABLE) before it was initialized when IMDS access is disabled. The change ensures static initialization order is safe, preventing ExceptionInInitializerError/NoClassDefFoundError during client creation in non-Azure environments.
Changes:
- Reworked
ClientTelemetrystatic initialization so the sentinel and IMDS defaults are initialized beforeCACHED_METADATAis created. - Added a changelog entry documenting the fix.
- Added a regression test that forks a fresh JVM with
-DCOSMOS.DISABLE_IMDS_ACCESS=trueto validate client build does not fail due toClientTelemetrystatic init.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/clienttelemetry/ClientTelemetry.java | Moves static initialization into an ordered static {} block to prevent NPE during <clinit> when IMDS is disabled. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds an unreleased bug-fix entry describing the ClientTelemetry static-init fix. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ImplementationBridgeHelpersTest.java | Adds a forked-JVM regression test to catch ClientTelemetry static initializer failures with IMDS disabled. |
|
@sdkReviewAgent |
|
✅ Review complete (26:06) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Description
Fixes a
NullPointerExceptionthrown duringClientTelemetry.<clinit>when IMDS access is disabled viaCOSMOS_DISABLE_IMDS_ACCESS=true(env var) or-DCOSMOS.DISABLE_IMDS_ACCESS=true(system property). The error makesClientTelemetrypermanently un-loadable for the lifetime of the JVM, so every subsequentCosmosClientBuilder.buildAsyncClient()fails withNoClassDefFoundError.First shipped in 4.79.0 when
CACHED_METADATAwas introduced as a statically-initialized field; 4.78.0 is unaffected.Root cause
In
ClientTelemetrythe fields are declared in this order:<clinit>evaluates them top-down. At (1),fetchAzureVmMetadata()runs and, when IMDS is disabled, eagerly returns:But
METADATA_NOT_AVAILABLEhasn't been assigned yet, soMono.just(null)throws NPE viaObjects.requireNonNull. This escapes asExceptionInInitializerError, and every later reference to the class surfaces asNoClassDefFoundError: Could not initialize class ... ClientTelemetry.Fix
Declare
METADATA_NOT_AVAILABLEbeforeCACHED_METADATAso the sentinel exists beforefetchAzureVmMetadata()can read it. A short comment documents the ordering invariant.Diff is 2 files, +7/-3 , reordering only, no behavioral change.
Why it matters
COSMOS_DISABLE_IMDS_ACCESSis the documented opt-out for non-Azure environments (local dev, CI, other clouds), where IMDS at169.254.169.254is not routable and the blind probe costs a multi-second timeout per cold start. Users who set it today on 4.79.x cannot build a Cosmos client at all.Reproduction
On 4.79.0 / 4.79.1 (before this PR):
With this PR: the client builds normally;
fetchAzureVmMetadata()returns aMono.just(METADATA_NOT_AVAILABLE)that resolves to the sentinel on subscription, exactly as intended.Note for reviewers (out of scope)
The same declaration-order pattern affects
IMDS_AZURE_VM_METADATAand the threeIMDS_DEFAULT_*timeout fields, they are also read insidefetchAzureVmMetadata()and declared afterCACHED_METADATA. Today that path only runs when IMDS is enabled and happens to toleratenull/0values at config build time, so it doesn't NPE; but the invariant is fragile. I kept this PR strictly minimal (single-field move + comment). Happy to move those fields too, or replace both declarations with astatic { … }block, in a follow-up if you prefer.All SDK Contribution checklist
General Guidelines and Best Practices
Testing Guidelines
Pull request includes test coverage for the included changes.
Rationale: the bug lives in class-static-initializer ordering. A test that reliably exercises it would need to fork a JVM with
COSMOS.DISABLE_IMDS_ACCESS=truebeforeClientTelemetryis first referenced, then observe a successfulbuildAsyncClient(). Happy to add such a harness (e.g. a@TempJvmProperty+ separate JUnit test process) if maintainers think it's worth the complexity for a one-line ordering fix, otherwise the existing integration tests running in an environment whereCOSMOS.DISABLE_IMDS_ACCESS=trueis set will cover it implicitly.