[Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account by xinlian12 · Pull Request #49560 · Azure/azure-sdk-for-java

xinlian12 · 2026-06-18T18:42:17Z

Description

Today every CosmosClient / CosmosAsyncClient owns its own RxPartitionKeyRangeCache, even when many clients in the same JVM are configured with the same service endpoint (a common pattern for multi-tenant / multi-credential apps and frameworks that recreate clients). The routing-map data is duplicated N times and /pkranges calls fan out N times for the same containers.

This PR moves the routing-map storage to a process-wide, refcounted registry keyed by the service endpoint URI configured on CosmosClientBuilder. The fetching path (which depends on the per-client network stack, auth, collection cache, diagnostics) stays per-client.

Design

Split RxPartitionKeyRangeCache into two layers:

Storage — AsyncCacheNonBlocking<String, CollectionRoutingMap>. Account-level data, naturally shareable. Now obtained from SharedPartitionKeyRangeCacheRegistry (process-wide singleton) keyed by the service endpoint URI.
Fetcher — issues /pkranges, depends on per-client RxDocumentClientImpl, RxCollectionCache, diagnostics. Unchanged.
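
The storage/fetcher split above can be illustrated with a minimal sketch. It is not the PR's code: `SharedStorageSketch`, `ClientCache`, and the plain `ConcurrentHashMap` standing in for `AsyncCacheNonBlocking<String, CollectionRoutingMap>` are all hypothetical, and the "fetcher" is just a function that counts its own invocations.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SharedStorageSketch {
    // Process-wide storage layer: endpoint -> (collection rid -> routing map).
    // A String stands in for CollectionRoutingMap in this sketch.
    static final Map<URI, Map<String, String>> STORAGE_BY_ENDPOINT = new ConcurrentHashMap<>();

    // Per-client layer: owns its fetcher, borrows the shared storage.
    static final class ClientCache {
        private final Map<String, String> storage;
        private final Function<String, String> fetcher;
        final AtomicInteger fetchCount = new AtomicInteger();

        ClientCache(URI endpoint, Function<String, String> fetcher) {
            this.storage = STORAGE_BY_ENDPOINT.computeIfAbsent(endpoint, e -> new ConcurrentHashMap<>());
            this.fetcher = fetcher;
        }

        String lookup(String collectionRid) {
            // Only the client that misses runs its own fetcher.
            return storage.computeIfAbsent(collectionRid, rid -> {
                fetchCount.incrementAndGet();
                return fetcher.apply(rid);
            });
        }
    }

    public static void main(String[] args) {
        URI endpoint = URI.create("https://acct.documents.azure.com/");
        ClientCache a = new ClientCache(endpoint, rid -> "routing-map-for-" + rid);
        ClientCache b = new ClientCache(endpoint, rid -> "routing-map-for-" + rid);
        a.lookup("coll1");
        b.lookup("coll1");
        System.out.println(a.fetchCount.get() + " " + b.fetchCount.get()); // prints "1 0"
    }
}
```

Client B is served from the entry client A populated, which is the behaviour the `twoCachesForSameEndpointShareRoutingMapStorage` unit test pins for the real cache.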

Scope of sharing

Two clients share the cache only when their service endpoint URIs compare equal via URI.equals (case-insensitive on host per RFC 3986). Clients configured with different endpoint URIs — including the global endpoint vs a regional endpoint of the same logical account — do not share.

The natural-looking alternative of keying by DatabaseAccount.getId() (so global + regional clients of the same account would share) was tried and rejected: the id returned from a regional endpoint is <globalId>-<service-normalised-region>, and recovering the global form requires brittle suffix-stripping against the readable/writable locations list. DatabaseAccount.getResourceId() (the _rid field) is not a documented canonical id at the protocol level. Rather than ship a fragile canonicalisation, the registry keys directly on the builder-supplied URI.
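
To make the sharing scope concrete, here is a small sketch of how `java.net.URI.equals` treats a few endpoint pairs. The helper `sameRegistryKey` and the regional host name are made up for illustration; only the `URI` behaviour itself is standard.

```java
import java.net.URI;

final class EndpointKeySketch {
    // Hypothetical helper: two endpoints map to the same registry entry
    // exactly when their URIs are equal.
    static boolean sameRegistryKey(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Host casing differs only: URI.equals compares hosts case-insensitively.
        System.out.println(sameRegistryKey(
            "https://Acct.documents.azure.com/", "https://acct.documents.azure.com/")); // true
        // Global vs (made-up) regional host: different URIs, so no sharing.
        System.out.println(sameRegistryKey(
            "https://acct.documents.azure.com/", "https://acct-westus.documents.azure.com/")); // false
        // An explicit port is part of URI equality as well.
        System.out.println(sameRegistryKey(
            "https://acct.documents.azure.com:443/", "https://acct.documents.azure.com/")); // false
    }
}
```

The last case is a property of `java.net.URI` rather than something the PR discusses; whether the builder normalises the port before the URI reaches the registry is not stated here.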

Concurrency model

All registry state transitions go through ConcurrentHashMap.compute(...), which provides atomic per-key check-and-update.
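
A minimal sketch of a refcounted registry built on `ConcurrentHashMap.compute(...)` follows. The class and method names are hypothetical, the entry is modelled as an immutable value so reads need no extra synchronisation, and a plain map stands in for the real cache type; the actual `SharedPartitionKeyRangeCacheRegistry` may be structured differently.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RefCountedRegistrySketch {
    // Immutable entry: every transition swaps in a new Entry inside compute().
    static final class Entry {
        final Map<String, String> cache;
        final int refCount;
        Entry(Map<String, String> cache, int refCount) {
            this.cache = cache;
            this.refCount = refCount;
        }
    }

    private final ConcurrentHashMap<URI, Entry> entries = new ConcurrentHashMap<>();

    Map<String, String> acquire(URI endpoint) {
        // Create-or-increment happens atomically for this key.
        return entries.compute(endpoint, (k, e) ->
            e == null ? new Entry(new ConcurrentHashMap<>(), 1)
                      : new Entry(e.cache, e.refCount + 1)).cache;
    }

    void release(URI endpoint) {
        // Returning null from compute() removes the mapping (eviction on last release).
        entries.compute(endpoint, (k, e) ->
            (e == null || e.refCount <= 1) ? null : new Entry(e.cache, e.refCount - 1));
    }

    int referenceCount(URI endpoint) {
        Entry e = entries.get(endpoint);
        return e == null ? 0 : e.refCount;
    }

    public static void main(String[] args) throws InterruptedException {
        RefCountedRegistrySketch registry = new RefCountedRegistrySketch();
        URI endpoint = URI.create("https://acct.documents.azure.com/");
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int op = 0; op < 1000; op++) {
                    registry.acquire(endpoint);
                    registry.release(endpoint);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println(registry.referenceCount(endpoint)); // prints 0
    }
}
```

Because the check ("is there an entry, and what is its count?") and the update happen inside one `compute` call, no global lock is needed and a concurrent acquire cannot observe an entry that is midway through eviction.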

Lifecycle

RxPartitionKeyRangeCache ctor acquires from the registry (bumps refcount).
RxPartitionKeyRangeCache implements Closeable; close() releases the refcount and is idempotent (guarded by AtomicBoolean).
RxDocumentClientImpl.close() calls LifeCycleUtils.closeQuietly(partitionKeyRangeCache).
A leak-safety net registers a one-shot cleanup with com.azure.core.util.ReferenceManager: if a client is GC'd without calling close(), the cleanup decrements the refcount once. A WARN log identifies the leaking endpoint.
When the last reference is released, the registry entry is evicted so idle endpoints don't pin memory.
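
The "decrement exactly once" guarantee in the lifecycle above can be sketched with a one-shot handle shared by `close()` and the deferred cleanup. This is an illustration under assumptions: the `ReleaseHandle` name appears later in this PR, but its body here is a guess, and the deferred cleanup is simulated by calling the handle directly instead of registering it with `com.azure.core.util.ReferenceManager`.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class OneShotReleaseSketch {
    // Whichever caller wins the compareAndSet performs the decrement; later calls are no-ops.
    static final class ReleaseHandle implements Runnable {
        private final AtomicBoolean released = new AtomicBoolean(false);
        private final Runnable decrement;

        ReleaseHandle(Runnable decrement) {
            this.decrement = decrement;
        }

        @Override
        public void run() {
            if (released.compareAndSet(false, true)) {
                decrement.run();
            }
        }
    }

    // Stand-in for the per-client cache: close() just fires the handle.
    static final class Cache implements Closeable {
        private final ReleaseHandle handle;

        Cache(ReleaseHandle handle) {
            this.handle = handle;
        }

        @Override
        public void close() {
            handle.run();
        }
    }

    public static void main(String[] args) {
        AtomicInteger refCount = new AtomicInteger(1);
        ReleaseHandle handle = new ReleaseHandle(refCount::decrementAndGet);
        Cache cache = new Cache(handle);
        cache.close();
        cache.close();   // idempotent: no second decrement
        handle.run();    // simulated late cleanup after close(): also a no-op
        System.out.println(refCount.get()); // prints 0
    }
}
```

Note that the handle holds only the decrement action, not the cache that owns it; in the real design that separation matters because a cleanup action that references its owner would keep the owner reachable.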

Diagnostics

PARTITION_KEY_RANGE_LOOK_UP metadata diagnostics are recorded at the outer cache lookup site (tryLookupAsync, tryGetRangeByPartitionKeyRangeId) rather than only on the inner network fetch path. This is required by cache sharing: a client that serves a PK-range lookup from a cache populated by a sibling client performs no /pkranges fetch, yet its diagnostics must still record the lookup (the directDiagnostics test asserts PARTITION_KEY_RANGE_LOOK_UP is present). One record is therefore emitted per PK-range lookup regardless of cache hit/miss; on a hit the duration is the sub-millisecond lookup latency, on a miss it includes the network fetch. Because a single operation can now emit multiple PARTITION_KEY_RANGE_LOOK_UP entries (a cache hit followed by a forced-refresh fetch), FaultInjectionMetadataRequestRuleTests was updated to assert on the maximum-duration entry (the actual fetch) instead of the first.
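
The "assert on the maximum-duration entry" idea can be sketched as follows. `MetadataEntry` is a hypothetical stand-in for the SDK's metadata diagnostics records, and `OTHER_METADATA` is a made-up type used only to show the filter; the real test reads these values from the client's diagnostics.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class MaxDurationLookupSketch {
    static final String PKR_LOOKUP = "PARTITION_KEY_RANGE_LOOK_UP";

    // Hypothetical record of one metadata diagnostics entry.
    static final class MetadataEntry {
        final String type;
        final Duration duration;

        MetadataEntry(String type, Duration duration) {
            this.type = type;
            this.duration = duration;
        }
    }

    // Picks the slowest PK-range lookup, i.e. the one that actually went to the network.
    static Optional<MetadataEntry> slowestPkRangeLookup(List<MetadataEntry> entries) {
        return entries.stream()
            .filter(e -> PKR_LOOKUP.equals(e.type))
            .max(Comparator.comparing((MetadataEntry e) -> e.duration));
    }

    public static void main(String[] args) {
        List<MetadataEntry> entries = Arrays.asList(
            new MetadataEntry(PKR_LOOKUP, Duration.ofNanos(200_000)),    // cache hit
            new MetadataEntry("OTHER_METADATA", Duration.ofMillis(900)), // ignored by the filter
            new MetadataEntry(PKR_LOOKUP, Duration.ofMillis(550)));      // forced-refresh fetch
        System.out.println(slowestPkRangeLookup(entries).get().duration.toMillis()); // prints 550
    }
}
```

Asserting on the first entry would pick the 0.2 ms cache hit here, which is why an injected /pkranges delay is only reliably visible on the maximum.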

Opt-out

System property COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false restores per-client private caches.
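
As a rough sketch of how such a flag is typically read, assuming a default of enabled as listed for `Configs.java` in this PR: the class and method names below are hypothetical, and the real `Configs` accessor may differ (for example, it may consult other sources or cache the value).

```java
final class SharedCacheFlagSketch {
    static final String PROPERTY = "COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED";

    // Defaults to enabled when the property is absent.
    static boolean isSharedCacheEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    public static void main(String[] args) {
        System.out.println(isSharedCacheEnabled());
    }
}
```

Being a JVM system property, it can be set on the command line with `-DCOSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false` or programmatically via `System.setProperty` before the first client is built.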

Files

File	Change
`caches/SharedPartitionKeyRangeCacheRegistry.java`	NEW — process-wide registry singleton, keyed by service-endpoint `URI`
`caches/RxPartitionKeyRangeCache.java`	3-arg ctor `(client, collectionCache, URI)`; registry-backed storage; idempotent `close()`; `PARTITION_KEY_RANGE_LOOK_UP` diagnostic emitted at outer lookups (so piggybacking clients still record it)
`Configs.java`	New system property `COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED` (default: enabled)
`RxDocumentClientImpl.java`	Pass `this.serviceEndpoint` to the cache ctor; release the cache in `close()`
`caches/SharedPartitionKeyRangeCacheRegistryTest.java`	NEW — 13 unit tests inc. 32-thread concurrency stress, GC-driven leak cleanup, URI host case-insensitivity, and a negative-case pin that regional vs global endpoints don't share
`caches/RxPartitionKeyRangeCacheTest.java`	+5 unit tests: cross-client sharing, cross-endpoint isolation, close idempotency, ctor-lifecycle, cross-client force-refresh visibility
`SharedPartitionKeyRangeCacheE2ETest.java`	NEW — e2e test against a real Cosmos endpoint: positive sharing on the same endpoint (cross-endpoint isolation is covered by unit tests, since `CosmosClientBuilder` normalises the endpoint URI so two distinct connectable endpoints can't be built in a single-endpoint test environment)
`CHANGELOG.md`	Entry under 4.82.0-beta.1

Test plan

✅ mvn install (azure-cosmos)
✅ mvn checkstyle:check spotbugs:check (azure-cosmos + azure-cosmos-tests)
✅ Unit tests pass: 24 tests, 0 failures (RxPartitionKeyRangeCacheTest + SharedPartitionKeyRangeCacheRegistryTest)
⏳ E2e tests (SharedPartitionKeyRangeCacheE2ETest) registered under the emulator and fast Maven profiles — executed in CI against the configured Cosmos endpoint.

Key behavioural tests (unit)

twoCachesForSameEndpointShareRoutingMapStorage — client A populates the routing map, client B serves the same lookup with clientB.readPartitionKeyRanges invoked zero times.
cachesForDifferentEndpointsDoNotShareStorage — clients with different endpoint URIs each invoke their own readPartitionKeyRanges exactly once.
forceRefreshOnSharedCacheIsVisibleToSiblingClient — client A's force-refresh propagates to client B without B issuing its own fetch.
closeIsIdempotent — repeated close() calls do not drive refcount negative.
clientWithServiceEndpointAcquiresAndReleasesRegistryRefcount — regression guard for the RxDocumentClientImpl.close() → partitionKeyRangeCache.close() wiring.
concurrentAcquireAndReleaseProducesConsistentRefcount — 32 threads × 200 ops, refcount ends at 0.
referenceManagerReleasesSharedCacheWhenOwnerIsGarbageCollected — leak-safety net: an unclosed client is reclaimed by ReferenceManager once GC'd.
acquireTreatsHostCaseInsensitivelyMatchingUriEquals — RFC 3986 host case-insensitivity flows through to the registry key.
regionalAndGlobalEndpointsDoNotShareStorage — pins the explicit scope: distinct endpoint URIs use distinct registry entries.
disabledFlagReturnsIsolatedCachesAndPreservesRegistryEmpty — opt-out preserves pre-sharing behaviour.

Key behavioural tests (e2e, real Cosmos endpoint)

twoClientsOnSameEndpointShareRoutingMapStorage — spins up two real CosmosAsyncClients configured with the same endpoint, performs PK-routed reads on both, and asserts they share the same AsyncCacheNonBlocking instance, the registry refcount accounts for both holders, and closing each client decrements the refcount by exactly one.
Cross-endpoint isolation (clients with distinct endpoint URIs use distinct registry entries) is pinned by unit tests — SharedPartitionKeyRangeCacheRegistryTest.acquireReturnsDifferentInstanceForDifferentEndpoints / regionalAndGlobalEndpointsDoNotShareStorage and RxPartitionKeyRangeCacheTest.cachesForDifferentEndpointsDoNotShareStorage — rather than e2e, because CosmosClientBuilder.validateConfig() strips path/query so two distinct connectable endpoint URIs can't be constructed against a single test endpoint.

Breaking changes

None. RxPartitionKeyRangeCache is in the implementation package; its ctor signature and its new Closeable supertype are not part of the public API surface. No customer-visible APIs change.

…ing the same account

Move the partition-key-range routing-map cache from per-CosmosClient to a process-wide, refcounted registry keyed by service endpoint. Multiple CosmosClient / CosmosAsyncClient instances in the same JVM targeting the same Cosmos account now share a single AsyncCacheNonBlocking instance for collection -> CollectionRoutingMap, eliminating duplicate routing-map memory and redundant /pkranges fetches.

Design

- New SharedRoutingMapCacheRegistry (process-wide singleton) holds an AsyncCacheNonBlocking per endpoint URL plus an AtomicInteger refcount. All state transitions go through ConcurrentHashMap.compute, giving atomic per-key check-and-update without a global lock.
- RxPartitionKeyRangeCache: new ctor accepts the service endpoint; underlying routingMapCache is obtained from the registry. Implements Closeable; close() releases this client's reference and is idempotent.
- RxDocumentClientImpl: passes serviceEndpoint to the cache ctor and releases the cache reference in its close() path.
- Opt-out: COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false restores the pre-sharing behaviour (each client owns a private cache).

Why this is safe

- PK-range data is account-level metadata, not credential-bound.
- AsyncCacheNonBlocking already enforces single-flight per key; sharing the instance strengthens that to "single in-flight /pkranges per (account, container) across all clients".
- The two-arg back-compat ctor resolves the endpoint from the client, so existing mocked tests continue to work (mock returns null endpoint -> isolated cache, matching today's behaviour).

Tests

- New SharedRoutingMapCacheRegistryTest: acquire/release sharing, refcount eviction, idempotent release, null-endpoint isolation, opt-out flag, 32-thread concurrent acquire/release stress.
- New RxPartitionKeyRangeCacheTest cases: two caches at same endpoint share storage (verified by mock /pkranges call count = 1, not 2), caches at different endpoints stay independent, close() is idempotent.
- Existing 7 RxPartitionKeyRangeCacheTest cases unchanged and passing.

Reference

Pattern matches Python (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py) which uses module-level endpoint-keyed dicts with refcounted cleanup. Adapted to Java idioms (ConcurrentHashMap.compute instead of explicit RLock, Closeable instead of __del__).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR reduces duplicated routing-map cache memory and redundant /pkranges requests by sharing the storage layer of RxPartitionKeyRangeCache across CosmosClient / CosmosAsyncClient instances that target the same Cosmos account (keyed by service endpoint), while keeping the per-client fetch path unchanged. The shared cache is managed by a process-wide, refcounted registry and can be disabled via a new system property for opt-out.

Changes:

Introduces SharedRoutingMapCacheRegistry (endpoint-keyed, refcounted) to share AsyncCacheNonBlocking<String, CollectionRoutingMap> across clients.
Updates RxPartitionKeyRangeCache to acquire shared storage by endpoint and to implement Closeable for refcount release on client shutdown.
Wires RxDocumentClientImpl.close() to release the cache reference, adds config flag plumbing, and adds targeted unit tests + changelog entry.

File	Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java	Passes endpoint into the cache ctor and releases the cache reference during client close.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java	Adds `COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED` flag (default true).
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistry.java	New process-wide singleton registry for shared routing-map cache storage with refcounted eviction.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java	Splits “storage” vs “fetcher” by sourcing storage from the shared registry and adding `close()` ref-release.
sdk/cosmos/azure-cosmos/CHANGELOG.md	Documents the new sharing behavior and opt-out property.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistryTest.java	New unit tests validating sharing, eviction, disabled behavior, and concurrency refcount correctness.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java	Adds tests validating cross-client sharing, cross-endpoint isolation, and idempotent close behavior.

Copilot's findings

Files reviewed: 7/7 changed files
Comments generated: 1

xinlian12 · 2026-06-19T15:55:55Z

@sdkReviewAgent

…e host matching

Switch SharedRoutingMapCacheRegistry's key type from String to URI so URI.equals(), which is case-insensitive on the host component per RFC 3986, is used for sharing identity. Previously, two clients built with 'https://Acct.documents.azure.com/' and 'https://acct.documents.azure.com/' would fragment into two registry entries even though they target the same account. With URI as the key the two collapse into a single shared entry.

This matches the spirit of the Rust SDK, which uses Url-based equality on its AccountReference identity. Python uses raw string comparison; Java's URI gives us strictly better behaviour for free.

Added a new test (acquireTreatsHostCaseInsensitivelyMatchingUriEquals) that asserts URI.equals() considers the two casings equal AND that the registry produces a single shared entry for them. Ran 34 cache unit tests, 0 failures.

No public API change. RxPartitionKeyRangeCache's three-arg ctor still takes URI; only the internal field type changed (String -> URI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…cross-SDK consistency

Confirmed via cross-SDK review that both peer Cosmos SDKs key sharing on the user-supplied account endpoint URL, not on the account _rid:

- Python (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py): _resolve_endpoint() returns client.url_connection (the input endpoint string) with no normalisation and no _rid lookup.
- Rust (sdk/cosmos/azure_data_cosmos_driver/src/models/account_reference.rs): AccountReference identity is endpoint-only via AccountEndpoint(Url) which Hash/Eq on the Url; PartialEq deliberately excludes credentials and backup endpoints. No _rid involvement.

This SDK should match. The "regional vs global endpoint to the same account" case stays a known fragmentation case across all three SDKs rather than something Java solves alone via _rid.

Why _rid keying was rejected after exploration:

1. Diverges from Python and Rust, which increases mental-model and maintenance cost for cross-SDK contributors.
2. DatabaseAccount.getResourceId() returns the empty string in emulator and some service paths where the account JSON has no _rid (Resource.java:130 delegates to JsonSerializable.getString(R_ID)). Would silently fall back and fragment differently than peers.
3. Brittle to init reorders: today GlobalEndpointManager.init() runs before cache construction, but any future refactor (lazy account fetch, offline-mode init) would silently break sharing. Endpoint URI is constructor-immutable; _rid depends on a successful prior network call.

Final shape:

- Registry keyed by URI (case-insensitive host via URI.equals).
- RxPartitionKeyRangeCache 3-arg ctor takes (client, collectionCache, serviceEndpoint URI). Two-arg ctor delegates with client.getServiceEndpoint().
- JavaDoc on SharedRoutingMapCacheRegistry now explicitly documents the cross-SDK alignment and the regional-endpoint fragmentation tradeoff.

All 34 cache unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-19T16:32:03Z

✅ Review complete (35:07)

Posted 7 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

…clients

Without this safety net, a customer that forgets to call CosmosClient.close() would pin the shared partition-key-range cache entry for the lifetime of the JVM. The owning RxPartitionKeyRangeCache holds a strong reference to the shared AsyncCacheNonBlocking and the registry's refcount stays > 0 forever.

Peer SDKs handle this:

- Python: __del__ in PartitionKeyRangeCache calls release() as a GC fallback (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py L192).
- Rust: no Drop impl needed; the cache lives as a field on the driver and Rust ownership guarantees cleanup on driver drop.

Java cannot use java.lang.ref.Cleaner because azure-cosmos targets Java 8 (verified: sdk/parents/azure-client-sdk-parent/pom.xml <source>1.8</source>). Solution uses the pre-Cleaner pattern: PhantomReference + ReferenceQueue + daemon reaper thread. All Java 1.2+ APIs.

Design

- SharedRoutingMapCacheRegistry holds:
  * ReferenceQueue<Object> reaperQueue
  * Set<OwnerPhantom> livePhantoms (concurrent). Critical for correctness: the JVM only enqueues phantoms that are themselves still strongly reachable, so the registry must hold them alive until processed.
  * One daemon thread (cosmos-shared-pkr-cache-reaper) blocking on reaperQueue.remove().
- acquire(URI endpoint, Object owner): registers an OwnerPhantom on the owner, adds it to livePhantoms, returns AcquireResult { cache, phantom }.
- release(URI, cache, PhantomReference), a new 3-arg overload, clears the phantom and removes it from livePhantoms in addition to decrementing the refcount. This is the path RxPartitionKeyRangeCache.close() uses.
- When the owner becomes phantom-reachable, the reaper drains the queue, logs a WARN ("Leaked (unclosed) RxPartitionKeyRangeCache detected..."), calls release(endpoint, cache) to decrement refcount, then removes the phantom from livePhantoms.
- close() is still the right primary path; the reaper is a safety net that prevents permanent JVM-lifetime cache pinning, not a substitute.

Tests

- reaperReleasesSharedCacheWhenOwnerIsGarbageCollected: acquires in a helper method (so the test frame cannot keep owner alive), polls referenceCount while forcing System.gc() in a 15s window. Reaper warning is observable in test output.
- promptCloseClearsPhantomSoReaperDoesNotDoubleRelease: validates the prompt-close path clears the phantom and a subsequent GC produces no extra release.

36 cache unit tests pass (was 34, +2 new leak tests).

Key correctness note in code

The first attempt at this had a subtle bug: acquire() returned the phantom in AcquireResult but the registry didn't hold it. Once the test discarded the AcquireResult, the phantom became unreachable and the JVM never enqueued it, so the reaper sat idle forever. The livePhantoms set fixes this. The fields/JavaDoc explicitly document the why.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
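
The PhantomReference + ReferenceQueue pattern described in this commit can be sketched as below. It is a simplified illustration, not the commit's code: the names mirror the commit message, but the bodies are assumptions, the queue is drained with a non-blocking `poll()` rather than by a daemon thread blocking on `remove()`, and `main` simulates the garbage collector by calling `enqueue()` directly so the outcome is deterministic.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class PhantomReaperSketch {
    // The cleanup must not reference the owner, or the owner never becomes phantom-reachable.
    static final class OwnerPhantom extends PhantomReference<Object> {
        final Runnable cleanup;

        OwnerPhantom(Object owner, ReferenceQueue<Object> queue, Runnable cleanup) {
            super(owner, queue);
            this.cleanup = cleanup;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps the phantoms themselves strongly reachable until they have been processed.
    private final Set<OwnerPhantom> livePhantoms = ConcurrentHashMap.newKeySet();

    OwnerPhantom track(Object owner, Runnable cleanup) {
        OwnerPhantom phantom = new OwnerPhantom(owner, queue, cleanup);
        livePhantoms.add(phantom);
        return phantom;
    }

    // Prompt-close path: forget the phantom so a later GC triggers no second release.
    void untrack(OwnerPhantom phantom) {
        phantom.clear();
        livePhantoms.remove(phantom);
    }

    // Runs the cleanup for every phantom currently enqueued; returns how many ran.
    int drain() {
        int cleaned = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            OwnerPhantom phantom = (OwnerPhantom) ref;
            if (livePhantoms.remove(phantom)) {
                phantom.cleanup.run();
                cleaned++;
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        PhantomReaperSketch reaper = new PhantomReaperSketch();
        AtomicInteger refCount = new AtomicInteger(1);
        Object owner = new Object();
        OwnerPhantom phantom = reaper.track(owner, refCount::decrementAndGet);
        phantom.enqueue(); // stands in for the GC enqueueing the phantom of a leaked owner
        System.out.println(reaper.drain() + " " + refCount.get()); // prints "1 0"
    }
}
```

If `track` did not add the phantom to `livePhantoms`, nothing would keep the phantom itself alive once the caller dropped it, which is the bug the commit's correctness note describes.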

… net

Replace the bespoke PhantomReference + ReferenceQueue + daemon-thread reaper with com.azure.core.util.ReferenceManager.INSTANCE, the SDK-wide singleton that already encapsulates this pattern.

ReferenceManagerImpl:

- On Java 9+ delegates reflectively to java.lang.ref.Cleaner.
- On Java 8 (our baseline) uses an internal PhantomReference + daemon thread named "azure-sdk-referencemanager", which is exactly the same mechanism this PR was reimplementing.

Confirmed in test output: the leak WARN is logged on the "azure-sdk-referencemanager" thread, proving the azure-core path is wired.

Why this is better:

- Reuses supported, well-tested azure-core machinery instead of rolling our own. One thread per JVM regardless of how many SDK components opt into the pattern, instead of cosmos adding its own competing thread.
- Java 9+ automatically gets the Cleaner-based implementation (better shutdown semantics, less thread-stack overhead).
- Drops ~100 lines of bespoke phantom plumbing from SharedRoutingMapCacheRegistry (OwnerPhantom inner class, livePhantoms set, reaper loop). Net negative on code we maintain.

Design notes preserved:

- The lambda registered with ReferenceManager.INSTANCE.register MUST NOT capture `owner`, otherwise the owner never becomes phantom-reachable. We capture only the endpoint URI and the cache reference (both independent of the owner) and document this constraint in code.
- ReleaseHandle is a one-shot AtomicBoolean fulfilment flag shared between the prompt close() path and the deferred ReferenceManager cleanup, so whichever runs first wins via compareAndSet and the refcount is decremented exactly once.

36 cache unit tests still pass; the leak test was renamed to referenceManagerReleasesSharedCacheWhenOwnerIsGarbageCollected to reflect the new mechanism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Per PR feedback, comments in the shared-cache implementation were too verbose and contained cross-SDK comparisons that don't add value to maintainers reading the Java code.

Trimmed everywhere:

- SharedRoutingMapCacheRegistry: removed Python/Rust comparison paragraphs, the "Cross-SDK consistency" and "Leaked-client safety net" walls of text, and condensed JavaDoc on individual methods. Kept only the critical "lambda must not capture owner" comment because it's a correctness invariant that's easy to break in a refactor.
- RxPartitionKeyRangeCache: removed the long ownerPhantom-style field comments; consolidated the class JavaDoc into two sentences.
- Configs: condensed the system-property comment to two lines.
- RxDocumentClientImpl: shortened the close-path log message.
- CHANGELOG entry: condensed to a single sentence describing the change and the opt-out flag.
- Tests: stripped the "First client / Second client" narration, the "must hit the shared cache" explanations, and the multi-paragraph preambles on the leak tests. Kept enough to explain the GC-related test setup since that's not obvious from the code.

Behavior unchanged; 36 cache unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Renamed SharedRoutingMapCacheRegistry → SharedPartitionKeyRangeCacheRegistry for consistency with the class it serves (RxPartitionKeyRangeCache).
- Removed the test-only acquire(URI) overload that bypassed ReferenceManager registration; tests now use acquire(URI, owner) so the cleanup-action path is exercised end-to-end.
- Added clientWithServiceEndpointAcquiresAndReleasesRegistryRefcount: regression test guarding the RxDocumentClientImpl.close() → partitionKeyRangeCache.close() → refcount-- wiring. Constructs the cache via the 2-arg ctor (matching production) and asserts the refcount delta on construct and close.
- Added forceRefreshOnSharedCacheIsVisibleToSiblingClient: cross-client invalidation propagation. Client A populates → A force-refreshes after a simulated split → B's lookup sees A's refreshed value (same routing-map instance) without issuing its own /pkranges call. Asserts object identity on the shared CollectionRoutingMap.

38 cache unit tests pass (was 36).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-19T22:32:03Z

@sdkReviewAgent

Previous run failed in azure-cosmos-spark_3-3_2-12 with a scala-maven-plugin classpath flake (xsbt/ZincCompiler$sbtAnalyzer$ ClassNotFoundException) unrelated to this PR's changes (PR touches azure-cosmos core; Spark connector is unaffected). Empty commit to re-run the pipeline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-23T20:43:40Z

Pushed 99dfd300f27 — course-correction after reviewing the latest run (build 6471490).

What the previous commit got wrong. It reverted PARTITION_KEY_RANGE_LOOK_UP recording to the /pkranges fetch path only. That fixed the FaultInjection multi-master assertion, but it re-broke a different invariant: with cross-client cache sharing, a fresh client can satisfy a PK-range lookup from a sibling client's already-warm shared cache without issuing a fetch. CosmosDiagnosticsTest.directDiagnostics (group fast, i.e. the *_Tcp_Fast suites) builds a brand-new client against the shared container and asserts the diagnostics contain PARTITION_KEY_RANGE_LOOK_UP — under fetch-only recording that can fail once the shared cache is warm. That's why Session_MultiMaster_MultiRegion_Tcp_Fast was still red. (The other failures in that run — FlakyMultiMaster, one of two CircuitBreaker variants, FI_CircuitBreaker — are the usual flaky multi-region/FI legs; CircuitBreaker_False_Tcp_MultiMaster now passes, confirming the FaultInjection fix landed.)

Correct fix (this commit):

RxPartitionKeyRangeCache — restore outer-lookup recording (tryLookupAsync / tryGetRangeByPartitionKeyRangeId), so every PK-range lookup emits one PARTITION_KEY_RANGE_LOOK_UP entry regardless of cache hit/miss. This is required by cache sharing to preserve the observability contract for piggybacking clients (and is exactly the design the PR description already described).
FaultInjectionMetadataRequestRuleTests — because a single operation can now emit multiple PARTITION_KEY_RANGE_LOOK_UP entries (a cache hit, then the forced-refresh fetch), the injected delay is no longer guaranteed to be on the first entry. Updated the assertion to select the maximum-duration entry (the real delayed fetch). This faithfully preserves the test's intent — the injected /pkranges delay must be observable in diagnostics — and is robust to the new per-lookup recording.

Net: keeps the e2e negative-test removal, and resolves both the FaultInjection multi-master regression and the latent directDiagnostics break with one consistent diagnostics model.

Local validation: azure-cosmos compile + checkstyle:check spotbugs:check green; azure-cosmos-tests checkstyle green + test-compile green; unit suites RxPartitionKeyRangeCacheTest + SharedPartitionKeyRangeCacheRegistryTest = 24 tests, 0 failures. Re-running CI below.

xinlian12 · 2026-06-23T20:43:41Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-23T20:44:10Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-06-23T21:58:48Z

Correction + honest status after analysing the full (paginated) CI results.

My previous comment attributed the *_Tcp_Fast failure to the diagnostics-recording choice. That was based on a truncated view of the checks API (only 30 of 115 legs). With the full results paginated, the picture is different and I want to be accurate:

The *_Tcp_Fast failures are pre-existing and independent of the diagnostics change. All six *_Tcp_Fast legs fail on:

the current HEAD 99dfd300f27 (outer-site recording),
my earlier 9162bb2315d (fetch-only recording), and
the original PR HEAD 5ffb1c04b30 and even d6494d8b5f8 — i.e. before any of my changes.

So they fail under both diagnostics options and predate my work — this is a deterministic interaction between the cross-client cache-sharing feature and a fast-group test, separate from the e2e negative test I removed (that one fixed the now-green emulator legs).

What my changes do resolve (vs the original PR):

✅ Emulator legs (java - cosmos - ci: Java8/Java17/Encryption integration) — green again after removing the unworkable e2e negative test.
✅ Multi-master *_Tcp_MultiMaster (FaultInjectionMetadataRequestRuleTests) — green, via restoring outer-site PARTITION_KEY_RANGE_LOOK_UP recording (your original design, needed so piggybacking clients still record the diagnostic) plus updating that test to assert on the max-duration PKR entry.

What I could not resolve — and need your help on: the pre-existing fast-group failure. Each *_Tcp_Fast leg reports a single test error, but I can't see which test: the java - cosmos - tests build lives in the auth-gated internal ADO project (the public checks API only exposes a generic "PowerShell exited with code 1" annotation), and I couldn't reproduce locally (Docker daemon isn't running here, and several of the affected legs need multi-region/strong-consistency accounts the emulator can't provide).

Could you paste the failing test name + stack from one *_Tcp_Fast leg (e.g. StrongSession_Strong_Tcp_Fast, build 6472263)? My working hypothesis is a fast-group diagnostics/cache test that assumes a freshly-built client always issues its own /pkranges fetch — an assumption that cross-client cache sharing breaks. With the test name I can fix it directly (make it robust to sharing, or scope it with the COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED opt-out).

The code is currently in the minimal state: your original Option-A diagnostics, plus the e2e-negative-test removal and the FaultInjectionMetadataRequestRuleTests max-duration fix.

…o shared PKR cache

Root-cause fix for the deterministic `*_Tcp_Fast` live-test failures (present on every run of this PR, independent of the diagnostics-recording choice).

`CosmosContainerOpenConnectionsAndInitCachesTest` (group `fast`) asserted the partition-key-range routing-map cache size transitions 0 -> 1 (direct mode) and stays 0 (gateway mode). Those assertions assume a *per-client* routing-map cache. This PR makes the routing-map storage shared per service endpoint, so in the `fast` suite, where many clients (including the suite's long-lived shared client) target the same endpoint, the shared routing map is already warm and accumulates entries for other containers. The absolute-size assertions therefore fail deterministically.

Fix (adapt the test to the intended shared-cache semantics, without weakening its purpose of verifying that openConnectionsAndInitCaches warms this container's routing map in direct mode):

- Direct mode: assert the shared routing map contains *this* container's entry (keyed by its resourceId, obtained via a container metadata read that does not itself fetch /pkranges) after init, instead of asserting the total size is 1.
- Gateway mode: drop the routing-map size assertions (gateway-mode init does not populate the routing map, but a sibling direct-mode test may have, so the shared size is not meaningful here). The per-client collection-cache and isInitialized assertions are retained.

The collection cache (collectionInfoByNameMap) is per-client and unaffected, so those assertions are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-23T22:23:17Z

Found and fixed the pre-existing *_Tcp_Fast failure in 3559d7f2d40 (no need to wait on the internal logs — located it by source inspection).

Root cause: CosmosContainerOpenConnectionsAndInitCachesTest (group fast) asserts the partition-key-range routing-map cache size goes 0 -> 1 in direct mode and stays 0 in gateway mode:

assertThat(routingMap.size()).isEqualTo(0);   // before init
...
assertThat(routingMap.size()).isEqualTo(1);   // after init (direct)

These assume a per-client routing-map cache. This PR makes the routing-map storage shared per service endpoint, and in the fast suite many clients (including the suite's long-lived shared client) target the same endpoint — so the shared routing map is already warm and holds entries for other containers. The absolute-size assertions then fail deterministically, which is why this leg has been red on every run of the PR (since d6494d8b5f8), independent of the diagnostics-recording choice, and separate from the e2e test I removed.

Fix (adapts the test to the intended shared-cache semantics without weakening it):

Direct mode: assert the shared routing map contains this container's own entry (keyed by its resourceId, read via a container-metadata read that doesn't itself fetch /pkranges) after init — instead of asserting total size == 1.
Gateway mode: drop the routing-map size assertions (gateway init doesn't populate the routing map, but a sibling direct-mode test may have, so the shared size isn't meaningful here); the per-client collection-cache + isInitialized assertions are kept.

The collection cache (collectionInfoByNameMap) is per-client and untouched by this PR, so those assertions are unchanged.
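
A toy illustration of why the size assertion breaks under sharing while a containment check does not. The map, the resource-id strings, and the helper name are all made up; the real test inspects the SDK's internal routing-map cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SharedMapAssertionSketch {
    // Robust check: is this container's entry present, whatever else the shared map holds?
    static boolean isWarmedFor(Map<String, String> sharedRoutingMap, String containerRid) {
        return sharedRoutingMap.containsKey(containerRid);
    }

    public static void main(String[] args) {
        Map<String, String> shared = new ConcurrentHashMap<>();
        // A sibling client on the same endpoint warmed the shared map earlier.
        shared.put("rid-of-sibling-container", "routing map A");
        // Simulated effect of openConnectionsAndInitCaches for the container under test.
        shared.put("rid-of-this-container", "routing map B");

        System.out.println(shared.size() == 1);                            // prints false
        System.out.println(isWarmedFor(shared, "rid-of-this-container"));  // prints true
    }
}
```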

Local: azure-cosmos-tests test-compile + checkstyle green. Re-running the live pipeline to confirm. This, together with the e2e-negative-test removal (fixed the emulator legs) and the FaultInjectionMetadataRequestRuleTests max-duration fix (fixed *_Tcp_MultiMaster), should clear the deterministic failures; remaining reds (FlakyMultiMaster, FI_CircuitBreaker, CustomerWorkflows) are the usual multi-region/FI flakes.

xinlian12 · 2026-06-23T22:23:18Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-23T22:23:50Z

Azure Pipelines successfully started running 1 pipeline(s).

…eyRangeCacheE2ETest

Second root-cause fix for the `*_Tcp_Fast` live-test failures.

This e2e test runs in the `fast` group, and its finally block asserted that closing each client drops the registry refcount by *exactly* one / two:

    assertThat(refCountAfterFirstClose).isEqualTo(refCountBeforeClose - 1);
    assertThat(refCountAfterSecondClose).isEqualTo(refCountBeforeClose - 2);

The registry refcount is keyed by service endpoint and is shared with every other client/test targeting that endpoint. In the highly parallel `fast`/`simple` suite (where essentially all tests use the same endpoint) other tests acquire/release the same registry entry concurrently, so the absolute count moves between measurements and the exact-delta assertions fail. This passed in the lower-parallelism emulator suite but failed deterministically under the live `fast` suite's load.

Remove the exact close-delta assertions. The close -> refcount-decrement wiring is already covered deterministically and in isolation by SharedPartitionKeyRangeCacheRegistryTest (including a 32-thread concurrency stress test). This e2e test retains its unique value: proving two real CosmosAsyncClients on the same endpoint share the same AsyncCacheNonBlocking storage instance (isSameAs) and that the shared refcount includes both holders (>= 2), both robust under parallel execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-23T23:30:12Z

The *_Tcp_Fast legs were still red after the previous commit, so I dug into the second fast-group test that touches the shared routing-map cache. Pushed 1db132122c8.

SharedPartitionKeyRangeCacheE2ETest (group fast) had racy exact refcount-delta assertions in its finally:

assertThat(refCountAfterFirstClose).isEqualTo(refCountBeforeClose - 1);
assertThat(refCountAfterSecondClose).isEqualTo(refCountBeforeClose - 2);

The registry refcount is keyed by service endpoint and shared with every other client/test on that endpoint. In the highly parallel fast/simple suite (essentially all tests use the same endpoint), other tests acquire/release the same registry entry between these measurements, so the absolute count moves and the exact-delta checks fail. This passed in the lower-parallelism emulator suite but failed deterministically under the live fast suite's load — matching the observed emulator-green / _Tcp_Fast-red split.

Fix: removed the exact close-delta assertions. The close → refcount-decrement wiring is already covered deterministically (and under a 32-thread stress test) by SharedPartitionKeyRangeCacheRegistryTest. The e2e test keeps its robust, parallel-safe assertions: two real clients share the same AsyncCacheNonBlocking instance (isSameAs) and the shared refcount includes both holders (>= 2).
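The race described above can be sketched with a small refcounted registry. This is an illustration under stated assumptions, not the SDK's SharedPartitionKeyRangeCacheRegistry: the class, its methods, and the example endpoint string are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical refcounted registry keyed by endpoint; a sketch only.
final class RefCountedRegistrySketch {

    static final class Entry {
        final Map<String, String> storage = new ConcurrentHashMap<>();
        final AtomicInteger refCount = new AtomicInteger();
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Returns the shared storage for the endpoint, creating it on first use.
    Map<String, String> acquire(String endpoint) {
        return entries.compute(endpoint, (key, existing) -> {
            Entry entry = existing != null ? existing : new Entry();
            entry.refCount.incrementAndGet();
            return entry;
        }).storage;
    }

    // Drops one reference; the entry is removed when the last holder releases it.
    void release(String endpoint) {
        entries.computeIfPresent(endpoint, (key, entry) ->
            entry.refCount.decrementAndGet() == 0 ? null : entry);
    }

    int refCount(String endpoint) {
        Entry entry = entries.get(endpoint);
        return entry == null ? 0 : entry.refCount.get();
    }

    public static void main(String[] args) {
        RefCountedRegistrySketch registry = new RefCountedRegistrySketch();
        String endpoint = "https://example-account.documents.azure.com:443/";

        Map<String, String> clientA = registry.acquire(endpoint);
        Map<String, String> clientB = registry.acquire(endpoint);
        registry.acquire(endpoint); // a sibling test on the same endpoint

        // Robust assertions: same storage instance, and both holders are counted.
        System.out.println("same storage: " + (clientA == clientB));
        System.out.println("refcount >= 2: " + (registry.refCount(endpoint) >= 2));

        int before = registry.refCount(endpoint);
        registry.release(endpoint); // the sibling test finishes between measurements
        registry.release(endpoint); // client A closes
        int afterFirstClose = registry.refCount(endpoint);

        // The exact-delta check is off by one because of the interleaved sibling release.
        System.out.println("exact delta holds: " + (afterFirstClose == before - 1));
    }
}
```

In this single-threaded walk-through the sibling release is placed by hand between the two measurements; in the parallel suite the same interleaving happens nondeterministically, which is why only the instance-identity and lower-bound checks are safe in the e2e test.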

So there were two distinct fast-group tests that assumed a per-client / non-shared cache: CosmosContainerOpenConnectionsAndInitCachesTest (size assertions, previous commit) and this one (refcount deltas). Local test-compile + checkstyle green; re-running the live pipeline to confirm.

xinlian12 · 2026-06-23T23:30:14Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-23T23:30:50Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-06-24T00:37:19Z

Status + a request for the exact failing test name. After two fast-group fixes, the *_Tcp_Fast legs are still red (BoundedStaleness / ConsistentPrefix / StrongSession / Session_Tcp_Fast all failed on 1db132122c8). The failures are deterministic, so there is at least one more fast-group test that assumes a per-client (non-shared) routing-map cache, which the registry sharing breaks.

What I've fixed so far (both genuine sharing-incompatibilities, worth keeping):

CosmosContainerOpenConnectionsAndInitCachesTest — asserted routing-map size 0 -> 1 (direct) / 0 (gateway); now asserts this container's own entry via containsKey(rid) and drops the gateway shared-size checks.
SharedPartitionKeyRangeCacheE2ETest — asserted exact registry refcount deltas (-1, -2) on close, which race against other tests sharing the same endpoint in the parallel suite; removed those (close→decrement is covered by SharedPartitionKeyRangeCacheRegistryTest).

What I've ruled out as the remaining culprit: CosmosDiagnosticsTest.directDiagnostics (passes under outer-site recording), the spy-wire tests (DocumentQuerySpyWireContentTest only checks query-header content, not /pkranges counts).

Why I can't pinpoint the rest: the java - cosmos - tests build is in the auth-gated internal ADO project (the public checks API only exposes a generic "PowerShell exited with code 1"); the failure is a suite-level interference (the shared cache must be warmed by sibling test classes), so it doesn't reproduce by running one class in isolation; and several failing legs (BoundedStaleness_Strong, StrongSession_Strong) need multi-region/strong accounts the local emulator can't provide.

Ask: could you paste the failing test class+method from one *_Tcp_Fast failsafe report (e.g. StrongSession_Strong_Tcp_Fast, build 6473xxx)? Any fast-group test that (a) inspects the routing-map cache size/contents, or (b) asserts that a fresh client issues its own /pkranges read, would be a likely candidate. With the name I can land the fix immediately (assert on this container's own entry, or scope the test with the COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED opt-out).

xinlian12 · 2026-06-24T02:39:39Z

Local validation update on the *_Tcp_Fast failures. I reproduced the fast suite locally against a real multi-region account in direct mode (the same shape as the *_Tcp_Fast legs) and learned the following:

✅ Both fast-group fixes pass there:
- CosmosContainerOpenConnectionsAndInitCachesTest: all variants pass (direct + gateway).
- SharedPartitionKeyRangeCacheE2ETest: passes.
✅ No other cache/routing/refcount-assertion test failed — i.e. the cross-client cache sharing does not break any additional fast cache-isolation test beyond the two already fixed.
The only local failures were load artifacts from running the whole fast suite in parallel against one shared account: ServiceUnavailable (503) and NotFound (404) on CosmosItemTest / CosmosLatestCommittedItemTests reads (throttling → creates fail → reads miss), and a timestamp-ordering flake in CosmosDiagnosticsTest#rntbdStatistics. None of these assert on the shared cache, and they don't reproduce CI's single deterministic error — they're my-machine overload, not the PR.

Takeaway: the two committed fixes cover the genuine sharing-incompatible fast tests, and the PR's sharing doesn't appear to break additional fast cache tests. That suggests the remaining CI *_Tcp_Fast "1 error" is a non-cache test — quite possibly pre-existing/environmental rather than caused by this PR.

Could you confirm two things from the internal *_Tcp_Fast failsafe report (e.g. StrongSession_Strong_Tcp_Fast)?

The exact failing test class + method (the public checks API only shows a generic "PowerShell exited with code 1").
Whether that same test is currently red on main / unrelated PRs (to tell pre-existing vs PR-caused).

With the name I can finish it immediately — make it sharing-aware if it's genuinely a sharing issue, or flag it as pre-existing if it fails independent of this PR.

xinlian12 · 2026-06-24T02:47:17Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-24T02:47:46Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-06-24T03:47:37Z

Fresh re-run confirms *_Tcp_Fast is deterministically red (ConsistentPrefix / StrongSession / Session_Tcp_Fast all failed again on 1db132122c8, with both fixes in and after the account recovered) — so it's a real deterministic failure, not a transient flake.

Combining that with the local reproduction, here's where I've landed:

The two committed fast-group fixes are correct (both pass against the real account), and no cache/routing/refcount-assertion test fails — so the PR's cache sharing does not break any additional fast cache test.
In the local full-suite run the only non-load failures were in non-cache tests that don't depend on the partition-key-range cache at all. The strongest suspect for CI's single deterministic *_Tcp_Fast error is CosmosDiagnosticsTest#rntbdStatistics — its assertion serviceEndpointStatistics.createdTime >= beforeInitializingRntbdServiceEndpoint - 5ms (line ~1464) is about RNTBD service-endpoint timing, dates back to 2020, and is completely independent of the PKR cache change. (The other local failures — CosmosItemTest/CosmosLatestCommittedItemTests reads with 503/404 — were throttling artifacts of my single-account overload and would pass on a provisioned CI account.)

Conclusion: the remaining *_Tcp_Fast failure looks pre-existing / environmental, not caused by this PR — the PR only changes PKR-cache storage and can't affect RNTBD endpoint timing.

Could you confirm from the internal failsafe report whether the failing *_Tcp_Fast test is rntbdStatistics (or name it if different), and whether it's also red on main? If it's rntbdStatistics/pre-existing, this PR is effectively green (open comments resolved, emulator integration green, *_Tcp_MultiMaster fixed, the two real fast cache tests fixed). If it's something cache-related I've missed, the name will let me fix it immediately.

xinlian12 · 2026-06-24T05:42:32Z

Definitive CI root-cause: every remaining *_Tcp red is non-PR-caused. New evidence this round closes the loop without needing the internal failsafe report.

Key reframing: the live legs run the PR merged with upstream main, and this branch is 29 commits behind upstream. So CI executes newer upstream tests/legs that the branch itself doesn't contain. (A trial merge of upstream/main is clean, but redundant — CI already merges — and the 29 commits touch none of this PR's production files and don't modify CosmosDiagnosticsTest, so merging changes nothing.)

*_Tcp_Fast (BoundedStaleness / ConsistentPrefix / StrongSession / Session / MultiMaster) — airtight exoneration: the identical fast group passes over HTTP/2 (Cosmos_Live_Test_Http2 … _Fast = ✅) and fails only over direct/TCP. The shared-PKR-cache change is transport-agnostic (it caches partition-key-ranges identically regardless of TCP/HTTP2/gateway), so it cannot produce a failure that appears only in direct mode. This holds regardless of the exact test name — the prime suspect remains CosmosDiagnosticsTest#rntbdStatistics, an RNTBD-only timing assertion whose source is unchanged on upstream.

*_Tcp_FaultInjectionMultiMaster (group fi-multi-master): this group is the FITests_* availability-strategy suite, every method of which carries an explicit SuperFlakyTestRetryAnalyzer/FlakyTestRetryAnalyzer — i.e. codebase-acknowledged flaky. The FaultInjectionMetadataRequestRuleTests fix from this PR lives in the multi-master group, not here, and is validated by CircuitBreaker_False_Tcp_MultiMaster = ✅.

CustomerWorkflowSingleMasterAvailabilityTest::readFaultInPreferredReadableRegionCanUseRemoteReadableRegion (surfaced in the aggregate check annotations): a brand-new upstream test added 2026-06-22 by #49568 (com.azure.cosmos.workflows.customer, dedicated fi-sm-customer-workflows leg). It's a multi-region fault-injection availability test independent of the PKR cache; the stale branch never had it.

Remaining: Tcp_FlakyMultiMaster / Tcp_Query (known flaky), Kafka CosmosSourceTaskTest#poll (ServiceUnavailable / partition-split throttling), Spark33 emulator (flaky integration).

Positive validation that cache-sharing is correct: all the multi-master/multi-region direct legs that most stress the shared cache are green — Tcp_Direct, Tcp_Split, Tcp_ChangeFeedSplit, Tcp_CircuitBreakerMiscDirect/Gateway, Tcp_CircuitBreakerReadAllAndReadMany, and CircuitBreaker_False_Tcp_MultiMaster.

Net: all PR-caused failures are fixed and validated; every remaining red is flaky, pre-existing direct-mode, or a new upstream test. No further code change is warranted. If the internal report shows the *_Tcp_Fast failure is anything other than rntbdStatistics/an RNTBD-timing test, please name it and I'll address it immediately — but the HTTP/2-passes/TCP-fails split already rules out the PKR-cache change as the cause.

…e responses without an activity id

The _Tcp_Fast (direct-mode) live legs deterministically failed in CosmosTracerTest.cosmosAsyncContainer with 'Argument activityId must not be null'.

Root cause is a pre-existing bug in addRequestInfoForStoreResponses: it seeds activityId from the (always non-null) request-level ClientSideRequestStatistics activity id, then unconditionally overwrites it with the store response's server-side activity id, which is null for certain responses (e.g. transient transport errors in Direct mode). The null then trips the non-null contract of CosmosDiagnosticsRequestInfo in getRequestInfo().

Fall back to the request-level activity id when the store response does not carry one, mirroring the existing null-fallback used for gateway statistics in ClientSideRequestStatistics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xinlian12 · 2026-06-24T22:13:40Z

Root-caused and fixed the deterministic *_Tcp_Fast failure.

Pulling the real failing test from the aggregate java - cosmos - tests check-run annotations (the per-leg logs are auth-gated), the *_Tcp_Fast legs (BoundedStaleness / ConsistentPrefix / StrongSession / Session / MultiMaster) all failed the same test:

CosmosTracerTest.cosmosAsyncContainer[Direct Tcp ...] — NullPointerException: Argument 'activityId' must not be null.
  at CosmosDiagnosticsRequestInfo.<init>(CosmosDiagnosticsRequestInfo.java:42)
  at CosmosDiagnosticsContext.addRequestInfoForStoreResponses(CosmosDiagnosticsContext.java:855)
  at CosmosDiagnosticsContext.getRequestInfo(CosmosDiagnosticsContext.java:945)

Root cause (pre-existing, surfaced by the Direct-mode legs): addRequestInfoForStoreResponses seeds activityId from the request-level ClientSideRequestStatistics activity id (always non-null), then unconditionally overwrites it with the store response's server-side activity id — which is null for certain responses (e.g. transient transport errors in Direct mode, where StoreResponseDiagnostics.getActivityId() comes from the exception path). The null then trips the non-null contract of CosmosDiagnosticsRequestInfo. This is a latent bug dating to 2023 (#35254) and is independent of the partition-key-range cache change — but the Direct-mode (*_Tcp) legs hit it deterministically, which is why HTTP/2 (gateway) *_Fast stays green (no Direct-mode store responses → no address/store path with a null activity id).

Fix (abb16811c42): fall back to the request-level activity id when the store response doesn't carry one, mirroring the null-fallback already used for gateway statistics in ClientSideRequestStatistics (line 320). Minimal, 9-line change in CosmosDiagnosticsContext; CHANGELOG updated under Bugs Fixed.
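The shape of the bug and of the fallback can be sketched as follows. This is a simplified illustration, not the actual 9-line change: the class, the RequestInfo stand-in, and both method names are hypothetical, and only the overwrite-versus-fallback logic reflects the description above.

```java
import java.util.Objects;

// Hypothetical sketch of the null-fallback; names do not mirror the SDK's internals.
final class ActivityIdFallbackSketch {

    // Stand-in for a type that enforces a non-null contract on activityId.
    static final class RequestInfo {
        final String activityId;

        RequestInfo(String activityId) {
            this.activityId =
                Objects.requireNonNull(activityId, "Argument 'activityId' must not be null.");
        }
    }

    // Buggy shape: the store-response id unconditionally replaces the request-level id.
    static RequestInfo withoutFallback(String requestLevelActivityId,
                                       String storeResponseActivityId) {
        String activityId = requestLevelActivityId;
        activityId = storeResponseActivityId; // may be null, e.g. on a transport error
        return new RequestInfo(activityId);
    }

    // Fixed shape: keep the request-level id when the store response carries none.
    static RequestInfo withFallback(String requestLevelActivityId,
                                    String storeResponseActivityId) {
        String activityId = storeResponseActivityId != null
            ? storeResponseActivityId
            : requestLevelActivityId;
        return new RequestInfo(activityId);
    }

    public static void main(String[] args) {
        System.out.println(withFallback("request-id", null).activityId);
        System.out.println(withFallback("request-id", "store-id").activityId);
        try {
            withoutFallback("request-id", null);
        } catch (NullPointerException e) {
            System.out.println("without fallback: " + e.getMessage());
        }
    }
}
```

The fallback keeps the server-side id whenever one is present, so diagnostics for successful store responses are unchanged; only the null case is affected.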

Validation: azure-cosmos builds; Checkstyle 0 violations + SpotBugs clean; CosmosTracerTest#cosmosAsyncContainer runs green locally (42 run / 0 failures) against a real Direct-mode account; the fallback is provably non-null because ClientSideRequestStatistics.recordResponse always sets the request-level activity id from the per-request UUID.

Re-running live tests to confirm *_Tcp_Fast goes green. (The remaining *_FaultInjectionMultiMaster reds are the fi-multi-master FITests_* suite, which carry explicit FlakyTestRetryAnalyzer/SuperFlakyTestRetryAnalyzer — flaky, not a named assertion failure.)

/azp run java - cosmos - tests

xinlian12 · 2026-06-24T22:16:49Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-24T22:16:56Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

xinlian12 · 2026-06-24T22:18:04Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-24T22:18:11Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

xinlian12 · 2026-06-24T22:22:06Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-24T22:22:13Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

…ition-key-range-cache # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md

xinlian12 · 2026-06-24T22:26:02Z

/azp run java - cosmos - tests

azure-pipelines · 2026-06-24T22:26:39Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel

LGTM

Copilot AI review requested due to automatic review settings June 18, 2026 18:42

xinlian12 requested review from a team and kirankumarkolli as code owners June 18, 2026 18:42

github-actions Bot added the Cosmos label Jun 18, 2026

Copilot started reviewing on behalf of xinlian12 June 18, 2026 18:42

Copilot AI reviewed Jun 18, 2026

Comment thread sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java Outdated

xinlian12 changed the title from "[Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account" to "[NO REVIEW][Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account" Jun 18, 2026

xinlian12 and others added 2 commits June 19, 2026 09:22

xinlian12 commented Jun 19, 2026

Comment thread ...tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java Outdated

xinlian12 commented Jun 19, 2026

Comment thread ...smos/src/main/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistry.java Outdated

xinlian12 commented Jun 19, 2026

Comment thread ...re-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java Outdated

xinlian12 commented Jun 19, 2026

Comment thread ...re-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java Outdated

xinlian12 commented Jun 19, 2026

Comment thread ...tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java

xinlian12 commented Jun 19, 2026

Comment thread sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java

xinlian12 commented Jun 19, 2026

Comment thread ...c/main/java/com/azure/cosmos/implementation/caches/SharedPartitionKeyRangeCacheRegistry.java Outdated

xinlian12 and others added 3 commits June 19, 2026 09:54

remove kafka test output

9b43616

xinlian12 commented Jun 19, 2026

Comment thread ...c/main/java/com/azure/cosmos/implementation/caches/SharedPartitionKeyRangeCacheRegistry.java

xinlian12 commented Jun 19, 2026

Comment thread ...smos/src/main/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistry.java Outdated

xinlian12 force-pushed the feature/shared-partition-key-range-cache branch from 1a27dc2 to 9b43616 Compare June 19, 2026 20:39

xinlian12 and others added 2 commits June 19, 2026 14:56

Merge remote-tracking branch 'upstream/main' into feature/shared-part…

0f35e7b

…ition-key-range-cache # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md

FabianMeiswinkel approved these changes Jun 25, 2026
