Fix direct retry handling for partition key range gone#49613
Fix direct retry handling for partition key range gone#49613arnabnandy7 wants to merge 6 commits into
Conversation
|
Thank you for your contribution @arnabnandy7! We will review the pull request and get back to you soon. |
There was a problem hiding this comment.
Pull request overview
This PR fixes a gap in Cosmos direct-mode retry behavior where PartitionKeyRangeGoneException (410/1002) thrown during address resolution could bypass the “Gone-family” retry path and leak to callers. The retry policy now treats this exception as retriable, clears stale resolved partition state, forces a partition key range (routing-map) refresh, and wraps exhausted retries as 503 with the PARTITION_KEY_RANGE_GONE_EXCEEDED_RETRY_LIMIT substatus.
Changes:
- Added
PartitionKeyRangeGoneExceptionhandling toGoneAndRetryWithRetryPolicyas a retriable Gone-family condition. - On
PartitionKeyRangeGoneException, clears stale per-request resolved partition state and forces partition key range refresh before retrying. - Added unit tests covering retry + exhausted-budget wrapping behavior, and documented the fix in the Cosmos changelog.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/GoneAndRetryWithRetryPolicy.java | Treats PartitionKeyRangeGoneException as retriable; clears stale request routing state and forces PK-range refresh. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds a bug-fix entry describing the retry/wrapping behavior change. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/GoneAndRetryWithRetryPolicyTest.java | Adds unit tests validating retry behavior and 503 wrapping when the retry budget is exhausted. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…ew bullet is missing the trailing period.
| return handlePartitionIsMigratingException((PartitionIsMigratingException)exception); | ||
| } else if (exception instanceof PartitionKeyRangeIsSplittingException) { | ||
| return handlePartitionKeyIsSplittingException((PartitionKeyRangeIsSplittingException) exception); | ||
| } else if (exception instanceof PartitionKeyRangeGoneException) { |
There was a problem hiding this comment.
This change would need adjoining FaultInjection tests as well to simulate e2e behavior - PKRangeGone as far as I recall is only happening for address lookups against Gateway - and is handled there. So, I don't think adding it to GoneAndRetryWithRetryPolicy is needed / helps.
@xinlian12 - can you double-check?
There was a problem hiding this comment.
Thanks @FabianMeiswinkel that makes sense. The intent here was to cover the case where direct-mode address resolution surfaces 410/1002 as PartitionKeyRangeGoneException before the request can be re-routed, so this policy clears the resolved PKRange and forces a PKRange refresh similar to split handling.
I agree the PR should prove that path with FaultInjection instead of only unit tests. I see FaultInjectionServerErrorType.PARTITION_IS_GONE maps to 410/1002, so I can add direct-mode FI coverage around that. If this exception is guaranteed to be handled before reaching GoneAndRetryWithRetryPolicy, then I’m happy to close/rework this PR instead. @xinlian12 could you confirm the expected layer?
Description
Fixes direct connectivity retry handling for
PartitionKeyRangeGoneExceptionthrown during address resolution. The retry policy now treats this exception as part of the Gone-family retry path, clears stale resolved partition range state, forces partition key range refresh, and surfaces exhausted retries as 503 with the partition-key-range-gone retry-limit substatus instead of leaking 410/1002.Fixes #49381.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines