Skip to content

Fix direct retry handling for partition key range gone#49613

Open
arnabnandy7 wants to merge 6 commits into
Azure:mainfrom
arnabnandy7:fix/cosmos-410-1002-pkrg
Open

Fix direct retry handling for partition key range gone#49613
arnabnandy7 wants to merge 6 commits into
Azure:mainfrom
arnabnandy7:fix/cosmos-410-1002-pkrg

Conversation

@arnabnandy7

Copy link
Copy Markdown
Contributor

Description

Fixes direct connectivity retry handling for PartitionKeyRangeGoneException thrown during address resolution. The retry policy now treats this exception as part of the Gone-family retry path, clears stale resolved partition range state, forces partition key range refresh, and surfaces exhausted retries as 503 with the partition-key-range-gone retry-limit substatus instead of leaking 410/1002.

Fixes #49381.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings June 23, 2026 21:46
@arnabnandy7 arnabnandy7 requested review from a team and kirankumarkolli as code owners June 23, 2026 21:46
@github-actions github-actions Bot added Community Contribution Community members are working on the issue Cosmos customer-reported Issues that are reported by GitHub users external to the Azure organization. labels Jun 23, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Thank you for your contribution @arnabnandy7! We will review the pull request and get back to you soon.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a gap in Cosmos direct-mode retry behavior where PartitionKeyRangeGoneException (410/1002) thrown during address resolution could bypass the “Gone-family” retry path and leak to callers. The retry policy now treats this exception as retriable, clears stale resolved partition state, forces a partition key range (routing-map) refresh, and wraps exhausted retries as 503 with the PARTITION_KEY_RANGE_GONE_EXCEEDED_RETRY_LIMIT substatus.

Changes:

  • Added PartitionKeyRangeGoneException handling to GoneAndRetryWithRetryPolicy as a retriable Gone-family condition.
  • On PartitionKeyRangeGoneException, clears stale per-request resolved partition state and forces partition key range refresh before retrying.
  • Added unit tests covering retry + exhausted-budget wrapping behavior, and documented the fix in the Cosmos changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/GoneAndRetryWithRetryPolicy.java Treats PartitionKeyRangeGoneException as retriable; clears stale request routing state and forces PK-range refresh.
sdk/cosmos/azure-cosmos/CHANGELOG.md Adds a bug-fix entry describing the retry/wrapping behavior change.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/GoneAndRetryWithRetryPolicyTest.java Adds unit tests validating retry behavior and 503 wrapping when the retry budget is exhausted.

Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md Outdated
arnabnandy7 and others added 2 commits June 24, 2026 03:26
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

return handlePartitionIsMigratingException((PartitionIsMigratingException)exception);
} else if (exception instanceof PartitionKeyRangeIsSplittingException) {
return handlePartitionKeyIsSplittingException((PartitionKeyRangeIsSplittingException) exception);
} else if (exception instanceof PartitionKeyRangeGoneException) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would need adjoining FaultInjection tests as well to simulate e2e behavior - PKRangeGone as far as I recall is only happening for address lookups against Gateway - and is handled there. So, I don't think adding it to GoneAndRetryWithRetryPolicy is needed / helps.

@xinlian12 - can you double-check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FabianMeiswinkel that makes sense. The intent here was to cover the case where direct-mode address resolution surfaces 410/1002 as PartitionKeyRangeGoneException before the request can be re-routed, so this policy clears the resolved PKRange and forces a PKRange refresh similar to split handling.

I agree the PR should prove that path with FaultInjection instead of only unit tests. I see FaultInjectionServerErrorType.PARTITION_IS_GONE maps to 410/1002, so I can add direct-mode FI coverage around that. If this exception is guaranteed to be handled before reaching GoneAndRetryWithRetryPolicy, then I’m happy to close/rework this PR instead. @xinlian12 could you confirm the expected layer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community Contribution Community members are working on the issue Cosmos customer-reported Issues that are reported by GitHub users external to the Azure organization.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[COSMOS] Direct: Fixes 410/1002 PartitionKeyRangeGone bubbling up to point-read callers without routing-map refresh

3 participants