fix(spanner): enforce READY-only location aware routing and add endpoint lifecycle management#12678

Open
rahul2393 wants to merge 1 commit into main from spanner-location-aware-channel-probing

@rahul2393
Contributor

Summary

  • READY-only bypass routing: isHealthy() now requires READY state instead of
    "not SHUTDOWN and not TRANSIENT_FAILURE". Requests targeting IDLE or CONNECTING
    endpoints silently fall back to the default host without emitting skipped_tablets.
  • State-aware skipped_tablets: only TRANSIENT_FAILURE endpoints are reported
    in skipped_tablets so the server can refresh the client cache. Other non-ready
    states (IDLE, CONNECTING, absent) are skipped silently.
  • Non-creating lookup path: added getIfPresent() to ChannelEndpointCache so
    the foreground request path never triggers blocking endpoint creation.
  • Affinity fix: affinityEndpoint() uses getIfPresent() + READY check, falling
    back to default instead of forcing traffic to a stale replica.
  • EndpointLifecycleManager (new): background GetSession probes keep replica
    channels warm, per-endpoint lastRealTrafficAt tracking enables idle eviction
    after 30 minutes, and requestEndpointRecreation() is called from the routing
    path when an evicted endpoint is needed again.
  • Stale endpoint detection: shouldSkip() detects shutdown channels left behind by
    evicted endpoints, clears the cached reference, and looks the endpoint up again
    in the cache so routing picks up recreated endpoints.
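
The routing rules in the list above can be sketched as a small decision function. This is a minimal illustration, not the PR's code: ChannelState is a hypothetical stand-in for gRPC's connectivity states, ReadyOnlyRouter and RoutingDecision are made-up names, and treating SHUTDOWN as a silent fallback is an assumption (the summary only lists IDLE, CONNECTING, and absent).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for gRPC connectivity states, so the sketch has no gRPC dependency. */
enum ChannelState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN }

/** What the request path does with a candidate replica endpoint. */
enum RoutingDecision { USE_ENDPOINT, FALLBACK_SILENT, FALLBACK_AND_REPORT_SKIPPED }

/** Hypothetical router; only the decision logic mirrors the summary above. */
final class ReadyOnlyRouter {
  // address -> last observed channel state; entries are added by background work only.
  private final Map<String, ChannelState> endpoints = new ConcurrentHashMap<>();

  /** Background path: record the latest observed state for an endpoint. */
  void update(String address, ChannelState state) {
    endpoints.put(address, state);
  }

  /** Non-creating lookup: returns null when the endpoint is absent, never creates one. */
  ChannelState getIfPresent(String address) {
    return endpoints.get(address);
  }

  /** Foreground path: READY routes, TRANSIENT_FAILURE is reported, everything else is silent. */
  RoutingDecision decide(String address) {
    ChannelState state = getIfPresent(address);
    if (state == ChannelState.READY) {
      return RoutingDecision.USE_ENDPOINT;
    }
    if (state == ChannelState.TRANSIENT_FAILURE) {
      // Reported in skipped_tablets so the server can refresh the client cache.
      return RoutingDecision.FALLBACK_AND_REPORT_SKIPPED;
    }
    // IDLE, CONNECTING, absent (null), and (by assumption) SHUTDOWN: use the default host quietly.
    return RoutingDecision.FALLBACK_SILENT;
  }
}
```

Because decide() only reads from the map, the request path in this sketch cannot block on endpoint creation, which is the property the getIfPresent() bullet describes.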

…int lifecycle management

Location aware routing previously treated IDLE and CONNECTING channels as healthy,
which could send traffic to stale replicas after cache updates. This change
tightens endpoint readiness to READY-only, adds state-aware skipped_tablets
reporting (TRANSIENT_FAILURE only), and introduces a background lifecycle
manager that probes endpoints with GetSession to keep channels warm and
evicts idle endpoints after 30 minutes of no real traffic.
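
The idle-eviction rule described here can be sketched as follows. This is an illustration under stated assumptions rather than the EndpointLifecycleManager itself: IdleEvictionTracker and its methods are hypothetical names, timestamps are passed in explicitly so the logic is easy to test, and the caller is assumed to shut down the channel for each evicted address.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker: evicts endpoints that have seen no real traffic for idleTimeout. */
final class IdleEvictionTracker {
  private final Duration idleTimeout;
  // address -> timestamp (nanoseconds) of the last real request, not the last probe.
  private final Map<String, Long> lastRealTrafficAtNanos = new ConcurrentHashMap<>();

  IdleEvictionTracker(Duration idleTimeout) {
    this.idleTimeout = idleTimeout;
  }

  /** Called for real requests only; background probes must not refresh the timestamp. */
  void recordRealTraffic(String address, long nowNanos) {
    lastRealTrafficAtNanos.put(address, nowNanos);
  }

  boolean isTracked(String address) {
    return lastRealTrafficAtNanos.containsKey(address);
  }

  /** Removes and returns every address whose last real traffic is at least idleTimeout old. */
  List<String> evictIdle(long nowNanos) {
    List<String> evicted = new ArrayList<>();
    for (Map.Entry<String, Long> entry : lastRealTrafficAtNanos.entrySet()) {
      boolean idle = nowNanos - entry.getValue() >= idleTimeout.toNanos();
      // remove(key, value) skips the eviction if real traffic updated the timestamp concurrently.
      if (idle && lastRealTrafficAtNanos.remove(entry.getKey(), entry.getValue())) {
        evicted.add(entry.getKey());
      }
    }
    return evicted;
  }
}
```

With a 30-minute timeout, an endpoint last used at minute 0 is evicted by a sweep at minute 30, while one used at minute 20 stays. Keeping probes out of recordRealTraffic() is what lets warm-but-unused channels age out.
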
rahul2393 requested review from a team as code owners on April 4, 2026 03:38
rahul2393 requested a review from sakthivelmanii on April 4, 2026 03:38
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the EndpointLifecycleManager to manage the lifecycle of location-aware routing endpoints, including background probing, traffic tracking, and idle eviction. It updates the routing logic in KeyRangeCache and KeyAwareChannel to be state-aware, ensuring only READY endpoints are utilized and TRANSIENT_FAILURE states are reported. Feedback suggests expanding the session extraction logic to include partition requests and addresses a potential concurrency bottleneck in the single-threaded executor used for background endpoint creation.

Comment on lines +122 to +128
Executors.newScheduledThreadPool(
    1,
    r -> {
      Thread t = new Thread(r, "spanner-endpoint-lifecycle");
      t.setDaemon(true);
      return t;
    });
medium

The ScheduledExecutorService is created with a single thread. While most operations are asynchronous gRPC calls, endpointCache.get(address) (called in createAndStartProbing) can be a blocking operation depending on the implementation of the ChannelEndpointCache. If many endpoints are being created simultaneously, this single thread could become a bottleneck. Consider using a small pool of threads or ensuring that endpointCache.get is non-blocking.
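
One way to follow the "small pool of threads" suggestion is sketched below. It is not taken from the PR: LifecycleExecutors, daemonFactory, and newDaemonPool are hypothetical names, and the right pool size is an assumption that depends on how long endpointCache.get(address) can block.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that builds a small daemon-thread pool for lifecycle work. */
final class LifecycleExecutors {
  private LifecycleExecutors() {}

  /** Daemon threads with numbered names, so individual workers can be told apart in thread dumps. */
  static ThreadFactory daemonFactory(String namePrefix) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread t = new Thread(runnable, namePrefix + "-" + counter.incrementAndGet());
      t.setDaemon(true); // do not keep the JVM alive for background probing
      return t;
    };
  }

  /** With more than one thread, a blocking endpoint creation no longer stalls every other task. */
  static ScheduledExecutorService newDaemonPool(int threads) {
    return Executors.newScheduledThreadPool(threads, daemonFactory("spanner-endpoint-lifecycle"));
  }
}
```

A call such as LifecycleExecutors.newDaemonPool(2) would replace the single-thread construction. The alternative named in the comment, making the creation path non-blocking, avoids the extra threads but needs changes inside the cache.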

Comment on lines +595 to +608
private static String extractSessionFromMessage(Object message) {
  if (message instanceof ReadRequest) {
    return ((ReadRequest) message).getSession();
  } else if (message instanceof ExecuteSqlRequest) {
    return ((ExecuteSqlRequest) message).getSession();
  } else if (message instanceof BeginTransactionRequest) {
    return ((BeginTransactionRequest) message).getSession();
  } else if (message instanceof CommitRequest) {
    return ((CommitRequest) message).getSession();
  } else if (message instanceof RollbackRequest) {
    return ((RollbackRequest) message).getSession();
  }
  return null;
}
medium

The extractSessionFromMessage method currently handles several request types but misses others that might also contain session information, such as PartitionReadRequest or PartitionQueryRequest. While these might be less frequent in the context of location-aware routing, adding them would ensure more consistent session name capture for the lifecycle manager.

Suggested change
private static String extractSessionFromMessage(Object message) {
  if (message instanceof ReadRequest) {
    return ((ReadRequest) message).getSession();
  } else if (message instanceof ExecuteSqlRequest) {
    return ((ExecuteSqlRequest) message).getSession();
  } else if (message instanceof BeginTransactionRequest) {
    return ((BeginTransactionRequest) message).getSession();
  } else if (message instanceof CommitRequest) {
    return ((CommitRequest) message).getSession();
  } else if (message instanceof RollbackRequest) {
    return ((RollbackRequest) message).getSession();
  } else if (message instanceof com.google.spanner.v1.PartitionReadRequest) {
    return ((com.google.spanner.v1.PartitionReadRequest) message).getSession();
  } else if (message instanceof com.google.spanner.v1.PartitionQueryRequest) {
    return ((com.google.spanner.v1.PartitionQueryRequest) message).getSession();
  }
  return null;
}
