fix(spanner): enforce READY-only location aware routing and add endpoint lifecycle management#12678
fix(spanner): enforce READY-only location aware routing and add endpoint lifecycle management#12678
Conversation
…int lifecycle management Location aware routing previously treated IDLE and CONNECTING channels as healthy, which could send traffic to stale replicas after cache updates. This change tightens endpoint readiness to READY-only, adds state-aware skipped_tablets reporting (TRANSIENT_FAILURE only), and introduces a background lifecycle manager that probes endpoints with GetSession to keep channels warm and evicts idle endpoints after 30 minutes of no real traffic.
There was a problem hiding this comment.
Code Review
This pull request introduces the EndpointLifecycleManager to manage the lifecycle of location-aware routing endpoints, including background probing, traffic tracking, and idle eviction. It updates the routing logic in KeyRangeCache and KeyAwareChannel to be state-aware, ensuring only READY endpoints are utilized and TRANSIENT_FAILURE states are reported. Feedback suggests expanding the session extraction logic to include partition requests and addresses a potential concurrency bottleneck in the single-threaded executor used for background endpoint creation.
| Executors.newScheduledThreadPool( | ||
| 1, | ||
| r -> { | ||
| Thread t = new Thread(r, "spanner-endpoint-lifecycle"); | ||
| t.setDaemon(true); | ||
| return t; | ||
| }); |
There was a problem hiding this comment.
The ScheduledExecutorService is created with a single thread. While most operations are asynchronous gRPC calls, endpointCache.get(address) (called in createAndStartProbing) can be a blocking operation depending on the implementation of the ChannelEndpointCache. If many endpoints are being created simultaneously, this single thread could become a bottleneck. Consider using a small pool of threads or ensuring that endpointCache.get is non-blocking.
| private static String extractSessionFromMessage(Object message) { | ||
| if (message instanceof ReadRequest) { | ||
| return ((ReadRequest) message).getSession(); | ||
| } else if (message instanceof ExecuteSqlRequest) { | ||
| return ((ExecuteSqlRequest) message).getSession(); | ||
| } else if (message instanceof BeginTransactionRequest) { | ||
| return ((BeginTransactionRequest) message).getSession(); | ||
| } else if (message instanceof CommitRequest) { | ||
| return ((CommitRequest) message).getSession(); | ||
| } else if (message instanceof RollbackRequest) { | ||
| return ((RollbackRequest) message).getSession(); | ||
| } | ||
| return null; | ||
| } |
There was a problem hiding this comment.
The extractSessionFromMessage method currently handles several request types but misses others that might also contain session information, such as PartitionReadRequest or PartitionQueryRequest. While these might be less frequent in the context of location-aware routing, adding them would ensure more consistent session name capture for the lifecycle manager.
| private static String extractSessionFromMessage(Object message) { | |
| if (message instanceof ReadRequest) { | |
| return ((ReadRequest) message).getSession(); | |
| } else if (message instanceof ExecuteSqlRequest) { | |
| return ((ExecuteSqlRequest) message).getSession(); | |
| } else if (message instanceof BeginTransactionRequest) { | |
| return ((BeginTransactionRequest) message).getSession(); | |
| } else if (message instanceof CommitRequest) { | |
| return ((CommitRequest) message).getSession(); | |
| } else if (message instanceof RollbackRequest) { | |
| return ((RollbackRequest) message).getSession(); | |
| } | |
| return null; | |
| } | |
| private static String extractSessionFromMessage(Object message) { | |
| if (message instanceof ReadRequest) { | |
| return ((ReadRequest) message).getSession(); | |
| } else if (message instanceof ExecuteSqlRequest) { | |
| return ((ExecuteSqlRequest) message).getSession(); | |
| } else if (message instanceof BeginTransactionRequest) { | |
| return ((BeginTransactionRequest) message).getSession(); | |
| } else if (message instanceof CommitRequest) { | |
| return ((CommitRequest) message).getSession(); | |
| } else if (message instanceof RollbackRequest) { | |
| return ((RollbackRequest) message).getSession(); | |
| } else if (message instanceof com.google.spanner.v1.PartitionReadRequest) { | |
| return ((com.google.spanner.v1.PartitionReadRequest) message).getSession(); | |
| } else if (message instanceof com.google.spanner.v1.PartitionQueryRequest) { | |
| return ((com.google.spanner.v1.PartitionQueryRequest) message).getSession(); | |
| } | |
| return null; | |
| } |
Summary
isHealthy()now requiresREADYstate instead of"not SHUTDOWN and not TRANSIENT_FAILURE". IDLE/CONNECTING endpoints silently fall
back to the default host without emitting
skipped_tablets.skipped_tablets: onlyTRANSIENT_FAILUREendpoints are reportedin
skipped_tabletsso the server can refresh the client cache. Other non-readystates (IDLE, CONNECTING, absent) are skipped silently.
getIfPresent()toChannelEndpointCachesothe foreground request path never triggers blocking endpoint creation.
affinityEndpoint()usesgetIfPresent()+ READY check, fallingback to default instead of forcing traffic to a stale replica.
GetSessionprobes keep replicachannels warm, per-endpoint
lastRealTrafficAttracking enables idle evictionafter 30 minutes, and
requestEndpointRecreation()is called from the routingpath when an evicted endpoint is needed again.
shouldSkip()detects shutdown channels from evictedendpoints, clears the cached reference, and re-lookups from the cache so routing
picks up recreated endpoints.