Description
Setup: a cluster of 3 Redis nodes and 3 sentinels, with min-replicas-to-write set to 1.
When continuously writing data with a FailoverClusterClient, executing a failover via the sentinels causes the client to error out with a NOREPLICAS error regardless of the retry settings. The condition typically resolves itself within ~50ms as the replicas are reassigned to the new master.
Expected Behavior
During a failover, the cluster client should retry with backoff on NOREPLICAS errors (within the bounds of the maxretries/maxredirect settings). This gives the remaining nodes time to be reassigned as replicas, which resolves the issue.
Current Behavior
A NOREPLICAS error causes .Set to return an err, and no retry is attempted regardless of settings. A manual retry loop solves this issue, though it shouldn't be needed. As far as I can tell, this issue also occurs with the regular FailoverClient, not just the FailoverClusterClient.
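For reference, the manual retry loop mentioned above could look roughly like the sketch below. It uses only the standard library, so the write is passed in as a plain func() error. The names replyError, isNoReplicas and retryOnNoReplicas are hypothetical helpers made up for this illustration and are not part of the client library. Detecting the error by its "NOREPLICAS" message prefix is also an assumption about how the server reply surfaces in err.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// replyError is a hypothetical stand-in for an error reply returned by the server.
type replyError string

func (e replyError) Error() string { return string(e) }

// isNoReplicas reports whether err looks like a NOREPLICAS reply.
// Matching on the message prefix is an assumption of this sketch.
func isNoReplicas(err error) bool {
	return err != nil && strings.HasPrefix(err.Error(), "NOREPLICAS")
}

// retryOnNoReplicas calls op up to maxAttempts times. It retries only when op
// returns a NOREPLICAS error, doubling the delay between attempts. Any other
// result (success or a different error) is returned immediately.
func retryOnNoReplicas(maxAttempts int, baseDelay time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := baseDelay
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = op()
		if !isNoReplicas(err) {
			return err
		}
		// Sleep only if another attempt will follow.
		if attempt < maxAttempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// Simulated write: fails twice with NOREPLICAS, then succeeds,
	// mimicking the short window while replicas are being reassigned.
	err := retryOnNoReplicas(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return replyError("NOREPLICAS Not enough good replicas to write.")
		}
		return nil
	})
	fmt.Println("attempts:", calls, "err:", err)
}
```

In a real program, op would wrap the actual write, for example a closure that returns the error from the client's .Set call. The delays are illustrative; given the ~50ms recovery window described above, a few short attempts should usually be enough.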
Possible Solution
There is a preset list of retryable errors in shouldRetry (error.go#79). Modify this function (or the calling function, if you want to differentiate write/read behavior) to allow retrying on NOREPLICAS.
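As a rough illustration of the proposed change, the sketch below shows a prefix-based retry predicate that treats NOREPLICAS as retryable. It is not the library's actual shouldRetry code: isRetryableReply, retryablePrefixes and replyError are hypothetical names, and the prefixes other than NOREPLICAS are only placeholders for whatever the existing preset list contains.

```go
package main

import (
	"fmt"
	"strings"
)

// replyError is a hypothetical stand-in for an error reply returned by the server.
type replyError string

func (e replyError) Error() string { return string(e) }

// retryablePrefixes is an illustrative list, not copied from the library.
// The last entry is the addition this issue asks for.
var retryablePrefixes = []string{
	"LOADING ",
	"TRYAGAIN ",
	"NOREPLICAS ",
}

// isRetryableReply reports whether err's message starts with one of the
// retryable prefixes. A nil error is never retryable.
func isRetryableReply(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	for _, prefix := range retryablePrefixes {
		if strings.HasPrefix(msg, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryableReply(replyError("NOREPLICAS Not enough good replicas to write.")))
	fmt.Println(isRetryableReply(replyError("ERR syntax error")))
}
```

If write and read paths need different behavior, the same check could instead live in the calling function, so that NOREPLICAS is only retried for write commands.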
Steps to Reproduce
(This was originally encountered on a k8s cluster, though I can try to create a demo if the provided steps are not sufficient.)
- Redis cluster with 3 nodes, 3 sentinels, 1 primary and a min-replicas-to-write of 1
- Simple for loop with a FailoverClusterClient pointed at one of the sentinels constantly sending set requests
- Have the sentinel fail over:
  redis-cli -p 26379 sentinel failover redismaster
- Client will eventually fail with a NOREPLICAS error.
Context (Environment)
This was originally encountered on a local kind cluster. The issue may be less common with faster hardware/networking that lets the sentinels reassign replicas more quickly.
While this does seem to happen with a standard FailoverClient, I have not done much testing with different config options for the FailoverClient.
Detailed Description
The Failover/FailoverCluster client should not fail during a failover. A specific catch for this error, or a global/fallback retry setting, would be appreciated.