FailoverClusterClient does not retry during failover related NOREPLICAS errors #3636

@Josh-McWilliam

Description

Setup: a cluster of 3 Redis nodes and 3 sentinels, with min-replicas-to-write set to 1.
When continuously writing data with a FailoverClusterClient, executing a failover via the sentinels causes the client to error out with a NOREPLICAS error, regardless of the retry settings. The condition typically resolves itself within ~50ms, once the replicas are reassigned to the new master.

Expected Behavior

During a failover, the cluster client should retry with backoff on NOREPLICAS errors (within the bounds of the maxretries/maxredirect settings). This would give the remaining nodes time to be reassigned as replicas, which resolves the issue.

Current Behavior

A NOREPLICAS error causes .Set to return an error immediately, with no retry attempted regardless of settings. A manual retry loop works around the issue, though it shouldn't be needed. As far as I can tell, this also occurs with the regular FailoverClient, not just the FailoverClusterClient.

Possible Solution

There is a preset list of retryable errors in shouldRetry (error.go#79). Modify this function (or the calling function, if you want to differentiate the write/read behavior) to allow retrying on NOREPLICAS.

Steps to Reproduce

(This was originally encountered on a k8s cluster. I can try to create a demo if the provided steps are not sufficient.)

  1. Set up a Redis cluster with 3 nodes (1 primary), 3 sentinels, and a min-replicas-to-write of 1.
  2. Run a simple for loop with a FailoverClusterClient pointed at one of the sentinels, constantly sending set requests.
  3. Have the sentinel trigger a failover: redis-cli -p 26379 sentinel failover redismaster
  4. The client will eventually fail with a NOREPLICAS error.

Context (Environment)

This was originally encountered on a local kind cluster. The issue may be less common on faster hardware/networking, where the sentinels can reassign replicas more quickly.
While this does seem to happen with a standard FailoverClient as well, I have not done much testing with different config options for the FailoverClient.

Detailed Description

The Failover/FailoverCluster client should not fail during a failover. A specific catch for this error, or a global/fallback retry setting, would be appreciated.
