From 75824b32aa5793c068a5d5d236494687439677c6 Mon Sep 17 00:00:00 2001
From: Aaron Kuehler
Date: Tue, 13 Feb 2024 08:45:32 -0500
Subject: [PATCH] Monitor and heal master replica connections (#43)

When a RedisFailover's "master" Redis node is being replicated to a
"slave" Redis node that is NOT part of the RedisFailover, the
redis-operator resets the sentinels indefinitely.

Consider this scenario: the RedisFailover is replicated asynchronously
to a warm standby Redis cluster in a different data center to handle
primary data site outages. Usually we'd configure the secondary site to
replicate from the "slave" nodes of the primary site's RedisFailover.
However, if a failover occurs in the primary data site, it's possible
that the "slave" from which the secondary site is replicating is
promoted to the primary site's new master.

When this happens, sentinel picks up the secondary site's replication
connections and adds them to the list of replicas to consider for
leader election. Thankfully, the operator prevents the sentinels from
communicating with any pods that they ought NOT consider for leader
election, so failovers still behave as expected. However, the
redis-operator then detects that the sentinels are trying to monitor
replicas that they shouldn't and calls `SENTINEL RESET` to clear the
stale replica entries from the sentinels. Because the secondary site is
still replicating from the newly promoted master, its replication
connections are added back to the sentinels' replica lists whenever a
sentinel calls `INFO` on the primary site's "master", repeating the
reset cycle indefinitely.

This change assumes that any replication connection not meant to be
managed by the RedisFailover should be made via the RedisFailover's
"slave" nodes; the operator provides services to reach these nodes.
When the operator detects that the master node has replication
connections that would otherwise confuse the sentinels' leader
election, it cleans up the stale replication connections by resetting
them, forcing the replication clients to re-establish their connections
to a "slave" node in the primary site rather than the master.
---
 CHANGELOG.md                                  |  6 +-
 metrics/metrics.go                            |  2 +
 .../service/RedisFailoverCheck.go             | 14 ++++
 .../service/RedisFailoverHeal.go              | 14 ++++
 mocks/service/redis/Client.go                 | 38 +++++++++
 operator/redisfailover/checker.go             |  9 +++
 operator/redisfailover/checker_test.go        |  3 +
 operator/redisfailover/service/check.go       | 13 ++++
 operator/redisfailover/service/check_test.go  | 44 +++++++++++
 operator/redisfailover/service/heal.go        | 16 ++++
 operator/redisfailover/service/heal_test.go   | 28 +++++++
 service/redis/client.go                       | 78 +++++++++++++++----
 12 files changed, 250 insertions(+), 15 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2eebb6d76..1ca5019ba 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,10 @@ Also check this project's [releases](https://github.com/powerhome/redis-operator
 
 ## Unreleased
 
+### Fixed
+
+- Operator detects and attempts to heal excessive replication connections on the master node. This prevents excessive sentinel resets from the operator when extra-RedisFailover replication connections are present on the master node. #43
+
 ## [v2.0.1] - 2024-02-09
 
 ### Fixed
@@ -25,7 +29,7 @@ This update modifies how the operator generates network policies. In version v2.
 
 Update notes:
 
-This release will change the labels of the HAProxy deployment resource. 
+This release will change the labels of the HAProxy deployment resource.
 It's important to note that in API version apps/v1, a Deployment's label selector [cannot be changed once it's created](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#label-selector-updates). Therefore, any existing HAProxy deployment placed by an
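
To make the healing step concrete, the sketch below shows roughly what resetting an extraneous replication connection on the master can look like. It is not the patch's actual code (per the diffstat, that lives in `service/redis/client.go` and `operator/redisfailover/service/heal.go`); it is a minimal illustration using the go-redis client. The helper name `healMasterReplicas` and the `knownPodIPs` set are invented for the example, and it assumes replica connections are identified by `flags=S` in `CLIENT LIST` output and dropped with `CLIENT KILL`.

```go
package sketch

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// healMasterReplicas drops replica connections on the master that do not
// belong to the RedisFailover, forcing those clients to re-establish their
// replication links (ideally via the "slave" service) instead of feeding
// the sentinels stale replica entries through INFO.
func healMasterReplicas(ctx context.Context, master *redis.Client, knownPodIPs map[string]bool) error {
	clients, err := master.ClientList(ctx).Result()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(clients, "\n") {
		// Each CLIENT LIST line is a space-separated set of key=value
		// pairs, e.g. "id=7 addr=10.0.0.5:51234 ... flags=S ...".
		fields := map[string]string{}
		for _, kv := range strings.Fields(line) {
			if i := strings.Index(kv, "="); i > 0 {
				fields[kv[:i]] = kv[i+1:]
			}
		}
		if !strings.Contains(fields["flags"], "S") {
			continue // not a replication connection
		}
		ip, _, _ := strings.Cut(fields["addr"], ":")
		if knownPodIPs[ip] {
			continue // a replica managed by this RedisFailover
		}
		// CLIENT KILL ID <id> resets the stale replication connection.
		if err := master.ClientKillByFilter(ctx, "ID", fields["id"]).Err(); err != nil {
			return err
		}
	}
	return nil
}
```

When the killed client retries, a secondary site that points at the RedisFailover's "slave" service (rather than at a pod that happens to be master today) reconnects to a true "slave" node, so it no longer appears in the master's `INFO` replication section and the sentinels stop being reset.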