Failed update breaks redis cluster #1097

Open · Leo791 opened this issue Oct 8, 2024 · 7 comments

Leo791 commented Oct 8, 2024

What version of redis operator are you using?

redis-operator version:
v0.18.0

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (kubectl version)?

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

What did you do?

  • Created a Redis cluster with 3 shards, where each node has 1 CPU and 2Gi of memory, and added the annotation redis.opstreelabs.in/recreate-statefulset: "true" to the CRD
  • Updated the cluster to 2 CPU and 4Gi of memory; the rolling update failed partway through the leaders due to insufficient resources
  • Tried to roll back the update by setting the CRD back to 1 CPU and 2Gi of memory
  • Deleted the pending pod

What did you expect to see?

  • The StatefulSet updated
  • The pods that had already been updated rolled back to the initial values

What did you see instead?

  • The StatefulSet is not updated
  • The deleted pod comes back and is Pending again
  • No pods are restarted with the rolled-back values

The operator throws the following errors:

{"level":"error","ts":"2024-10-08T17:33:40Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:383\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.VerifyLeaderPod\n\t/workspace/k8sutils/cluster-scaling.go:372\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:82\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:40Z","logger":"controllers.RedisCluster","msg":"Failed to Get the role Info of the","redis pod":"redis-instance8-leader-3","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.verifyLeaderPodInfo\n\t/workspace/k8sutils/cluster-scaling.go:380\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.VerifyLeaderPod\n\t/workspace/k8sutils/cluster-scaling.go:374\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:82\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:409\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Could not get pod info","Pod Name":"redis-instance8-leader-3","Namespace":"redis-instance8","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getContainerID\n\t/workspace/k8sutils/redis.go:447\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand1\n\t/workspace/k8sutils/redis.go:413\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:395\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:424\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Could not find pod to execute","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand1\n\t/workspace/k8sutils/redis.go:415\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:395\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:424\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:383\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisNodeID\n\t/workspace/k8sutils/cluster-scaling.go:111\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:305\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Failed to ping Redis server","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisNodeID\n\t/workspace/k8sutils/cluster-scaling.go:116\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:305\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Failed to get attached follower node IDs","masterNodeID":"","error":"ERR Unknown node ","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getAttachedFollowerNodeIDs\n\t/workspace/k8sutils/cluster-scaling.go:265\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:306\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

And if we exec into a cluster pod and run CLUSTER NODES, we get:

2d8a2b04ac23cc89f7eb574286119c916a5d1734 10.244.1.147:6379@16379 master - 0 1728408466361 2 connected 5461-10922
8c4ec5f04ef490d52d5caa98ae4fdb05fbd3ce5c 10.244.2.134:6379@16379 master - 0 1728408465339 4 connected 10923-16383
7c1a1bdd1a05d0e4c0b3e2a181663299eeef382e 10.244.1.146:6379@16379 master,fail - 1728400700085 1728400698000 3 connected
fb5fd3a13987ddf177ed92c662376e31310eee42 10.244.1.148:6379@16379 myself,master - 0 1728408465000 1 connected 0-5460
7108c24dc4daaef4e52088d4ebc0a9759747eaed 10.244.2.131:6379@16379 slave 2d8a2b04ac23cc89f7eb574286119c916a5d1734 0 1728408464830 2 connected
e5113d9b7a35cab7c99689327f16df049b0d221c 10.244.2.132:6379@16379 slave fb5fd3a13987ddf177ed92c662376e31310eee42 0 1728408465000 1 connected

We believe the operator is promoting the follower to leader, but expects it to be named leader-3 instead of follower-x. This blocks the StatefulSet update, and we cannot roll the cluster back to a healthy state.

Is there a way to prevent the failover and the promotion from occurring?

Leo791 added the bug label Oct 8, 2024
drivebyer (Collaborator) commented Oct 9, 2024

@Leo791, as the code shows:

// updateStatefulSet is a method to update statefulset in Kubernetes
func updateStatefulSet(cl kubernetes.Interface, logger logr.Logger, namespace string, stateful *appsv1.StatefulSet, recreateStateFulSet bool) error {
	_, err := cl.AppsV1().StatefulSets(namespace).Update(context.TODO(), stateful, metav1.UpdateOptions{})
	if recreateStateFulSet {
		sErr, ok := err.(*apierrors.StatusError)
		if ok && sErr.ErrStatus.Code == 422 && sErr.ErrStatus.Reason == metav1.StatusReasonInvalid {
			failMsg := make([]string, len(sErr.ErrStatus.Details.Causes))
			for messageCount, cause := range sErr.ErrStatus.Details.Causes {
				failMsg[messageCount] = cause.Message
			}
			logger.V(1).Info("recreating StatefulSet because the update operation wasn't possible", "reason", strings.Join(failMsg, ", "))
			propagationPolicy := metav1.DeletePropagationForeground
			if err := cl.AppsV1().StatefulSets(namespace).Delete(context.TODO(), stateful.GetName(), metav1.DeleteOptions{PropagationPolicy: &propagationPolicy}); err != nil { //nolint
				return errors.Wrap(err, "failed to delete StatefulSet to avoid forbidden action")
			}
		}
	}
	if err != nil {
		logger.Error(err, "Redis statefulset update failed")
		return err
	}
	logger.V(1).Info("Redis statefulset successfully updated ")
	return nil
}

StatefulSets are only deleted when you attempt to update forbidden fields, such as volumeClaimTemplates. Therefore, in my opinion, when a StatefulSet's pod gets stuck in Pending due to insufficient resources, we need to manually delete the StatefulSet (and its pods) under the current code design.
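
For reference, a minimal client-go sketch of that manual deletion (this is not operator code; the kubeconfig location, namespace, and StatefulSet name are assumptions based on this issue):

// Manually delete the stuck leader StatefulSet and its pods (foreground
// propagation), mirroring the Delete call in updateStatefulSet above, so the
// operator can recreate it from the rolled-back CR on the next reconcile.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cl, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	propagation := metav1.DeletePropagationForeground
	err = cl.AppsV1().StatefulSets("redis-instance8").Delete(
		context.TODO(),
		"redis-instance8-leader", // assumed StatefulSet name for this issue
		metav1.DeleteOptions{PropagationPolicy: &propagation},
	)
	if err != nil {
		panic(err)
	}
}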

Leo791 (Author) commented Oct 9, 2024

But isn't that the purpose of the annotation: redis.opstreelabs.in/recreate-statefulset: "true"?

drivebyer (Collaborator) commented

> But isn't that the purpose of the annotation: redis.opstreelabs.in/recreate-statefulset: "true"?

No, we only recreate the StatefulSet when there is an update to forbidden fields. We cannot recreate the StatefulSet when a pod is pending because we cannot determine whether the pending state is temporary or permanent.
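
For context, all the API reports for a resource-starved pod is a scheduling condition; it does not say whether capacity will ever appear. A hypothetical check (not something the operator currently does) would look roughly like this:

// Hypothetical sketch: inspect why a pod is Pending. Even when the reason is
// Unschedulable, there is no way to tell whether the shortage is temporary or
// permanent, which is why the operator does not auto-recreate in this case.
package podutil

import corev1 "k8s.io/api/core/v1"

// isUnschedulable reports whether a Pending pod failed scheduling and why.
func isUnschedulable(pod *corev1.Pod) (bool, string) {
	if pod.Status.Phase != corev1.PodPending {
		return false, ""
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			// Message is e.g. "0/3 nodes are available: 3 Insufficient cpu."
			return true, cond.Message
		}
	}
	return false, ""
}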

Leo791 (Author) commented Oct 9, 2024

Got it, thank you!

And regarding the failover issue, where the operator is looking for a pod that doesn't exist: are you aware of this issue?

drivebyer (Collaborator) commented

> We believe the operator is promoting the follower to leader, but expects it to be named leader-3 instead of follower-x.

Actually, the role string in the pod name does not represent the actual role of the Redis node. We should not rely on the pod name to identify its role.
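
For what it's worth, the authoritative way to get the role is to ask Redis directly; a minimal sketch using go-redis (the pod address is a placeholder taken from the CLUSTER NODES output above):

// Ask Redis for its real role instead of inferring it from the pod name.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "10.244.1.148:6379"}) // placeholder pod IP
	defer rdb.Close()

	// ROLE reports "master" or "slave" regardless of what the pod is called.
	role, err := rdb.Do(ctx, "ROLE").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(role)
}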

> Is there a way to prevent the failover and the promotion from occurring?

Failover is handled by the Redis cluster itself, not by the operator. The operator simply creates resources and integrates them into a Redis cluster. Failover is automatically managed by the cluster.

Leo791 (Author) commented Oct 9, 2024

> Actually, the role string in the pod name does not represent the actual role of the Redis node. We should not rely on the pod name to identify its role.

But right now the operator is indeed using the pod name to identify the role, no? And we think that's what is causing the problem.

Leo791 (Author) commented Oct 9, 2024

How is the operator getting the pod roles and the number of masters and slaves? Is it through CLUSTER NODES? If so, we believe that the deleted master still being listed by CLUSTER NODES in the master,fail state is confusing the operator into expecting n+1 masters, and this is keeping it in an error state.
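
If that is the case, one manual workaround would be to have every remaining node forget the failed node ID; a rough go-redis sketch (not operator code), using the node ID and addresses from the CLUSTER NODES output above:

// Tell every live cluster node to forget the failed master so that
// CLUSTER NODES no longer lists the master,fail entry. CLUSTER FORGET only
// updates the node it is sent to, so it must be issued on every member.
package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	failedNodeID := "7c1a1bdd1a05d0e4c0b3e2a181663299eeef382e" // the master,fail node above
	liveNodes := []string{
		"10.244.1.147:6379",
		"10.244.2.134:6379",
		"10.244.1.148:6379",
		"10.244.2.131:6379",
		"10.244.2.132:6379",
	}
	for _, addr := range liveNodes {
		rdb := redis.NewClient(&redis.Options{Addr: addr})
		if err := rdb.ClusterForget(ctx, failedNodeID).Err(); err != nil {
			panic(err)
		}
		rdb.Close()
	}
}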
