Failed update breaks redis cluster #1097

Open · Leo791 opened this issue Oct 8, 2024 · 7 comments

Leo791 commented Oct 8, 2024

What version of redis operator are you using?

redis-operator version:
v0.18.0

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (kubectl version)?

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

What did you do?

  • Created a Redis cluster with 3 shards, where each node has 1 CPU and 2Gi of memory, and added the annotation redis.opstreelabs.in/recreate-statefulset: "true" to the CRD
  • Updated the cluster to 2 CPU and 4Gi of memory; the rolling update failed partway through the leaders due to insufficient resources
  • Tried to roll back the update by setting the CRD back to 1 CPU and 2Gi of memory
  • Deleted the pending pod

What did you expect to see?

  • The StatefulSet updated
  • The pods that had already been updated rolled back to the initial values

What did you see instead?

  • The StatefulSet is not updated
  • The deleted pod comes back and is Pending again
  • No pods are restarted with the rolled-back values

The operator throws the following errors:

{"level":"error","ts":"2024-10-08T17:33:40Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:383\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.VerifyLeaderPod\n\t/workspace/k8sutils/cluster-scaling.go:372\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:82\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:40Z","logger":"controllers.RedisCluster","msg":"Failed to Get the role Info of the","redis pod":"redis-instance8-leader-3","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.verifyLeaderPodInfo\n\t/workspace/k8sutils/cluster-scaling.go:380\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.VerifyLeaderPod\n\t/workspace/k8sutils/cluster-scaling.go:374\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:82\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:409\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Could not get pod info","Pod Name":"redis-instance8-leader-3","Namespace":"redis-instance8","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getContainerID\n\t/workspace/k8sutils/redis.go:447\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand1\n\t/workspace/k8sutils/redis.go:413\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:395\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:424\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:41Z","logger":"controllers.RedisCluster","msg":"Could not find pod to execute","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand1\n\t/workspace/k8sutils/redis.go:415\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:395\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ClusterFailover\n\t/workspace/k8sutils/cluster-scaling.go:424\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:86\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"redis-instance8","podName":"redis-instance8-leader-3","error":"pods \"redis-instance8-leader-3\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:383\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisNodeID\n\t/workspace/k8sutils/cluster-scaling.go:111\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:305\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Failed to ping Redis server","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisNodeID\n\t/workspace/k8sutils/cluster-scaling.go:116\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:305\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-10-08T17:33:43Z","logger":"controllers.RedisCluster","msg":"Failed to get attached follower node IDs","masterNodeID":"","error":"ERR Unknown node ","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getAttachedFollowerNodeIDs\n\t/workspace/k8sutils/cluster-scaling.go:265\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RemoveRedisFollowerNodesFromCluster\n\t/workspace/k8sutils/cluster-scaling.go:306\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:89\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

And if we exec into a cluster pod and run CLUSTER NODES, we get:

2d8a2b04ac23cc89f7eb574286119c916a5d1734 10.244.1.147:6379@16379 master - 0 1728408466361 2 connected 5461-10922
8c4ec5f04ef490d52d5caa98ae4fdb05fbd3ce5c 10.244.2.134:6379@16379 master - 0 1728408465339 4 connected 10923-16383
7c1a1bdd1a05d0e4c0b3e2a181663299eeef382e 10.244.1.146:6379@16379 master,fail - 1728400700085 1728400698000 3 connected
fb5fd3a13987ddf177ed92c662376e31310eee42 10.244.1.148:6379@16379 myself,master - 0 1728408465000 1 connected 0-5460
7108c24dc4daaef4e52088d4ebc0a9759747eaed 10.244.2.131:6379@16379 slave 2d8a2b04ac23cc89f7eb574286119c916a5d1734 0 1728408464830 2 connected
e5113d9b7a35cab7c99689327f16df049b0d221c 10.244.2.132:6379@16379 slave fb5fd3a13987ddf177ed92c662376e31310eee42 0 1728408465000 1 connected

We believe the operator is promoting the follower to leader, but expects it to be named leader-3 instead of follower-x. This blocks the StatefulSet update, and we cannot roll the cluster back to a healthy state.

Is there a way to prevent the failover and the promotion from occurring?

Leo791 added the bug label Oct 8, 2024
drivebyer (Collaborator) commented Oct 9, 2024

@Leo791, as the code shows:

// updateStatefulSet is a method to update statefulset in Kubernetes
func updateStatefulSet(cl kubernetes.Interface, logger logr.Logger, namespace string, stateful *appsv1.StatefulSet, recreateStateFulSet bool) error {
	_, err := cl.AppsV1().StatefulSets(namespace).Update(context.TODO(), stateful, metav1.UpdateOptions{})
	if recreateStateFulSet {
		sErr, ok := err.(*apierrors.StatusError)
		if ok && sErr.ErrStatus.Code == 422 && sErr.ErrStatus.Reason == metav1.StatusReasonInvalid {
			failMsg := make([]string, len(sErr.ErrStatus.Details.Causes))
			for messageCount, cause := range sErr.ErrStatus.Details.Causes {
				failMsg[messageCount] = cause.Message
			}
			logger.V(1).Info("recreating StatefulSet because the update operation wasn't possible", "reason", strings.Join(failMsg, ", "))
			propagationPolicy := metav1.DeletePropagationForeground
			if err := cl.AppsV1().StatefulSets(namespace).Delete(context.TODO(), stateful.GetName(), metav1.DeleteOptions{PropagationPolicy: &propagationPolicy}); err != nil { //nolint
				return errors.Wrap(err, "failed to delete StatefulSet to avoid forbidden action")
			}
		}
	}
	if err != nil {
		logger.Error(err, "Redis statefulset update failed")
		return err
	}
	logger.V(1).Info("Redis statefulset successfully updated ")
	return nil
}

StatefulSets are only deleted when you attempt to update forbidden fields, such as volumeClaimTemplates. Therefore, in my opinion, when a StatefulSet's pod gets stuck in Pending due to insufficient resources, we need to manually delete the StatefulSet (and its pods) under the current code design.
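
For reference, a minimal client-go sketch of that manual deletion (this is not operator code; the kubeconfig location, namespace, and StatefulSet name are assumptions based on this issue):

// Manually delete the stuck leader StatefulSet and its pods (foreground
// propagation), mirroring the Delete call in updateStatefulSet above, so the
// operator can recreate it from the rolled-back CR on the next reconcile.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cl, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	propagation := metav1.DeletePropagationForeground
	err = cl.AppsV1().StatefulSets("redis-instance8").Delete(
		context.TODO(),
		"redis-instance8-leader", // assumed StatefulSet name for this issue
		metav1.DeleteOptions{PropagationPolicy: &propagation},
	)
	if err != nil {
		panic(err)
	}
}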

Leo791 (Author) commented Oct 9, 2024

But isn't that the purpose of the annotation: redis.opstreelabs.in/recreate-statefulset: "true"?

drivebyer (Collaborator) commented

> But isn't that the purpose of the annotation: redis.opstreelabs.in/recreate-statefulset: "true"?

No, we only recreate the StatefulSet when there is an update to forbidden fields. We cannot recreate the StatefulSet when a pod is pending because we cannot determine whether the pending state is temporary or permanent.
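
For context, all the API reports for a resource-starved pod is a scheduling condition; it does not say whether capacity will ever appear. A hypothetical check (not something the operator currently does) would look roughly like this:

// Hypothetical sketch: inspect why a pod is Pending. Even when the reason is
// Unschedulable, there is no way to tell whether the shortage is temporary or
// permanent, which is why the operator does not auto-recreate in this case.
package podutil

import corev1 "k8s.io/api/core/v1"

// isUnschedulable reports whether a Pending pod failed scheduling and why.
func isUnschedulable(pod *corev1.Pod) (bool, string) {
	if pod.Status.Phase != corev1.PodPending {
		return false, ""
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			// Message is e.g. "0/3 nodes are available: 3 Insufficient cpu."
			return true, cond.Message
		}
	}
	return false, ""
}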

Leo791 (Author) commented Oct 9, 2024

Got it, thank you!

And regarding the failover issue, where the operator is looking for a pod that doesn't exist: are you aware of this issue?

drivebyer (Collaborator) commented

> We believe the operator is promoting the follower to leader, but expects it to be named leader-3 instead of follower-x.

Actually, the role string in the pod name does not represent the actual role of the Redis node. We should not rely on the pod name to identify its role.
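
For what it's worth, the authoritative way to get the role is to ask Redis directly; a minimal sketch using go-redis (the pod address is a placeholder taken from the CLUSTER NODES output above):

// Ask Redis for its real role instead of inferring it from the pod name.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "10.244.1.148:6379"}) // placeholder pod IP
	defer rdb.Close()

	// ROLE reports "master" or "slave" regardless of what the pod is called.
	role, err := rdb.Do(ctx, "ROLE").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(role)
}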

> Is there a way to prevent the failover and the promotion from occurring?

Failover is handled by the Redis cluster itself, not by the operator. The operator simply creates resources and integrates them into a Redis cluster. Failover is automatically managed by the cluster.

Leo791 (Author) commented Oct 9, 2024

> Actually, the role string in the pod name does not represent the actual role of the Redis node. We should not rely on the pod name to identify its role.

But right now the operator is indeed using the pod name to identify the role, no? And we think that's what is causing the problem.

Leo791 (Author) commented Oct 9, 2024

How is the operator getting the pod roles and the number of masters and slaves? Is it through CLUSTER NODES? If so, we believe that the deleted master still being listed by CLUSTER NODES in the master,fail state is confusing the operator into expecting n+1 masters, and this is keeping it in an error state.
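
If that is the case, one manual workaround would be to have every remaining node forget the failed node ID; a rough go-redis sketch (not operator code), using the node ID and addresses from the CLUSTER NODES output above:

// Tell every live cluster node to forget the failed master so that
// CLUSTER NODES no longer lists the master,fail entry. CLUSTER FORGET only
// updates the node it is sent to, so it must be issued on every member.
package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	failedNodeID := "7c1a1bdd1a05d0e4c0b3e2a181663299eeef382e" // the master,fail node above
	liveNodes := []string{
		"10.244.1.147:6379",
		"10.244.2.134:6379",
		"10.244.1.148:6379",
		"10.244.2.131:6379",
		"10.244.2.132:6379",
	}
	for _, addr := range liveNodes {
		rdb := redis.NewClient(&redis.Options{Addr: addr})
		if err := rdb.ClusterForget(ctx, failedNodeID).Err(); err != nil {
			panic(err)
		}
		rdb.Close()
	}
}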
