[Bug]: xline cluster will enter a frozen state after multiple crash and recoveries #402

Open
1 task done
iGxnon opened this issue Jul 27, 2023 · 0 comments
Labels
bug Something isn't working

Contributor

iGxnon commented Jul 27, 2023

Description about the bug

After multiple crashes and recoveries, the xline cluster enters a frozen state even though the configuration of the xline servers is normal.

xline-kv/xline-operator#16 added a simple chaos validation to xline-operator. Its logic is as follows:

monkeys() {
  size=$1
  iters=$2
  max_kill=$((size / 2))
  echo "monkeys: size=$size, iters=$iters, max_kill=$max_kill"
  for ((i = 0; i < iters; i++)); do
    case $(random 3) in
    0)
      echo "monkeys: put get"
      value=$(random 100)
      run_expect "put A $value" "OK"
      run_expect "get A" "A\n$value"
      ;;
    1)
      echo "monkeys: drop pods"
      # Before deleting the pod, execute "put get" to ensure that the cluster works properly.
      run_expect "put A 1" "OK"
      run_expect "get A" "A\n1"
      # Get the current number of active nodes.
      ready=$(kubectl get sts/$CLUSTER_NAME -o=jsonpath='{.status.readyReplicas}')
      # Calculate the size to be killed
      killed=$((ready + max_kill - size))
      for ((y = 0; y < killed; y++)); do
        name=$CLUSTER_NAME-$(random "$size")
        kubectl delete pod/"$name" --force --grace-period=0 2>/dev/null
      done
      ;;
    2)
      echo "monkeys: wait for pods"
      kubectl wait --for=jsonpath='{.status.readyReplicas}'="$size" sts/$CLUSTER_NAME --timeout=10m
      ;;
    esac
  done
}
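
To make the "drop pods" branch easier to follow, here is a small sketch that isolates its arithmetic: `killed = ready + max_kill - size`. The helper name `kill_count` is hypothetical and not part of the original script; it only reproduces the calculation and clamps negative results to zero, which matches the original loop running zero times in that case.

```shell
#!/usr/bin/env bash
# Reproduces only the arithmetic of the "drop pods" branch in monkeys().
# kill_count is a hypothetical helper name, not part of the original script.
kill_count() {
  local size=$1 ready=$2
  local max_kill=$((size / 2))
  local killed=$((ready + max_kill - size))
  # The original for-loop runs zero times when the result is not positive.
  ((killed < 0)) && killed=0
  echo "$killed"
}

# Delete attempts per iteration for a 5-node cluster.
for ready in 5 4 3 2; do
  echo "size=5 ready=$ready -> delete attempts: $(kill_count 5 "$ready")"
done
```

For a 5-node cluster this gives 2, 1, 0, and 0 delete attempts, so the script aims to keep at least 3 of 5 pods ready. These are attempts rather than distinct pods: if `random "$size"` returns the same index twice, fewer pods are actually deleted.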

After several iterations, the cluster enters a frozen state. At this point, a GET request times out with the following output:

$ kubectl exec pod/tester -c etcdctl -- bash -c "ETCDCTL_API=3 etcdctl --endpoints='http://my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379' get A"
{"level":"warn","ts":"2023-07-27T10:24:08.495Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x40001bc000/my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
command terminated with exit code 1
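
The output above comes from a single endpoint (`my-xline-cluster-2`). To check whether every member behaves the same way, the same `get A` can be repeated against each pod. The following is a rough sketch, assuming the pod DNS names follow the pattern shown above and that the `tester` pod with its `etcdctl` container is available; `endpoints` and `probe_all` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print one client URL per member; names follow the pattern seen in the report.
endpoints() {
  local cluster=$1 size=$2 ns=${3:-default}
  local i
  for ((i = 0; i < size; i++)); do
    echo "http://$cluster-$i.$cluster.$ns.svc.cluster.local:2379"
  done
}

# Run "get A" against every member with an explicit timeout.
# Requires the live cluster and the tester pod from the report.
probe_all() {
  local ep
  for ep in $(endpoints "$@"); do
    if kubectl exec pod/tester -c etcdctl -- bash -c \
      "ETCDCTL_API=3 etcdctl --endpoints='$ep' --command-timeout=5s get A" \
      >/dev/null 2>&1; then
      echo "ok      $ep"
    else
      echo "failed  $ep"
    fi
  done
}

# Example (not executed here): probe_all my-xline-cluster 5
```

If every endpoint reports `failed`, the freeze affects the whole cluster rather than a single member.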

However, kubectl reports all pods as healthy:

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS      AGE
my-xline-cluster-0                   1/1     Running   0             15m
my-xline-cluster-1                   1/1     Running   0             15m
my-xline-cluster-2                   1/1     Running   0             16m
my-xline-cluster-3                   1/1     Running   0             16m
my-xline-cluster-4                   1/1     Running   0             16m
my-xline-operator-6b5979899b-7m8q7   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-jqw9b   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-pmn9d   1/1     Running   4 (84m ago)   2d5h
tester                               1/1     Running   0             12m

Version

0.4.1 (Default)

Relevant log output

# Some common logs among the cluster

2023-07-27T10:12:33.248384Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248392Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248398Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248913Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:35.249666Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:37.360188Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:39.458791Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458803Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458814Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458972Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:41.459056Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:43.566937Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:51.763590Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763874Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763938Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.764114Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:53.763917Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:55.879934Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
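
The excerpt shows a repeating pattern per `cmd_id`: several `key conflict error` warnings in the fast round, followed by `Timeout expired` warnings in the slow round. A sketch for tallying this from a longer log is below; `summarize_rounds` is a hypothetical helper name, and it assumes log lines in the same format as above are piped in on stdin.

```shell
#!/usr/bin/env bash
# Count fast_round conflicts and slow_round timeouts per cmd_id.
# Reads xline log lines on stdin; summarize_rounds is a hypothetical helper name.
summarize_rounds() {
  awk '
    match($0, /cmd_id="[^"]+"/) {
      # Strip the leading cmd_id=" (8 chars) and the closing quote.
      id = substr($0, RSTART + 8, RLENGTH - 9)
      if (!(id in seen)) { seen[id] = 1; order[++n] = id }
      if ($0 ~ /fast_round:.*key conflict error/) fast[id]++
      if ($0 ~ /slow_round:.*Timeout expired/) slow[id]++
    }
    END {
      # Print in order of first appearance.
      for (i = 1; i <= n; i++)
        printf "%s fast_conflicts=%d slow_timeouts=%d\n", order[i], fast[order[i]], slow[order[i]]
    }
  '
}

# Example (xline.log is a hypothetical file name): summarize_rounds < xline.log
```

Applied to the lines above, each of the three `cmd_id` values has 4 fast-round conflicts and 2 slow-round timeouts.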

Code of Conduct

  • I agree to follow this project's Code of Conduct
iGxnon added the bug (Something isn't working) label on Jul 27, 2023
iGxnon changed the title from "[Bug]:" to "[Bug]: xline cluster will enter a frozen state after multiple crash and recoveries" on Jul 27, 2023