
Antipattern in status checks #186

Open · 3 of 6 tasks

lllamnyp opened this issue Apr 21, 2024 · 6 comments

@lllamnyp (Collaborator) commented Apr 21, 2024

TL;DR

This issue is acknowledged and work is in progress to resolve it:

Please read further to understand the background reasoning.

Summary

The status field of the EtcdCluster is self-referential and is not determined by the status of an actual etcd cluster.

Background

Kubernetes controllers are level-based. This generally means that a controller should be able to determine its status from the state of the surrounding environment. It is generally considered a bad idea to read the status of the root object (in this case, the EtcdCluster CR).

Remember, status should be able to be reconstituted from the state of the world, so it’s generally not a good idea to read from the status of the root object. Instead, you should reconstruct it every run.

Kubebuilder book — Implementing a controller

Controllers also do not distinguish between the types of events that trigger reconciliation (was a resource created, updated, deleted, etc.). Trying to fight this restriction may cause complications further down the line. For example, implementing statuses that describe the lifecycle of a resource, and then relying on them to select the controller's mode of operation, can lead to new failure modes that need to be handled in different ways, growing the complexity of the codebase.

Issue

In its current state, the controller relies on an EtcdCluster's status field to determine the next steps in the reconciliation logic. If the controller reads an incorrect status without actually validating it, the etcd cluster will fail to launch. A contrived example reproduces the issue as follows:

First, apply the following YAML:
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: default
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: etcd
        image: "nginx"
        args: ["-c", "sleep 3600"]
        command: ["/bin/sh"]
        livenessProbe:
          httpGet:
            host: ifconfig.me
            path: /
            port: 80
            scheme: HTTP
        readinessProbe:
          httpGet:
            host: ifconfig.me
            path: /
            port: 80
            scheme: HTTP
        startupProbe:
          httpGet:
            host: ifconfig.me
            path: /
            port: 80
            scheme: HTTP

This object will generate a StatefulSet that launches and passes all readiness checks, but it will not in fact build a working cluster; instead, three containers will simply run sleep 3600. The cluster-state ConfigMap will nevertheless set the value of ETCD_INITIAL_CLUSTER_STATE to existing.
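For the record, the wrongly recorded bootstrap state can be inspected directly. A minimal check, assuming the operator derives the cluster-state ConfigMap name from the EtcdCluster name "test" (the actual name may differ):

# Hypothetical ConfigMap name; adjust to whatever the operator actually creates.
kubectl -n default get configmap test-cluster-state \
  -o jsonpath='{.data.ETCD_INITIAL_CLUSTER_STATE}'
# Prints "existing" after applying the broken manifest above.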

Next, attempt to fix the problem by applying a correct manifest:
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: default
spec:
  replicas: 3

The StatefulSet's pod spec will be updated with the correct values to launch an etcd cluster, but since the initial cluster state is erroneously set to "existing", a cluster will never be bootstrapped.

Analysis

In my opinion, the root cause of this problem is that no attempt is made to verify the "actual" status of the etcd cluster. Although the controller's logic is currently quite simple and one must craft rather contrived bad inputs to break it, the example does show that a failed transaction here or there (a dropped packet, a crashed node, etc.) can leave an EtcdCluster (the custom resource) in an unrecoverable state.

Since implementing communication with the etcd cluster will be necessary in any case (for instance, to implement scaling up and down), I think it is prudent to start work on such a feature now and use it to determine the status of the cluster.
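As a sketch of the kind of information such a feature would gather, here is the equivalent with etcdctl; the endpoint names assume an EtcdCluster named "test" in the default namespace and the standard client port 2379:

ENDPOINTS=test-0.test-headless.default.svc:2379,test-1.test-headless.default.svc:2379,test-2.test-headless.default.svc:2379

# Cluster ID, member ID, leader and raft term, per endpoint.
etcdctl --endpoints=$ENDPOINTS endpoint status -w table

# Which members the cluster itself knows about.
etcdctl --endpoints=$ENDPOINTS member list -w table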

@lllamnyp (Collaborator, Author)

There are two pieces of information:

  1. Does the cluster have quorum?
  2. Does the cluster-state configmap indicate ETCD_INITIAL_CLUSTER_STATE=existing?

If the cluster is in quorum:

  • Verify with etcdctl --endpoints=... endpoint status (endpoint health, member list, etc.); see the sketch after this list.
  • If yes, set ETCD_INITIAL_CLUSTER_STATE=existing and skip the next step; the cluster exists and is generally healthy.
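A rough sketch of that quorum check (the endpoints and the threshold for a 3-member cluster are illustrative; the output format of endpoint health may vary between etcdctl versions):

ENDPOINTS=test-0.test-headless.default.svc:2379,test-1.test-headless.default.svc:2379,test-2.test-headless.default.svc:2379

# Count healthy endpoints; quorum for 3 members is 2.
healthy=$(etcdctl --endpoints=$ENDPOINTS endpoint health 2>&1 | grep -c ' is healthy')
if [ "$healthy" -ge 2 ]; then
  echo "quorum present: set ETCD_INITIAL_CLUSTER_STATE=existing"
else
  echo "quorum absent: fall through to the checks below"
fi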

If quorum is absent, but the cluster is known to have been bootstrapped (ETCD_INITIAL_CLUSTER_STATE=existing):

  • Check whether any members are not running. Can they be started?
  • Reconcile again after the managed pods' status changes. Perhaps requeue and set a deadline for the next attempt at reconciliation (cf. CrashLoopBackOff behavior).
  • If quorum is not eventually restored, etcd has suffered majority loss (or someone erroneously set the initial cluster state to existing even though there was no cluster).

Paths to recovery from majority failure are outside the scope of this issue and can be considered in a future release.

@lllamnyp (Collaborator, Author)

If there's no ConfigMap, or it indicates that the cluster is new, this is normal create mode. The controller should wait for all pods to start running; quorum will eventually be acquired. As before, timeouts/deadlines should apply.

One failure mode here: a cluster may have actually existed, along with persistent volumes on some members. These members will ignore initial-cluster-state=new and the other bootstrapping parameters. This failure mode is equivalent to a loss of quorum where the initial cluster state was existing. Timeouts/deadlines apply; if the cluster doesn't come back online, set an error state and require manual intervention.

@kvaps (Member) commented Apr 24, 2024

I would suggest that the operator automatically save the cluster state in the resource spec, for example in an options map:

spec.options["initial-cluster-state"] = "existing"

Since options is a map, this should not pose a problem for the user. It's quite normal in Kubernetes for multiple controllers to work with one type of resource without interfering with each other, thanks to the three-way merge.

Such a resource can be backed up and restored at any time without fear of losing its status.

Accordingly, on the server side, this flag should be immutable, meaning it can be assigned but not changed, similar to the nodeName field in a pod spec.
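A hypothetical sketch of what this would look like once such a field exists (the options map shown here is the proposed API, not something the operator supports today):

# The operator (or an admin) records the bootstrap state in the proposed
# spec.options map of the EtcdCluster resource.
kubectl -n default patch etcdcluster test --type=merge \
  -p '{"spec":{"options":{"initial-cluster-state":"existing"}}}'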

@lllamnyp (Collaborator, Author)

Majority failure scenarios

I've run some tests involving a majority failure with data loss. Here are my observations.

Etcd launch parameters are:

initial_state=new
initial_cluster=etcd-0=http://127.0.0.1:23800,etcd-1=http://127.0.0.1:23801,etcd-2=http://127.0.0.1:23802
# $i runs from 0 to 2
etcd \
        --initial-cluster $initial_cluster \
        --initial-cluster-state $initial_state \
        --initial-advertise-peer-urls http://127.0.0.1:2380$i \
        --listen-metrics-urls http://127.0.0.1:2381$i \
        --listen-peer-urls http://127.0.0.1:2380$i \
        --listen-client-urls http://127.0.0.1:2379$i \
        --advertise-client-urls http://127.0.0.1:2379$i \
        --name etcd-$i

Scenario 1: Surviving minority raft term greater than new cluster raft term

Steps to reproduce

  • Launch a 3-node etcd cluster.
  • Advance the raft term by putting keys and requesting etcdctl move-leader several times (see the example after this list).
  • Stop all etcd members.
  • Delete data directories for etcd-1 and etcd-2.
  • Restart members etcd-1 and etcd-2 (they form a new cluster).
  • Observe cluster state (endpoint statuses, health, member list).
  • Restart member etcd-0.
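Advancing the term can be done roughly like this (client URLs follow the launch parameters above; the member ID placeholder is taken from the member list output and differs per run):

# Write a key, then rotate leadership a few times to bump the raft term.
etcdctl --endpoints=127.0.0.1:23790,127.0.0.1:23791,127.0.0.1:23792 put somekey somevalue
etcdctl --endpoints=127.0.0.1:23790,127.0.0.1:23791,127.0.0.1:23792 member list -w table
etcdctl --endpoints=127.0.0.1:23790,127.0.0.1:23791,127.0.0.1:23792 move-leader <target-member-id>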

Observations

  • Members etcd-1 and etcd-2 form a new cluster on startup.
  • They wait for member etcd-0 to start and advertise its client URL.
  • Once member etcd-0 is restored, members etcd-1 and etcd-2 discard their history and catch up with member etcd-0, presumably because etcd-0's raft term is ahead of the rest.
  • The keys and key history are present, as they were before the simulated majority failure.

Scenario 2: Minority raft term is behind new majority

Steps to reproduce

  • Launch cluster.
  • Put key foo=bar.
  • Stop all etcd members.
  • Delete data directories for etcd-1 and etcd-2.
  • Restart members etcd-1 and etcd-2 (they form a new cluster).
  • Put key foo=baz.
  • Advance the raft term by requesting etcdctl move-leader several times.
  • Restart member etcd-0.

Observations

  • As before, members etcd-1 and etcd-2 form a new cluster and await etcd-0.
  • On startup member etcd-0 never rejoins the cluster.
  • Instead, it repeatedly logs an error.
  • The data directory of etcd-0 remains intact and contains the key foo=bar.
  • As expected, the data dir of etcd-1 and etcd-2 contains foo=baz.

etcdctl member list can suggest whether a member was ever seen. After simulating a majority failure and then restarting etcd-1 and etcd-2 to form a new cluster, the following is observed.

Before starting etcd-0:

+------------------+---------+--------+------------------------+------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+--------+------------------------+------------------------+------------+
|  2943212f5e2cf73 | started | etcd-2 | http://127.0.0.1:23802 | http://127.0.0.1:23792 |      false |
| 2e1259d415f2a040 | started | etcd-0 | http://127.0.0.1:23800 |                        |      false |
| bdae9bbc11dd390d | started | etcd-1 | http://127.0.0.1:23801 | http://127.0.0.1:23791 |      false |
+------------------+---------+--------+------------------------+------------------------+------------+

After starting etcd-0:

+------------------+---------+--------+------------------------+------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+--------+------------------------+------------------------+------------+
|  2943212f5e2cf73 | started | etcd-2 | http://127.0.0.1:23802 | http://127.0.0.1:23792 |      false |
| 2e1259d415f2a040 | started | etcd-0 | http://127.0.0.1:23800 | http://127.0.0.1:23790 |      false |
| bdae9bbc11dd390d | started | etcd-1 | http://127.0.0.1:23801 | http://127.0.0.1:23791 |      false |
+------------------+---------+--------+------------------------+------------------------+------------+

This happens regardless of the raft term index.


I conjecture that in most cases, this behavior is sufficient for the cluster to recover from a majority failure. If the cluster has existed for some time, its raft term can be expected to be quite large, while the failed-and-then-recovered majority that suffered data loss will have a raft term around 0 or 1. The successful first scenario did not require setting initial-cluster-state=existing to recover the cluster.

@lllamnyp (Collaborator, Author)

More majority failure scenarios

Scenarios 1 & 2 both start with initial_cluster=etcd-0=http://127.0.0.1:23800,etcd-1=http://127.0.0.1:23801,etcd-2=http://127.0.0.1:23802.

This leads to consistent and repeatable member and cluster IDs, which is why etcd-1/2 happily merge back with etcd-0. Since an etcd cluster can be created in one configuration and then change over time as members are added and removed, the initial-cluster parameters at any given moment might not match those used at bootstrap.

Scenario 3: majority failure with inconsistent cluster/member IDs

Steps to reproduce

  • Bootstrap an etcd cluster with members 0 and 1.
  • Scale up to three members, using member add and running etcd-2 with initial-cluster-state=existing (see the example after this list).
  • Put key-value pairs, move leader a few times to create some history in the cluster.
  • Stop all members.
  • Delete members 1 and 2 along with their data dirs.
  • Bootstrap a new etcd cluster with members 1 and 2.
  • Resume member 0.
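The scale-up step looks roughly like this (URLs follow the launch parameters above):

# Register the new member with the existing two-node cluster...
etcdctl --endpoints=127.0.0.1:23790,127.0.0.1:23791 \
        member add etcd-2 --peer-urls=http://127.0.0.1:23802
# ...then start etcd-2 so it joins the existing cluster instead of bootstrapping a new one.
etcd --name etcd-2 \
        --initial-cluster etcd-0=http://127.0.0.1:23800,etcd-1=http://127.0.0.1:23801,etcd-2=http://127.0.0.1:23802 \
        --initial-cluster-state existing \
        --initial-advertise-peer-urls http://127.0.0.1:23802 \
        --listen-peer-urls http://127.0.0.1:23802 \
        --listen-client-urls http://127.0.0.1:23792 \
        --advertise-client-urls http://127.0.0.1:23792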

Observations

  • The minority part (etcd-0) refuses to acknowledge the new majority (etcd-1/2) because of cluster ID mismatch.

Scenario 4: as before, but with initial-cluster-state=existing

Steps to reproduce

  • Repeat the steps above up to, but not including, bootstrapping etcd-1/2.
  • Launch members 1 and 2 with initial-cluster parameter including all three members and initial-cluster-state=existing.

Observations

  • All three members are unhealthy.

Scenario 5: recovery from majority failure

Steps to reproduce

  • Repeat the steps from scenario 3 up to, but not including, bootstrapping etcd-1/2.
  • Restart only etcd-0 with the --force-new-cluster flag (sketched below).
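A sketch of that restart, reusing the launch parameters from above:

# Restart only etcd-0; --force-new-cluster discards the old membership and
# forms a single-member cluster from the existing data directory.
etcd --name etcd-0 \
        --force-new-cluster \
        --listen-peer-urls http://127.0.0.1:23800 \
        --listen-client-urls http://127.0.0.1:23790 \
        --advertise-client-urls http://127.0.0.1:23790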

Observations

  • A new single-node cluster is created.
  • The keyspace before the simulated failure is restored.

@lllamnyp (Collaborator, Author) commented Apr 30, 2024

WIP: Reconciliation scenarios

It looks like recovery from majority failure is possible, but it is probably hard enough to implement that other, lower-hanging fruit would be more productive. Instead, I will consider some simpler reconciliation scenarios that occur during normal operations, such as creating and updating a healthy cluster.

The scenarios, however, will not be "cluster creation" or "cluster scaling". The operator has no way of knowing which kind of event triggered a reconciliation.

This is a first shot at working out the reconciliation loop. It is superseded by the flowchart linked below.

Step 1: An EtcdCluster resource is found and a reconcile is triggered

  • Can the cluster be accessed by its endpoints <name>-N.<name>-headless.<namespace>.svc?

Pros: If quorum exists, immediately short-circuit most status checks.

Cons: If scaling down is combined with an error on one of the members, results may be unreliable.

Step 2: Quorum does not exist

  • Is this a majority failure? Did any endpoints respond and report any kind of status?
  • Is this a cluster with persistent or ephemeral storage?

If some endpoints respond with some kind of status, this is either a majority failure, or some time must simply be allowed for objects in Kubernetes to reach the desired state.

It should be helpful to ensure that the StatefulSet exists. This can lead to split-brain, but that can be detected from the endpoints' statuses by checking whether they all report the same cluster ID.
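A sketch of that check (endpoint names are illustrative; field names follow etcdctl's JSON output and may differ slightly between versions):

ENDPOINTS=test-0.test-headless.default.svc:2379,test-1.test-headless.default.svc:2379,test-2.test-headless.default.svc:2379

# One distinct cluster ID means a single cluster; more than one means the
# members have split into separate clusters.
etcdctl --endpoints=$ENDPOINTS endpoint status -w json \
  | jq '[.[].Status.header.cluster_id] | unique | length'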

If the cluster is running on ephemeral storage, it is either being created or has failed. A field like progressDeadlineSeconds might be useful.

Step 3: No endpoints are responding

  • Most likely, the cluster is being created.

This is a good time to ensure the cluster-state ConfigMap exists and to fill in the ETCD_INITIAL_CLUSTER value. After this, as before, create the StatefulSet.
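A hypothetical version of that step; the ConfigMap name, keys, and peer port mirror the earlier examples, but the real layout is up to the operator:

# Create-mode bootstrap: record the initial membership and cluster state.
kubectl -n default create configmap test-cluster-state \
  --from-literal=ETCD_INITIAL_CLUSTER_STATE=new \
  --from-literal=ETCD_INITIAL_CLUSTER=test-0=http://test-0.test-headless.default.svc:2380,test-1=http://test-1.test-headless.default.svc:2380,test-2=http://test-2.test-headless.default.svc:2380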

Flowchart: (diagram attached in the original issue)

@kvaps added this to the v0.4.0 milestone May 14, 2024
@lllamnyp mentioned this issue Aug 13, 2024