
upgrading control plane / etcd nodes fails #362

Open · li-il-li opened this issue Oct 11, 2022 · 2 comments

li-il-li commented Oct 11, 2022

We are using the RKE provider v1.33 to provision our K8s on OpenStack.
I am currently trying to upgrade our K8s nodes, which means replacing those nodes one by one.
This upgrade obviously includes the control plane + etcd nodes (3 instances, with etcd and the control plane running together on each instance).
Upgrading works fine for every node except node 0.

To give some context, here is how I'm upgrading our master nodes (etcd + controlplane) in Terraform:

# Before upgrading node 2
master_image = {
  name        = "ubuntu-22.04-docker-x86_64"
  owner       = null
  ha_nr_nodes = 3
  master_tags = [
    202210041232, # Node 0
    202210041232, # Node 1
    202210041232  # Node 2
  ]
}

# Upgrading node 2
master_image = {
  name        = "ubuntu-22.04-docker-x86_64"
  owner       = null
  ha_nr_nodes = 3
  master_tags = [
    202210041232, # Node 0
    202210041232, # Node 1
    202210041459  # Node 2 (changed)
  ]
}

# => terraform apply
# Repeat for node 1

On the OpenStack side this tears down the old instance and provisions a new one with the new image.
The worker nodes as well as nodes 1 and 2 are replaced without any problems and appear as ready roughly 10 minutes after terraform apply.

Running this operation to update node 0, on the other hand, does not succeed.
We see that the RKE setup process gets stuck at the point where it tries to connect to the etcd container on port 2379 for the first time (which we can see in the sshd logs). On nodes 1 and 2 an etcd instance is running at this step in the process; on node 0 it is not.
We therefore assume that the missing etcd container is the root cause of our problem.

On the provider side, however, the setup process does not terminate. It seems to be stuck in a loop, retrying the connection without any timeout or error in place to stop it (which you can see in the node-0 log).

Digging deeper on node 0, we then saw that (besides etcd missing) the rancher/rke-tools image is instantiated as the following containers:

  • rke-cp-port-listener
  • rke-etcd-port-listener
  • rke-port-checker

But those containers terminate right after their launch on node 0. Unfortunately, they don't seem to generate any logs, so we are unable to provide more information.
The file-deployer container starts and finishes successfully.

I want to highlight again that all of these operations work on nodes 1 and 2, which is very confusing for us.

I attached the RKE logs for a working upgrade of node 1 and the failing upgrade of node 0.
They start to diverge at the following entry:

level=info msg="[remove/etcd] Removing member
After this line you can see the repeated SSH connection attempts.

We are happy to run more tests and provide more information.

rke-node-0.log
rke-node-1.log

WarpRat commented Jul 20, 2023

Did you ever find a solution to this? We're hitting the same thing and now have our state stuck with an old node in there that we can't remove.

@seanrmurphy

Luckily we encountered this during testing and were able to find a workaround for our production systems; it basically involves ensuring that any nodes which get added do not appear first in the list of master nodes. This meant that we never hit the issue in our production environments.
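To illustrate, here is a minimal sketch of what that ordering looks like with a plain rke_cluster resource and static nodes blocks. The resource name, addresses, user, and key path are hypothetical; in practice the nodes list is usually built dynamically (e.g. from something like the master_image object above), but the point is only the ordering:

# Sketch of the ordering workaround (hypothetical addresses/paths).
resource "rke_cluster" "cluster" {
  # Keep an existing, untouched master as the first entry in the list ...
  nodes {
    address = "10.0.0.11" # node 1 (unchanged)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }

  nodes {
    address = "10.0.0.12" # node 2 (unchanged)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }

  # ... and append the replacement node last, so a freshly added node never
  # ends up first in the list of master nodes.
  nodes {
    address = "10.0.0.20" # replacement for node 0 (new instance)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }
}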

I'm pretty sure the best approach here is to take the RKE config, which is normally dumped to a temporary directory while terraform is reconciling the RKE cluster, and change the order of the nodes in it. It may be necessary to re-import this into the tfstate afterwards.

We did make some notes on how to deal with scenarios where we needed to drop out of the terraform world and work with rke directly, which I can dig out if required.
