
upgrading control plane / etcd nodes fails #362

Open · li-il-li opened this issue Oct 11, 2022 · 2 comments

li-il-li commented Oct 11, 2022

We are using the RKE provider v1.33 to provision our K8s on OpenStack.
I am currently trying to upgrade our K8s nodes, which means replacing those nodes one by one.
This upgrade obviously includes the control plane + etcd nodes (3 instances, with etcd and the control plane running together on each instance).
Upgrading works fine for every node except node 0.

To give some context, here is how I'm upgrading our master nodes (etcd + controlplane) in Terraform:

# Before upgrading node 2
master_image = {
  name        = "ubuntu-22.04-docker-x86_64"
  owner       = null
  ha_nr_nodes = 3
  master_tags = [
    202210041232, # Node 0
    202210041232, # Node 1
    202210041232  # Node 2
  ]
}

# Upgrading node 2
master_image = {
  name        = "ubuntu-22.04-docker-x86_64"
  owner       = null
  ha_nr_nodes = 3
  master_tags = [
    202210041232, # Node 0
    202210041232, # Node 1
    202210041459  # Node 2 (changed)
  ]
}

# => terraform apply
# Repeat for node 1

On the OpenStack side this tears down the old instance and provisions a new one with the new image.
The worker nodes as well as nodes 1 and 2 are replaced without any problems and appear as ready roughly 10 minutes after terraform apply.

Running this operation to update node 0, on the other hand, does not succeed.
We see that the RKE setup process gets stuck at the point where it tries to connect to the etcd container on port 2379 for the first time (which we can see in the sshd logs). On nodes 1 and 2 an etcd instance is running at this step in the process; on node 0 it is not.
We therefore assume that the missing etcd container is the root cause of our problem.

On the provider side, however, the setup process does not terminate. It seems to be stuck in a loop, retrying the connection without any timeout or error in place to stop it (which you can see in the node-0 log).

Digging deeper on node 0, we then saw that (besides etcd missing) the rancher/rke-tools image is instantiated as the following containers:

  • rke-cp-port-listener
  • rke-etcd-port-listener
  • rke-port-checker

But those containers terminate right after their launch on node 0. Unfortunately, they don't seem to generate any logs, so we are unable to provide more information.
The file-deployer container starts and finishes successfully.

I want to highlight again that all of these operations work on nodes 1 and 2, which is very confusing for us.

I attached the RKE logs for a working upgrade of node 1 and the failing upgrade of node 0.
They start to diverge at the following entry:

level=info msg="[remove/etcd] Removing member
After this line you can see the repeated SSH connection attempts.

We are happy to run more tests and provide more information.

rke-node-0.log
rke-node-1.log

WarpRat commented Jul 20, 2023

Did you ever find a solution to this? We're hitting the same thing and now have our state stuck with an old node in there that we can't remove.

@seanrmurphy

Luckily we encountered this during testing and were able to find a workaround for our production systems; it basically involves ensuring that any nodes which get added do not appear first in the list of master nodes. This meant that we never hit the issue in our production environments.
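To illustrate, here is a minimal sketch of what that ordering looks like with a plain rke_cluster resource and static nodes blocks. The resource name, addresses, user, and key path are hypothetical; in practice the nodes list is usually built dynamically (e.g. from something like the master_image object above), but the point is only the ordering:

# Sketch of the ordering workaround (hypothetical addresses/paths).
resource "rke_cluster" "cluster" {
  # Keep an existing, untouched master as the first entry in the list ...
  nodes {
    address = "10.0.0.11" # node 1 (unchanged)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }

  nodes {
    address = "10.0.0.12" # node 2 (unchanged)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }

  # ... and append the replacement node last, so a freshly added node never
  # ends up first in the list of master nodes.
  nodes {
    address = "10.0.0.20" # replacement for node 0 (new instance)
    user    = "ubuntu"
    role    = ["controlplane", "etcd"]
    ssh_key = file(pathexpand("~/.ssh/id_rsa"))
  }
}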

I'm pretty sure the best approach here is to take the RKE config, which is normally dumped to a temporary directory while terraform is reconciling the RKE cluster, and change the order of the nodes in it. It may be necessary to re-import this into the tfstate afterwards.

We did make some notes on how to deal with scenarios where we needed to drop out of the terraform world and work with rke directly, which I can dig out if required.
