Steps to Reproduce:
Completely remove the first node in the list of nodes so that it is no longer accessible (we deleted the VM entirely), but do not modify the rke state. The cluster remains operational but degraded, as one node has disappeared.
Create a replacement VM (in our case, with a new OS version); it obtains a new IP address.
Modify cluster.yml so that the IP address of the new node replaces the IP address of the old node.
Run rke up on the modified cluster.yml.
Results:
The rke up process gets stuck removing the node.
Analysis:
I performed some troubleshooting and found the following:
rke reads in the set of nodes from cluster.yml and compares it with the state recorded in the rkestate file. Note that in this case neither of these fully represents the current operational state of the system.
rke correctly identifies that the old node must be removed.
rke assumes that the etcd cluster to use is the one defined in cluster.yml; however, this is not fully correct, as etcd has not yet been deployed on the first node.
In the etcd node removal process, rke attempts to connect to the first etcd node and never times out.
More specifically, the RemoveEtcdMember function is called with the member to be removed and the desired set of etcd members in the cluster (which does not represent the current state). It then iterates over this set; if the first node in the set is not running etcd, rke up blocks at that point and never completes.
We have observed the same issue when adding an etcd member to the cluster while the first node in the list is not accessible.
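To illustrate the behaviour we would have expected, here is a minimal sketch of iterating over candidate etcd hosts with a per-host deadline, so that an unreachable first host is skipped instead of blocking forever. This is not RKE's code: `removeFunc`, `removeViaAnyHost`, `fakeRemove` and the IP addresses are all hypothetical names made up for this example, and the sketch assumes the underlying call honours context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// removeFunc is a hypothetical stand-in for the call that asks the etcd
// instance on host to remove member. It is assumed to honour ctx cancellation.
type removeFunc func(ctx context.Context, host, member string) error

// removeViaAnyHost tries each candidate host in order and gives every attempt
// at most perHost to finish, so an unreachable first host cannot block the
// whole operation. It returns the host through which the removal succeeded.
func removeViaAnyHost(hosts []string, member string, perHost time.Duration, remove removeFunc) (string, error) {
	var lastErr error
	for _, host := range hosts {
		ctx, cancel := context.WithTimeout(context.Background(), perHost)
		err := remove(ctx, host, member)
		cancel()
		if err == nil {
			return host, nil
		}
		lastErr = fmt.Errorf("%s: %w", host, err)
	}
	if lastErr == nil {
		lastErr = errors.New("no etcd hosts to try")
	}
	return "", lastErr
}

// fakeRemove simulates the situation in this report: the host at 10.0.0.9
// (a made-up address) never answers, every other host succeeds.
func fakeRemove(ctx context.Context, host, member string) error {
	if host == "10.0.0.9" {
		<-ctx.Done() // hang until the per-host deadline fires
		return ctx.Err()
	}
	return nil
}

func main() {
	hosts := []string{"10.0.0.9", "10.0.0.2", "10.0.0.3"}
	host, err := removeViaAnyHost(hosts, "old-node", 200*time.Millisecond, fakeRemove)
	fmt.Println(host, err)
}
```

With the simulated hang on the first host, the loop moves on after the deadline and succeeds via the second host. Whether the real removal path can be bounded this way depends on how the etcd client call is made, which we have not verified.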
Further Comments
Changing the order of the nodes fixes the issue, i.e. if the new node is anywhere but first in the list of nodes, the system works out the current state and reconciles.
Performing an rke reconciliation after removing the OpenStack VM would probably prevent the problem from manifesting.
In our case, we are using Terraform provisioning, which does not easily give us the option of removing the node from the cluster before adding the new node.
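As a rough sketch of the reordering workaround, the node list could be rearranged before cluster.yml is rendered so that nodes already in the running cluster come first and replacement nodes come last. The `Node` type, the `existingFirst` helper and the addresses below are hypothetical and only illustrate the idea; they are not part of RKE or of our Terraform setup.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a cluster.yml node entry.
type Node struct {
	Address string
	Roles   []string
}

// existingFirst returns a copy of nodes in which every node whose address is
// in known (addresses already part of the running cluster) comes before any
// new node. The relative order within each group is preserved.
func existingFirst(nodes []Node, known map[string]bool) []Node {
	ordered := make([]Node, 0, len(nodes))
	var fresh []Node
	for _, n := range nodes {
		if known[n.Address] {
			ordered = append(ordered, n)
		} else {
			fresh = append(fresh, n)
		}
	}
	return append(ordered, fresh...)
}

func main() {
	nodes := []Node{
		{Address: "10.0.0.21", Roles: []string{"etcd", "controlplane"}}, // replacement VM
		{Address: "10.0.0.2", Roles: []string{"etcd", "controlplane"}},
		{Address: "10.0.0.3", Roles: []string{"etcd", "controlplane"}},
	}
	known := map[string]bool{"10.0.0.2": true, "10.0.0.3": true}
	for _, n := range existingFirst(nodes, known) {
		fmt.Println(n.Address)
	}
}
```

This only sidesteps the problem by keeping a reachable etcd node at the head of the list; it does not address the missing timeout itself.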
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
RKE version: 1.3.13
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Provisioned VMs on OpenStack.
cluster.yml file: