
Conflicting error messages on a non-functioning cluster #557

Closed
evilnick opened this issue Jul 17, 2024 · 4 comments

Comments

@evilnick (Contributor)

Summary

In a multi-node cluster where one of the control-plane nodes has disappeared:

ubuntu@able-antelope:~$ sudo k8s status
Error: The node is not part of a Kubernetes cluster. You can bootstrap a new cluster with:

  sudo k8s bootstrap
ubuntu@able-antelope:~$ sudo k8s bootstrap
Error: The node is already part of a cluster
ubuntu@able-antelope:~$ 

What Should Happen Instead?

The first error is wrong. Status should instead report that the cluster is not in a working state, rather than that it has not been bootstrapped.

Reproduction Steps

  1. Set up a cluster with two or more control-plane nodes
  2. Remove one of the control-plane nodes
  3. Run k8s status on one of the nodes that still exists (a reproduction sketch follows this list)
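
A rough sketch of these steps on LXD is below; the instance names, Ubuntu image, snap channel, and the choice of LXD itself are illustrative assumptions rather than details taken from the report:

  # 1. Launch three instances and form a cluster of control-plane nodes
  lxc launch ubuntu:22.04 cp1
  lxc launch ubuntu:22.04 cp2
  lxc launch ubuntu:22.04 cp3
  for node in cp1 cp2 cp3; do
    lxc exec "$node" -- snap install k8s --classic --channel=1.30-classic/stable
  done
  lxc exec cp1 -- k8s bootstrap
  lxc exec cp1 -- k8s get-join-token cp2        # prints a join token for cp2
  lxc exec cp2 -- k8s join-cluster <token-for-cp2>
  lxc exec cp1 -- k8s get-join-token cp3
  lxc exec cp3 -- k8s join-cluster <token-for-cp3>

  # 2. Remove one of the control planes abruptly (no graceful removal)
  lxc delete cp3 --force

  # 3. Run status on a node which still exists and observe the conflicting errors
  lxc exec cp1 -- k8s status      # "The node is not part of a Kubernetes cluster..."
  lxc exec cp1 -- k8s bootstrap   # "The node is already part of a cluster"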

System information

inspection-report-20240717_133154.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response

@bschimke95 (Contributor)

@HomayoonAlimohammadi this should be addressed by #564, right?

@HomayoonAlimohammadi (Contributor)

@bschimke95 I think we still had some problems, which Angelos fixed in #599.
Let me try to reproduce this issue.

@HomayoonAlimohammadi (Contributor) commented Aug 22, 2024

Here's what I did:

  • Created a 3-node cluster (all control-plane nodes)
  • Killed one of the nodes (lxc delete --force)
  • Ran k8s status on one of the remaining nodes:
    • The first time I got a deadline-exceeded error
    • The second time I got the status, but the IP of the removed node is still there and we can see heartbeats failing for that node (log line and a viewing command below):
Aug 22 07:37:40 cp1 k8s.k8sd[1804]: time="2024-08-22T07:37:40Z" level=error msg="Received error sending heartbeat to cluster member" error="Post \"https://10.97.72.146:6400/core/internal/heartbeat\": Unable to connect to \"10.97.72.146:6400\": dial tcp 10.97.72.146:6400: connect: no route to host" target="10.97.72.146:6400"
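
One way to surface these heartbeat errors on a surviving node is via the journal; the snap.k8s.k8sd unit name here is an assumption inferred from the k8s.k8sd prefix in the log line, not confirmed by the report:

  # Inspect k8sd logs on a remaining control-plane node (unit name assumed)
  lxc exec cp1 -- journalctl -u snap.k8s.k8sd --no-pager | grep heartbeat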

I retried this, but this time, instead of killing a node, I did a k8s remove-node and everything seems fine: k8s status shows the correct message on all nodes (existing and removed) and the IP is removed. I even removed 2 nodes in a 3-control-plane setup and everything still works fine.
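
For comparison, a minimal sketch of that graceful path, reusing the hypothetical node names from the earlier sketch:

  # Deregister the node from the cluster before deleting the instance
  lxc exec cp1 -- k8s remove-node cp3
  lxc delete cp3 --force

  # Status on the remaining nodes now reports the cluster correctly,
  # and the removed node's IP no longer appears
  lxc exec cp1 -- k8s status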

@maci3jka (Contributor) commented Nov 2, 2024

My tests confirm @HomayoonAlimohammadi's observations; this looks fixed now.

maci3jka closed this as completed on Nov 2, 2024