talos_cluster_health times out when adding new nodes #221
Comments
Facing the exact same issue here. We need the health check only on the first run, to confirm the cluster is up before deploying the CNI. There should be …
That does not work; it keeps trying to connect to the IP of the new worker node (since we are passing the IPs of all worker nodes, calculated in a local var, the IP of the new worker node is in there) and then times out.
I have the same issue. I was not able to find a workaround with the talos_cluster_health data source, so I reverted to using local-exec provisioners to check cluster health in a scaling-compatible manner.
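For reference, a minimal sketch of that kind of local-exec workaround (the local names, the talosconfig variable, and the machine-config apply resources are assumptions, not copied from the commenter's code). Because the check runs at apply time, a plan never blocks on a node that doesn't exist yet:

```hcl
# Hypothetical sketch: run `talosctl health` at apply time instead of
# reading the talos_cluster_health data source during plan.
resource "null_resource" "cluster_health" {
  # Re-run the check whenever the node lists change (e.g. after scaling).
  triggers = {
    controlplane = join(",", local.controlplane_ips)
    workers      = join(",", local.worker_ips)
  }

  # Only check after every node has received its machine configuration.
  depends_on = [
    talos_machine_configuration_apply.controlplane,
    talos_machine_configuration_apply.worker,
  ]

  provisioner "local-exec" {
    command = <<-EOT
      talosctl health \
        --talosconfig ${var.talosconfig_path} \
        --nodes ${local.controlplane_ips[0]} \
        --control-plane-nodes ${join(",", local.controlplane_ips)} \
        --worker-nodes ${join(",", local.worker_ips)} \
        --wait-timeout 5m
    EOT
  }
}
```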
That does not seem to work in my case: the cluster is healthy Talos-wise, but nodes are not ready because the CNI is not installed yet. For some reason, the health check is timing out (and I'm using …). I'd like to have the Cilium helm-release (which is going to fix that) depend on Talos' health.
Hey @jonkerj, it does work for us without any CNI in place. We are also using it as a pre-check for the helm install. To confirm: are you passing the IP addresses of all CP and worker nodes to it? What's the error message you get at the end, when it fails/times out?
You might be on to something: it's complaining about unexpected worker nodes.
I left the worker nodes out, as I don't care about their health when installing Cilium (I just want the k8s API server to be there), but apparently the check needs them. Having fixed this, it actually works. So never mind my previous comment, and thanks for the hint, @nebula-it.
When creating a basic Talos cluster with Terraform, I think there's a kind of logical problem that fails health checks in the plan phase when adding new nodes. The talos_cluster_health data source takes a list of all controlplane and worker nodes. On the first run this is fine: the nodes are configured, and the health check passes. But the second time, say you increase your worker node count, the health check times out in the plan phase. The reason is that it now expects an additional worker node to be available, yet that node hasn't been given its machine configuration in the apply step. So the health check runs during the plan phase, expects a new healthy worker node that has yet to be configured, and fails.
Here's the cluster health config:
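(The config block itself didn't survive the page scrape; below is a minimal sketch of a typical talos_cluster_health data source, with resource and local names assumed rather than taken from the report.)

```hcl
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = local.controlplane_ips
  control_plane_nodes  = local.controlplane_ips
  # All worker IPs, including any node added on this run --
  # this is what trips up the plan-phase check.
  worker_nodes         = local.worker_ips

  timeouts = {
    read = "10m"
  }
}
```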
Add a new node, and the plan times out with:
```
data.talos_cluster_health.this: Still reading... [20s elapsed]
```
Not sure how to fix this; I wish the health check just knew not to check nodes that have yet to be configured in the planning phase.
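One mitigation worth trying (my suggestion, not something confirmed in this thread): Terraform defers a data source read to the apply phase when its depends_on references a resource with pending changes, so tying the health check to the machine-config apply resources should stop it from running during plan while a new node is still unconfigured:

```hcl
data "talos_cluster_health" "this" {
  # ... same arguments as in the config above ...

  # When a new node's talos_machine_configuration_apply has pending
  # changes, Terraform postpones this read until apply time.
  depends_on = [
    talos_machine_configuration_apply.controlplane,
    talos_machine_configuration_apply.worker,
  ]
}
```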