
talos_cluster_health times out when adding new nodes #221

Open
rmvangun opened this issue Dec 19, 2024 · 7 comments

Comments

@rmvangun

rmvangun commented Dec 19, 2024

When creating a basic Talos cluster with Terraform, there's a logical problem that causes health checks to fail during the plan phase when adding new nodes. The talos_cluster_health data source takes a list of all control plane and worker nodes. On the first run this is fine: the nodes are configured and the health check passes. But the second time, say you increase your worker node count, the health check times out in the plan phase. The reason is that it now expects the additional worker node to be available, yet that node hasn't been given its machine configuration until the apply step.

So, the health check runs during the plan phase, expects a new healthy worker node that has yet to be configured, and so fails the check.

Here's the cluster health config:

```hcl
data "talos_cluster_health" "this" {
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes.*.node
  worker_nodes         = module.workers.*.node
  endpoints            = module.controlplanes.*.endpoint
}
```

Add a new node, and the plan times out with: `data.talos_cluster_health.this: Still reading... [20s elapsed]`

I'm not sure how to fix this; I wish the health check simply knew not to check nodes that have yet to be configured during the plan phase.
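One possible stopgap, sketched here with an illustrative variable name that is not a provider feature, is to gate the data source behind a flag so the check can be skipped while scaling and re-enabled once the new nodes are configured:

```hcl
# Sketch of a possible workaround (skip_cluster_health is an
# illustrative name, not a provider attribute): disable the health
# check while scaling, then turn it back on after the apply that
# configures the new nodes.
variable "skip_cluster_health" {
  type    = bool
  default = false
}

data "talos_cluster_health" "this" {
  count      = var.skip_cluster_health ? 0 : 1
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes.*.node
  worker_nodes         = module.workers.*.node
  endpoints            = module.controlplanes.*.endpoint
}
```

Anything that previously referenced the data source would then need to use `data.talos_cluster_health.this[0]`, which is the main downside of the `count` approach.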

@nebula-it

Facing the exact same issue here. We only need the health check on the first run, to confirm the cluster is up before deploying the CNI. There should be a skip_node_check option so that it doesn't check anything about individual nodes and just makes sure the cluster endpoint is healthy.

@nebula-it

That does not work: it keeps trying to connect to the IP of the new worker node (since we pass the IPs of all worker nodes, calculated in a local variable, the new worker's IP is in that list) and then times out.

@ionfury

ionfury commented Jan 20, 2025

I have the same issue. I was not able to find a workaround with the talos_cluster_health data source.

I reverted to using local-exec provisioners to check cluster health in a scaling-compatible manner (link).
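A minimal sketch of that local-exec approach, assuming `talosctl` is on the PATH and a talosconfig file exists locally (the path, timeout, and trigger expression are illustrative, not taken from the linked repo):

```hcl
# Sketch: run `talosctl health` via a local-exec provisioner instead of
# the data source, so the check happens during apply (after the new
# nodes are configured) rather than during plan.
resource "null_resource" "cluster_health" {
  # Re-run the check whenever the node set changes.
  triggers = {
    nodes = join(",", module.controlplanes.*.node)
  }

  provisioner "local-exec" {
    command = <<-EOT
      talosctl health \
        --talosconfig ./talosconfig \
        --nodes ${join(",", module.controlplanes.*.node)} \
        --wait-timeout 5m
    EOT
  }
}
```

Because provisioners only run at apply time, a plan that adds workers no longer tries to health-check nodes that don't exist yet.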

@jonkerj

jonkerj commented Jan 23, 2025

> you'd need to use https://registry.terraform.io/providers/siderolabs/talos/latest/docs/data-sources/cluster_health#skip_kubernetes_checks-1

That does not seem to work in my case: the cluster is healthy Talos-wise, but the nodes are not Ready because the CNI is not installed yet. For some reason, the health check times out indefinitely, even though I'm using skip_kubernetes_checks = true.

I'd like to have the Cilium helm-release (that is going to fix that) depend on talos' health.

@nebula-it

Hey @jonkerj, it does work for us without any CNI in place. We are also using it as a pre-check before the Helm install. To confirm: are you passing the IP addresses of all control plane and worker nodes to it? What's the error message you get at the end, when it fails or times out?

@jonkerj

jonkerj commented Jan 24, 2025

You might be on to something; it's complaining about unexpected worker nodes:

```text
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for no diagnostics: ...
waiting for no diagnostics: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: unexpected nodes with IPs [REDACTED]
```

I left the worker nodes out, as I don't care about their health when installing Cilium (I just want the Kubernetes API server to be there), but apparently the check needs them. Having fixed this, it actually works. So never mind my previous comment, and thanks for this hint, @nebula-it.
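Putting the thread's resolution together, the working shape is roughly the following (the helm_release stub is illustrative, not part of any config shown in this thread):

```hcl
# Working combination from this thread: pass *all* node IPs (control
# plane and workers) and skip the Kubernetes-level checks, so the
# health check passes even before a CNI is installed.
data "talos_cluster_health" "this" {
  client_configuration   = talos_machine_secrets.this.client_configuration
  control_plane_nodes    = module.controlplanes.*.node
  worker_nodes           = module.workers.*.node
  endpoints              = module.controlplanes.*.endpoint
  skip_kubernetes_checks = true
}

# The CNI install can then depend on Talos-level health, e.g.
# (illustrative stub):
#
# resource "helm_release" "cilium" {
#   depends_on = [data.talos_cluster_health.this]
#   # ...
# }
```

Note that this fixes the "unexpected nodes" failure above, but not the original scaling problem: adding a worker still puts an unconfigured node's IP into worker_nodes.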
