
talos_cluster_health times out when adding new nodes #221

Open
rmvangun opened this issue Dec 19, 2024 · 7 comments

Comments

@rmvangun

rmvangun commented Dec 19, 2024

When creating a basic Talos cluster with Terraform, there's a logical problem that causes health checks to fail during the plan phase when adding new nodes. The talos_cluster_health data source takes a list of all control plane and worker nodes. On the first run this is fine: the nodes are configured and the health check passes. But the second time, say you increase your worker node count, the health check times out in the plan phase. The reason is that it now expects the additional worker node to be available, yet that node hasn't been given its machine configuration until the apply step.

So, the health check runs during the plan phase, expects a new healthy worker node that has yet to be configured, and so fails the check.

Here's the cluster health config:

```hcl
data "talos_cluster_health" "this" {
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes.*.node
  worker_nodes         = module.workers.*.node
  endpoints            = module.controlplanes.*.endpoint
}
```

Add a new node, and the plan times out with: `data.talos_cluster_health.this: Still reading... [20s elapsed]`

I'm not sure how to fix this; I wish the health check simply knew not to check nodes that have yet to be configured during the plan phase.
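One possible stopgap, sketched here with an illustrative variable name that is not a provider feature, is to gate the data source behind a flag so the check can be skipped while scaling and re-enabled once the new nodes are configured:

```hcl
# Sketch of a possible workaround (skip_cluster_health is an
# illustrative name, not a provider attribute): disable the health
# check while scaling, then turn it back on after the apply that
# configures the new nodes.
variable "skip_cluster_health" {
  type    = bool
  default = false
}

data "talos_cluster_health" "this" {
  count      = var.skip_cluster_health ? 0 : 1
  depends_on = [talos_cluster_kubeconfig.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  control_plane_nodes  = module.controlplanes.*.node
  worker_nodes         = module.workers.*.node
  endpoints            = module.controlplanes.*.endpoint
}
```

Anything that previously referenced the data source would then need to use `data.talos_cluster_health.this[0]`, which is the main downside of the `count` approach.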

@nebula-it

Facing the exact same issue here. We only need the health check on the first run, to confirm the cluster is up before deploying the CNI. There should be a skip_node_check option so that it doesn't check anything about individual nodes and just makes sure the cluster endpoint is healthy.

@nebula-it

That does not work: it keeps trying to connect to the IP of the new worker node (since we pass the IPs of all worker nodes, calculated in a local variable, the new worker's IP is in that list) and then times out.

@ionfury

ionfury commented Jan 20, 2025

I have the same issue. I was not able to find a workaround with the talos_cluster_health data source.

I reverted to using local-exec provisioners to check cluster health in a scaling-compatible manner (link).
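A minimal sketch of that local-exec approach, assuming `talosctl` is on the PATH and a talosconfig file exists locally (the path, timeout, and trigger expression are illustrative, not taken from the linked repo):

```hcl
# Sketch: run `talosctl health` via a local-exec provisioner instead of
# the data source, so the check happens during apply (after the new
# nodes are configured) rather than during plan.
resource "null_resource" "cluster_health" {
  # Re-run the check whenever the node set changes.
  triggers = {
    nodes = join(",", module.controlplanes.*.node)
  }

  provisioner "local-exec" {
    command = <<-EOT
      talosctl health \
        --talosconfig ./talosconfig \
        --nodes ${join(",", module.controlplanes.*.node)} \
        --wait-timeout 5m
    EOT
  }
}
```

Because provisioners only run at apply time, a plan that adds workers no longer tries to health-check nodes that don't exist yet.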

@jonkerj

jonkerj commented Jan 23, 2025

> you'd need to use https://registry.terraform.io/providers/siderolabs/talos/latest/docs/data-sources/cluster_health#skip_kubernetes_checks-1

That does not seem to work in my case: the cluster is healthy Talos-wise, but the nodes are not Ready because the CNI is not installed yet. For some reason, the health check times out indefinitely, even though I'm using skip_kubernetes_checks = true.

I'd like to have the Cilium helm-release (that is going to fix that) depend on talos' health.

@nebula-it

Hey @jonkerj, it does work for us without any CNI in place. We are also using it as a pre-check before the Helm install. To confirm: are you passing the IP addresses of all control plane and worker nodes to it? What's the error message you get at the end, when it fails or times out?

@jonkerj

jonkerj commented Jan 24, 2025

You might be on to something; it's complaining about unexpected worker nodes:

```text
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for no diagnostics: ...
waiting for no diagnostics: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: unexpected nodes with IPs [REDACTED]
```

I left the worker nodes out, as I don't care about their health when installing Cilium (I just want the Kubernetes API server to be there), but apparently the check needs them. Having fixed this, it actually works. So never mind my previous comment, and thanks for this hint, @nebula-it.
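Putting the thread's resolution together, the working shape is roughly the following (the helm_release stub is illustrative, not part of any config shown in this thread):

```hcl
# Working combination from this thread: pass *all* node IPs (control
# plane and workers) and skip the Kubernetes-level checks, so the
# health check passes even before a CNI is installed.
data "talos_cluster_health" "this" {
  client_configuration   = talos_machine_secrets.this.client_configuration
  control_plane_nodes    = module.controlplanes.*.node
  worker_nodes           = module.workers.*.node
  endpoints              = module.controlplanes.*.endpoint
  skip_kubernetes_checks = true
}

# The CNI install can then depend on Talos-level health, e.g.
# (illustrative stub):
#
# resource "helm_release" "cilium" {
#   depends_on = [data.talos_cluster_health.this]
#   # ...
# }
```

Note that this fixes the "unexpected nodes" failure above, but not the original scaling problem: adding a worker still puts an unconfigured node's IP into worker_nodes.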
