Increase timeout for apiserver check #2362

whiterm · 2023-08-03T15:51:43Z

Issue #, if available:

Description of changes:
These changes increase the timeout for the apiserver check. The old check, with a 30-second timeout, crashes the installation immediately after a single failed attempt. In a configuration with a load balancer, 30 seconds is not enough time for the apiservers to become reachable.
With the new behavior, the apiserver check is attempted at least 5 times.

Error from an environment with a load balancer:

Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 2, hitting url HTTP://127.0.0.1:2381/health?exclude=NOSPACE&serializable=true ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: {"health":"true","reason":""}
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-apiserver
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://10.1.0.94:6443/livez
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://10.1.0.94:6443/livez ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-controller-manager
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://127.0.0.1:10257/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://127.0.0.1:10257/healthz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-scheduler
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://127.0.0.1:10259/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://127.0.0.1:10259/healthz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: map[mgmt:0xc000164780]
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: APIServer is https://10.1.0.175:6443
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://10.1.0.94:6443/readyz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://10.1.0.94:6443/readyz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url https://10.1.0.175:6443/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url https://10.1.0.175:6443/healthz ******
Aug 03 00:46:39 mgmt-5p2zv host-ctr[1107]: Error occured while hitting url: Get "https://10.1.0.175:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: Error running bootstrapper cmd: error initing controlplane: Timeout occurred while waiting for 200 OK
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: time="2023-08-03T00:46:49Z" level=info msg="container task exited" code=1
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: time="2023-08-03T00:46:49Z" level=fatal msg="Container kubeadm-bootstrap exited with non-zero status"
Aug 03 00:46:49 mgmt-5p2zv systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Aug 03 00:46:49 mgmt-5p2zv systemd[1]: [email protected]: Failed with result 'exit-code'.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

eks-distro-bot · 2023-08-03T15:51:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign a-cool-train for approval by writing /assign @a-cool-train in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eks-distro-bot · 2023-08-03T15:51:55Z

Hi @whiterm. Thanks for your PR.

I'm waiting for an aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jaxesn · 2023-08-04T15:07:58Z

Thanks for the contribution!

Can you tell me a bit more about your setup? Do you have an LB that is provisioned during the cluster creation process? Is that where 10.1.0.175 comes from? And might this LB take a few minutes to fully provision and point to the newly created CP node?

whiterm · 2023-08-04T15:56:10Z

@jaxesn
LBs, control plane nodes, and worker nodes are created during infrastructure provisioning. This LB forwards traffic only to control plane nodes. Once the infrastructure is ready, I start creating a Kubernetes cluster using the Tinkerbell provider in EKS-A. This error does not always appear; sometimes the cluster is created without errors. Apparently 30 seconds is not enough for the LB to determine which control plane node is ready to accept traffic.
This error appears only on Bottlerocket OS; with Ubuntu there are no errors.

jaxesn · 2023-08-04T19:07:38Z

What is the 10.1.0.175 IP address? Is that the kube-vip/control-plane-ip?

whiterm · 2023-08-04T21:05:11Z

@jaxesn
10.1.0.175 is the IP address of the load balancer. I disable kube-vip by setting skipLoadBalancerDeployment: true in the cluster spec:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: simple-cluster
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: ${CONTROL_PLANES}
    skipLoadBalancerDeployment: true
    endpoint:
      host: "${LOADBALANCER_IP}"
...

jaxesn · 2023-08-07T15:54:17Z

Awesome, thanks for the info! What is configuring the LB to have the newly created CP node IPs? Is that something you are running externally to the create cluster process?

whiterm · 2023-08-07T16:17:43Z

@jaxesn
I have a script that first creates a group of control plane nodes, then configures the load balancer to forward traffic only to control plane nodes, using a TCP node health check, and then starts the cluster creation process:

ssh -F ssh-config boot-user@bootstrap 'sudo eksctl create cluster anywhere --hardware-csv hardware.csv -f cluster.yaml'

Increase timeout for apiserver check

cd1d9f1

eks-distro-bot added the needs-ok-to-test label Aug 3, 2023

eks-distro-bot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Aug 3, 2023
