Increase timeout for apiserver check #2362

Open
wants to merge 1 commit into
base: main
from


@whiterm whiterm commented Aug 3, 2023

Issue #, if available:

Description of changes:
This change increases the timeout for the apiserver check. The old check, with a 30-second timeout, fails the installation immediately after a single failed attempt. In a configuration with a load balancer, 30 seconds is not enough for the apiserver to become reachable through it.
With the new behavior, the apiserver check is attempted at least 5 times.

Error from an environment with a load balancer:

Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 2, hitting url HTTP://127.0.0.1:2381/health?exclude=NOSPACE&serializable=true ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: {"health":"true","reason":""}
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-apiserver
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://10.1.0.94:6443/livez
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://10.1.0.94:6443/livez ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-controller-manager
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://127.0.0.1:10257/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://127.0.0.1:10257/healthz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for probe check on pod: kube-scheduler
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://127.0.0.1:10259/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://127.0.0.1:10259/healthz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: map[mgmt:0xc000164780]
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: APIServer is https://10.1.0.175:6443
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url HTTPS://10.1.0.94:6443/readyz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url HTTPS://10.1.0.94:6443/readyz ******
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ok
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: Waiting for 200: OK on url https://10.1.0.175:6443/healthz
Aug 03 00:45:39 mgmt-5p2zv host-ctr[1107]: ******  Try 1, hitting url https://10.1.0.175:6443/healthz ******
Aug 03 00:46:39 mgmt-5p2zv host-ctr[1107]: Error occured while hitting url: Get "https://10.1.0.175:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: Error running bootstrapper cmd: error initing controlplane: Timeout occurred while waiting for 200 OK
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: time="2023-08-03T00:46:49Z" level=info msg="container task exited" code=1
Aug 03 00:46:49 mgmt-5p2zv host-ctr[1107]: time="2023-08-03T00:46:49Z" level=fatal msg="Container kubeadm-bootstrap exited with non-zero status"
Aug 03 00:46:49 mgmt-5p2zv systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Aug 03 00:46:49 mgmt-5p2zv systemd[1]: [email protected]: Failed with result 'exit-code'.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign a-cool-train for approval by writing /assign @a-cool-train in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot
Collaborator

Hi @whiterm. Thanks for your PR.

I'm waiting for an aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eks-distro-bot eks-distro-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 3, 2023
@jaxesn
Member

jaxesn commented Aug 4, 2023

Thanks for the contribution!

Can you tell me a bit more about your setup? Do you have an LB that is provisioned during the cluster creation process, and is that where 10.1.0.175 comes from? And might this LB take a few minutes to fully provision and point to the newly created CP node?

@whiterm
Author

whiterm commented Aug 4, 2023

@jaxesn
LBs, control plane nodes, and worker nodes are created during infrastructure provisioning. This LB forwards traffic only to control plane nodes. Once the infrastructure is ready, I start creating a Kubernetes cluster using the Tinkerbell provider in EKS-A. This error does not always appear; sometimes the cluster is created without errors. Apparently, 30 seconds is not enough for the LB to determine which control plane node is ready to receive traffic.
This error appears only on Bottlerocket OS. With Ubuntu there are no errors.

@jaxesn
Member

jaxesn commented Aug 4, 2023

What is the 10.1.0.175 IP address? Is that the kube-vip/control-plane-ip?

@whiterm
Author

whiterm commented Aug 4, 2023

@jaxesn
10.1.0.175 is the IP address of the load balancer. I disable kube-vip with skipLoadBalancerDeployment: true

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: simple-cluster
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: ${CONTROL_PLANES}
    skipLoadBalancerDeployment: true
    endpoint:
      host: "${LOADBALANCER_IP}"
...

@jaxesn
Member

jaxesn commented Aug 7, 2023

Awesome, thanks for the info! What configures the LB with the newly created CP node IPs? Is that something you are running externally to the create cluster process?

@whiterm
Author

whiterm commented Aug 7, 2023

@jaxesn
I have a script that first creates a group of control plane nodes, then configures the load balancer to forward traffic only to control plane nodes, with a node health check over TCP, and then starts the cluster creation process:
ssh -F ssh-config boot-user@bootstrap 'sudo eksctl create cluster anywhere --hardware-csv hardware.csv -f cluster.yaml'

Labels
needs-ok-to-test size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.