A newly created node is not able to reach the T1 edge after an update #504

Open
TimTinneveld opened this issue Aug 10, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@TimTinneveld

Describe the bug

After redeployment/updates of control planes or workers, newly created VMs immediately get the same IP address as the deleted old VM (because that is the first free IP in the pool). This can cause issues because most of the time the templates provided by VMware don't send any ARP requests announcing that they are now using the previously used IP. By default, NSX-T only rediscovers the IP after a timeout of 8 minutes (the ARP table in NSX-T is not updated immediately). Because of the stale ARP entry, the VM cannot ping the T1 router, and as a result it also cannot reach the API server and join the cluster.

So far I have been able to activate the retry-join option for the cluster, but this puts the VM creation time at around 10 minutes.
The second solution was to run an arping after booting the VM; now VM creation times are around 2-3 minutes, which is acceptable for me.

I noticed that after running the command "arping -U -I ens192 <VM_SELF_IP_address> -c 3" the Tier-1 router becomes responsive. As a fix I have added this command to the template; this brings the creation time to a steady 2-3 minutes every time.
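A minimal sketch of that workaround as a boot-time script (assumptions: ens192 is the uplink interface and the iputils arping is installed; deriving the address avoids hard-coding <VM_SELF_IP_address>):

    # Send gratuitous ARP replies for the VM's own address so the NSX-T ARP
    # table is refreshed immediately instead of after its ~8 minute timeout.
    IFACE=ens192
    SELF_IP=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1)
    arping -U -I "$IFACE" "$SELF_IP" -c 3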

Reproduction steps

  1. Update the nodes to a new image.
  2. One node is deleted and a new node is created with the same IP as the deleted node.
  3. The new node cannot reach the Tier-1 gateway; because of this it cannot join the cluster.
    ...

Expected behavior

The newly created node should be able to reach the gateway immediately, which would also lower the creation time.

Additional context

The error that can be seen when the load balancer is not reachable: “[preflight] Running pre-flight checks
[2023-05-26 14:59:39] error execution phase preflight: couldn’t validate the identity of the API Server: Get “https://****:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s”: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[2023-05-26 14:59:39] To see the stack trace of this error execute with --v=5 or higher”

@TimTinneveld TimTinneveld added the bug Something isn't working label Aug 10, 2023
@arunmk arunmk self-assigned this Aug 10, 2023
@arunmk
Collaborator

arunmk commented Aug 10, 2023

The specific fix as root-caused by @TimTinneveld is to have the following in the cloud-init file:

arping -U -I ens192 <VM_SELF_IP_address> -c 3
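
A hedged sketch of how that line could be placed in the cloud-init user data (the runcmd placement is an assumption; <VM_SELF_IP_address> stays a placeholder, as above):

    #cloud-config
    runcmd:
      # Gratuitous ARP on first boot so the T1/NSX-T ARP cache learns the reused
      # IP right away; ens192 and <VM_SELF_IP_address> are placeholders from above.
      - arping -U -I ens192 <VM_SELF_IP_address> -c 3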
