A newly created node is not able to reach the T1 edge after an update #504

Open
TimTinneveld opened this issue Aug 10, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@TimTinneveld

Describe the bug

After redeployment/updates of control planes or workers, newly created VMs immediately get the same IP address as the deleted old VM (because that is the first free IP in the pool). This can cause issues because most of the time the templates provided by VMware don't send any ARP requests announcing that they are now using the previously used IP. By default, NSX-T only rediscovers the IP after a timeout of 8 minutes (the ARP table in NSX-T is not updated immediately). Because of the stale ARP entry, the VM cannot ping the T1 router, and as a result it also cannot reach the API server and join the cluster.

So far I have been able to activate the retry-join option for the cluster, but this puts the VM creation time at around 10 minutes.
The second solution was to run an arping after booting the VM; now VM creation times are around 2-3 minutes, which is acceptable for me.

I noticed that after running the command "arping -U -I ens192 <VM_SELF_IP_address> -c 3" the Tier-1 router becomes responsive. As a fix I have added this command to the template; this brings the creation time to a steady 2-3 minutes every time.
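A minimal sketch of that workaround as a boot-time script (assumptions: ens192 is the uplink interface and the iputils arping is installed; deriving the address avoids hard-coding <VM_SELF_IP_address>):

    # Send gratuitous ARP replies for the VM's own address so the NSX-T ARP
    # table is refreshed immediately instead of after its ~8 minute timeout.
    IFACE=ens192
    SELF_IP=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1)
    arping -U -I "$IFACE" "$SELF_IP" -c 3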

Reproduction steps

  1. Update the nodes to a new image.
  2. One node is deleted and a new node is created with the same IP as the deleted node.
  3. The new node cannot reach the Tier-1 gateway; because of this it cannot join the cluster.
    ...

Expected behavior

The newly created node should be able to reach the gateway immediately, which would also lower the creation time.

Additional context

The error that can be seen when the load balancer is not reachable: “[preflight] Running pre-flight checks
[2023-05-26 14:59:39] error execution phase preflight: couldn’t validate the identity of the API Server: Get “https://****:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s”: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[2023-05-26 14:59:39] To see the stack trace of this error execute with --v=5 or higher”

@TimTinneveld TimTinneveld added the bug Something isn't working label Aug 10, 2023
@arunmk arunmk self-assigned this Aug 10, 2023
@arunmk
Collaborator

arunmk commented Aug 10, 2023

The specific fix as root-caused by @TimTinneveld is to have the following in the cloud-init file:

arping -U -I ens192 <VM_SELF_IP_address> -c 3
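
A hedged sketch of how that line could be placed in the cloud-init user data (the runcmd placement is an assumption; <VM_SELF_IP_address> stays a placeholder, as above):

    #cloud-config
    runcmd:
      # Gratuitous ARP on first boot so the T1/NSX-T ARP cache learns the reused
      # IP right away; ens192 and <VM_SELF_IP_address> are placeholders from above.
      - arping -U -I ens192 <VM_SELF_IP_address> -c 3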
