
Robot server is tainted and does not initialise because of FailedToCreateRoute error #796

Open
Taronyuu opened this issue Nov 20, 2024 · 4 comments
Labels
bug Something isn't working

Comments


Taronyuu commented Nov 20, 2024

TL;DR

Adding a robot server to my existing K3S cluster with HCCM results in my node not initialising because of HCCM trying to create a route which is not supported.

Expected behavior

I would expect the new robot server to become part of the cluster, either by having the route registered in the private network or by not breaking the initialisation of the node. I would also expect the IP address of the robot server to be added to the Load Balancer.

Observed behavior

The route is not being added and fails with an error:

  Warning  FailedToCreateRoute             47m                 route_controller  Could not create route 9e195a31-799d-46aa-8e2c-1d6299998e95 10.42.26.0/24 for node k3s-n2-fsn1-duv after 58.88µs: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found

Because of this, taints are added and the server is never initialised:

  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: 'true'
      effect: NoSchedule
    - key: node.kubernetes.io/network-unavailable
      effect: NoSchedule
      timeAdded: '2024-11-20T00:33:26Z'

The node is not added to my Load Balancer.

Minimal working example

Install command for my robot server.

curl -sfL https://get.k3s.io | \
INSTALL_K3S_EXEC="agent \
--node-ip=10.199.0.2 \
--node-name=k3s-n2-fsn1-duv \
--flannel-iface=enp5s0.4000 \
--kubelet-arg=cloud-provider=external \
--kubelet-arg=volume-plugin-dir=/var/lib/kubelet/volumeplugins \
--kubelet-arg=kube-reserved=cpu=500m,memory=1000Mi,ephemeral-storage=1Gi \
--kubelet-arg=system-reserved=cpu=500m,memory=1000Mi \
--kubelet-arg=root-dir=/var/lib/kubelet \
--node-label=k3s_upgrade=true \
--node-label=node.kubernetes.io/server-swap=enabled \
--node-label=node.kubernetes.io/exclude-from-external-load-balancers=true \
--node-label=instance.hetzner.cloud/provided-by=robot \
--node-label=instance.hetzner.cloud/is-root-server=true \
--kubelet-arg=provider-id=hrobot://2493173 \
--selinux" \
K3S_URL="https://10.255.0.101:6443" \
K3S_TOKEN="<REDACTED>" \
sh -

Log output

Normal   RegisteredNode                  48m                 node-controller   Node k3s-n2-fsn1-duv event: Registered Node k3s-n2-fsn1-duv in Controller
  Warning  FailedToCreateRoute             48m                 route_controller  Could not create route 9e195a31-799d-46aa-8e2c-1d6299998e95 10.42.26.0/24 for node k3s-n2-fsn1-duv after 94.44µs: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found
  Warning  FailedToCreateRoute             47m                 route_controller  Could not create route 9e195a31-799d-46aa-8e2c-1d6299998e95 10.42.26.0/24 for node k3s-n2-fsn1-duv after 68.241µs: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found
  Warning  FailedToCreateRoute             47m                 route_controller  Could not create route 9e195a31-799d-46aa-8e2c-1d6299998e95 10.42.26.0/24 for node k3s-n2-fsn1-duv after 58.88µs: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found

Additional information

I have added a vSwitch to my robot server and configured it:
(screenshot: vSwitch configuration on the robot server)

I have added the vSwitch to my private network:
(screenshot: vSwitch attached to the private network)

I can confirm that pinging any node in my existing cluster from my robot server, and vice versa, works as expected. So the connection is definitely there.

Then I installed K3S using the command above. I added both the old and the new provided-by=robot labels to make sure the csi-driver ignores this robot server. The documentation of the CSI driver says: "If you are using the hcloud-cloud-controller-manager version 1.21.0 or later, these labels are added automatically. Otherwise, you will need to label the nodes manually." (https://github.com/hetznercloud/csi-driver/blob/main/docs/kubernetes/README.md#integration-with-root-servers) The labels were not added automatically, and as far as I can see there is no 1.21.0 release, which makes me think it has not been published yet. Possibly unrelated, maybe not. Sharing it just in case!
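If the labels are not applied automatically, they can be set on the Node object by hand. A sketch of the relevant node metadata, using the label keys mentioned above (which label the csi-driver actually reads in a given version is an assumption here):

```yaml
# Node labels so the csi-driver treats this machine as a Robot server
# (sketch; can be applied with `kubectl label node k3s-n2-fsn1-duv ...`)
metadata:
  labels:
    instance.hetzner.cloud/provided-by: robot      # newer label
    instance.hetzner.cloud/is-root-server: "true"  # older label
```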

Now, I have HCCM installed according to the robot.md documentation including a hcloud secret:

kubectl get secret hcloud -n kube-system  -o yaml
apiVersion: v1
data:
  network: azNz
  robot-password: <REDACTED>
  robot-user: <REDACTED>
  token: <REDACTED>
kind: Secret
type: Opaque
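As an aside, Secret data values are base64-encoded, so they can be sanity-checked quickly; for example, the network field above decodes like this (a generic base64 decode, nothing Hetzner-specific):

```shell
# Decode a base64-encoded Secret data field; `azNz` is the `network`
# value from the Secret shown above.
echo 'azNz' | base64 -d
# → k3s
```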

After installing the node the following taints are added:

  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: 'true'
      effect: NoSchedule
    - key: node.kubernetes.io/network-unavailable
      effect: NoSchedule
      timeAdded: '2024-11-20T00:33:26Z'

As far as I can see, these taints are added because HCCM is failing, as can be seen in the logs:

I1120 00:39:26.634209       1 route_controller.go:216] action for Node "k3s-n2-fsn1-duv" with CIDR "10.42.27.0/24": "add"
I1120 00:39:26.634231       1 route_controller.go:290] route spec to be created: &{ k3s-n2-fsn1-duv false [{InternalIP 10.199.0.2} {Hostname k3s-n2-fsn1-duv}] 10.42.27.0/24 false}
I1120 00:39:26.634271       1 route_controller.go:304] Creating route for node k3s-n2-fsn1-duv 10.42.27.0/24 with hint 2e410eb4-be80-43ec-8cab-3275a9c2ae1f, throttled 15µs
E1120 00:39:26.634314       1 route_controller.go:329] Could not create route 2e410eb4-be80-43ec-8cab-3275a9c2ae1f 10.42.27.0/24 for node k3s-n2-fsn1-duv: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found
I1120 00:39:26.634511       1 event.go:389] "Event occurred" object="k3s-n2-fsn1-duv" fieldPath="" kind="Node" apiVersion="" type="Warning" reason="FailedToCreateRoute" message="Could not create route 2e410eb4-be80-43ec-8cab-3275a9c2ae1f 10.42.27.0/24 for node k3s-n2-fsn1-duv after 45.64µs: hcloud/CreateRoute: hcops/AllServersCache.ByName: k3s-n2-fsn1-duv hcops/AllServersCache.getCache: not found"

I did read in the docs that routes in combination with private networks are not supported for Robot servers. (https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md#routes--private-networks) That is okay because I can do this manually.

However, the problem is that by not being able to create the route, the network-unavailable taint is added, and (presumably) because of that my node is never added to the Load Balancer.

I've got a few questions:

  1. Is it correct that 1.21.0 of HCCM is not released yet?
  2. Is it correct that there is no way to automatically add my robot server into the private network using HCCM?
  3. If yes, is it correct that I will manually have to do this?
  4. Is it correct that by not adding the route, the manager will fail and never add the robot server to the load balancer?
  5. If yes, how can I solve this so my node becomes available and part of the load balancer?
  6. If no, how can I still solve it and is it correct that I will have to add my node manually to the LB?
Taronyuu added the bug label on Nov 20, 2024
Taronyuu (Author) commented:

I spent a bit more time and realised that my HCCM was installed without the ROBOT_ENABLED variable. After adding it, I ran into the error saying that robot servers and routes are not supported. However, I still use cloud servers too and therefore cannot disable networking.

Is there a way to have both robot servers with no networking and cloud servers with networking?

apricote (Member) commented:

I added it and ran into the issue that said robot servers and routes are not supported.

That is correct. It is not that easy to handle the Layer 2 routes of Robot servers and vSwitches in HCCM.

However, I still use cloud servers too and therefore cannot disable networking.
Is there a way to have both robot servers with no networking and cloud servers with networking?

There are two parts to the networking:

  1. Providing private IP addresses and adding Load Balancers to the private network.
  2. Utilizing Cloud Routes to route traffic for Pod CIDRs.

Many people have the routing functionality active (as it is the default when a network is specified) but do not rely on it. This functionality is usually provided by your CNI, and unless you have explicitly disabled it there (e.g. through routingMode: native in Cilium), the routes created by HCCM are not used.
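For example, with Cilium the relevant Helm value would look roughly like this (a sketch; tunnel mode is the default, in which case the HCCM-created routes go unused):

```yaml
# Cilium Helm values (sketch). Only native routing relies on routes
# provided from outside the cluster (e.g. HCCM cloud routes); the
# default tunnel mode encapsulates traffic and needs none.
routingMode: native     # Pod CIDR routes must exist externally
# routingMode: tunnel   # default: encapsulation, no cloud routes needed
```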

I would recommend checking whether you are actually using the cloud routes, and disabling the routes in HCCM otherwise.

You can disable the routes controller by setting HCLOUD_NETWORK_ROUTES_ENABLED=false or by adding the flag --controllers=-node-route-controller.
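Applied to the HCCM Deployment, that would look roughly like this (a sketch of the container spec, not a full manifest):

```yaml
# hcloud-cloud-controller-manager container (sketch): disable the routes
# controller while keeping the rest of the networking support.
env:
  - name: HCLOUD_NETWORK_ROUTES_ENABLED
    value: "false"
# alternatively, via the command-line flag:
# args: ["--controllers=-node-route-controller"]
```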

Taronyuu (Author) commented:

@apricote Thank you for your reply!

I did look into the HCLOUD_NETWORK_ROUTES_ENABLED environment variable, but by setting it I would disable all routes for all servers, correct? Is there an option to disable network routes only for robot servers while keeping them for the cloud servers?

I currently have 10 cloud servers as nodes and would like to migrate to 3 robot servers (mostly because of storage space for Longhorn, and raw performance). Eventually I can disable network routes altogether because there won't be any cloud servers anymore, but until then I would like to have both cloud and robot servers working.

apricote (Member) commented:

but by setting that I disable all routes for all servers, correct?

Yes, that will stop any routes from being updated (though existing routes will not be cleaned up).

Are you sure that your setup requires the Routes?
