Calico cni plugin renders new node unusable #70

Open
lenhard opened this issue Feb 18, 2021 · 8 comments · Fixed by #88
Labels
area/networking Networking related kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage) priority/3 Priority (lower number equals higher priority)

Comments

@lenhard

lenhard commented Feb 18, 2021

How to categorize this issue?

If multiple identifiers make sense, you can also state the commands multiple times, e.g.

/area networking
/kind bug
/priority normal

What happened:
This ticket originates from a Slack discussion.

When we provision a new cluster, pods on a node sporadically get stuck in the ContainerCreating state. There are events saying "Pod sandbox changed, it will be killed and re-created." over and over.
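For reference, the stuck pods and the repeating sandbox events can be inspected with standard kubectl commands (the pod name and namespace below are placeholders):

kubectl get pods --all-namespaces | grep ContainerCreating
kubectl describe pod <pod-name> -n <namespace>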

In the logs I can see the calico CNI plugin getting installed:

time="2021-02-17T10:14:40Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-02-17T10:14:40Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-02-17T10:14:41Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-02-17T10:14:41Z" level=info msg="CNI plugin version: v3.17.1\n"
time="2021-02-17T10:14:41Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-02-17T10:14:41Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG environment variable." source="install.go:319"
time="2021-02-17T10:14:41Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-02-17T10:14:41Z" level=info msg="Done configuring CNI.  Sleep= false"

But according to the Slack discussion, something must have removed it again later, which would explain the error.
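One way to verify this on an affected node would be to check the CNI directories on the host directly (the paths below correspond to the /host-prefixed paths in the log above):

ls -l /opt/cni/bin/       # should contain calico and calico-ipam
ls -l /etc/cni/net.d/     # should contain 10-calico.conflist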

What you expected to happen:
A regularly provisioned node on which pods can run.

How to reproduce it (as minimally and precisely as possible):
Unfortunately, I do not know how to reproduce this. It is sporadic.

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version): 1.17.14
  • Cloud provider or hardware configuration: Azure
  • Others:
@DockToFuture
Member

I tried to reproduce the problem by creating a new cluster, creating a new node, and restarting an underlying VM, and so far I have had no success. The logs show that the CNI plugin was installed, but since I could not reproduce the issue and check, I am not sure whether the plugin was indeed there or was somehow deleted afterwards.

Any help to reproduce the problem would be appreciated.

The install-cni init container is now also run in privileged mode: #68

@DockToFuture
Member

DockToFuture commented May 6, 2021

We saw the problem on another cluster on one node. The install-cni container logs show that all binaries were installed.

g ks logs calico-node-wb6fh -c install-cni
time="2021-05-06T05:06:07Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-05-06T05:06:08Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-05-06T05:06:08Z" level=warning msg="Failed getting CNI plugin version" error="fork/exec /host/opt/cni/bin/calico: no such file or directory"
time="2021-05-06T05:06:08Z" level=info msg="CNI plugin version: "
time="2021-05-06T05:06:08Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-05-06T05:06:08Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG environment variable." source="install.go:319"
time="2021-05-06T05:06:08Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-05-06T05:06:08Z" level=info msg="Done configuring CNI.  Sleep= false"

However, if we look at the install-cni-check container, we see that the calico and calico-ipam binaries are missing.

drwxr-xr-x 2 0 0     4096 May  6 05:06 .
drwxr-xr-x 3 0 0     4096 May  6 05:06 ..
-rwxr-xr-x 1 0 0  4159518 May  6 05:06 bandwidth
-rwxr-xr-x 1 0 0  3069556 May  6 05:06 flannel
-rwxr-xr-x 1 0 0  3614480 May  6 05:06 host-local
-rwxr-xr-x 1 0 0 36564992 May  6 05:06 install
-rwxr-xr-x 1 0 0  3209463 May  6 05:06 loopback
-rwxr-xr-x 1 0 0  3939867 May  6 05:06 portmap
-rwxr-xr-x 1 0 0  3356587 May  6 05:06 tuning

They somehow get removed directly after being copied. So far we could not determine what in the install function is causing this.

@DockToFuture
Member

A separate PR was opened by @ScheererJ to ease the installation of the calico CNI plugins.
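For illustration only (this is not necessarily what the PR does): a common way to make such an installation step more robust is to copy each binary under a temporary name and then rename it, since a rename within the same filesystem is atomic and readers never observe a partially written file:

# Hypothetical sketch of an atomic binary install; paths match the logs above.
set -euo pipefail
src=/opt/cni/bin
dst=/host/opt/cni/bin
for bin in calico calico-ipam; do
  cp "$src/$bin" "$dst/.$bin.tmp"    # write under a temporary name first
  chmod 755 "$dst/.$bin.tmp"
  mv "$dst/.$bin.tmp" "$dst/$bin"    # atomic rename into the final name
done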

@rfranzke
Member

rfranzke commented Jun 7, 2021

/reopen

@gardener-robot gardener-robot reopened this Jun 7, 2021
@rfranzke
Member

rfranzke commented Jun 7, 2021

I reopened because it's not yet clear whether #88 indeed fixes the problem.

Two alternative approaches could be considered as follow-ups (just mitigations to ease the lives of human operators):

  1. Under the assumption that the issue only occurs during the initialization phase of calico (i.e., only directly after the install-cni init container ran): enhance the install-cni init container with a check whether the binaries exist on the host. This would make the kubelet restart the container automatically if the binaries are gone for whatever reason (see the sketch after this list).
  2. Otherwise, i.e., if the issue can also occur during runtime: add another sidecar container to the calico-node pod which regularly checks whether the binaries still exist. If not, it could delete its own pod, forcing the install-cni init container to run again.
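
A minimal sketch of the check from approach 1, assuming the usual mount of the host bin directory at /host/opt/cni/bin (the binary list and paths are illustrative):

# Hypothetical post-install check appended to the install-cni init container.
for bin in calico calico-ipam; do
  if [ ! -x "/host/opt/cni/bin/$bin" ]; then
    echo "CNI binary $bin missing after install" >&2
    exit 1   # a non-zero exit makes the kubelet restart the init container
  fi
done

Approach 2 would wrap the same check in a loop (e.g. while true; do ...; sleep 60; done) inside a sidecar container that deletes its own pod when a binary is missing.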

/assign @ScheererJ @DockToFuture

@marwinski
Contributor

Well, the way this might happen is the following:

@ScheererJ
Member

Well, the way this might happen is the following:

Indeed, the calico CNI copy logic is not fool-proof. However, in all occurrences of this issue we always saw the log output of the copy loop in the init container logs. Therefore, the assumption was that the copy loop executed completely and did not abort early.

@stoyanr
Contributor

stoyanr commented Sep 8, 2021

While running integration tests, I see errors similar to this one on multiple shoots (both Azure and AWS):

{"level":"info","msg":"At 2021-09-08 05:22:14 +0000 UTC - event for dashboard-metrics-scraper-5c8446f57c-n8plt: {kubelet ip-10-250-0-82.eu-west-1.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container \"46633225d2eee66d31735c451a927ecee9b926bd8371584a6e6f6f17f5683c52\" network for pod \"dashboard-metrics-scraper-5c8446f57c-n8plt\": networkPlugin cni failed to set up pod \"dashboard-metrics-scraper-5c8446f57c-n8plt_kubernetes-dashboard\" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/","time":"2021-09-08T06:06:23Z"}

All shoot pods are affected, and this seems to lead to the test eventually failing (in a different phase with a different message). The error /var/lib/calico/nodename: no such file or directory looks like it might be related to the one described in this ticket.
The calico extension version is v1.19.0.
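
The nodename file is written by the calico-node container on startup, so a quick check on an affected node and cluster would be (the label selector is an assumption based on the default calico manifests):

cat /var/lib/calico/nodename                                    # on the node; should contain the node name
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide  # is calico-node running on that node?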

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Mar 8, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 4, 2022