Calico cni plugin renders new node unusable #70

Open
lenhard opened this issue Feb 18, 2021 · 8 comments · Fixed by #88
Labels
area/networking Networking related kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage) priority/3 Priority (lower number equals higher priority)

Comments

@lenhard

lenhard commented Feb 18, 2021

How to categorize this issue?

If multiple identifiers make sense, you can also state the commands multiple times, e.g.

/area networking
/kind bug
/priority normal

What happened:
This ticket originates from a Slack discussion.

When we provision a new cluster, pods on a node sporadically get stuck in the ContainerCreating state. There are events saying "Pod sandbox changed, it will be killed and re-created." over and over.
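For reference, the stuck pods and the repeating sandbox events can be inspected with standard kubectl commands (the pod name and namespace below are placeholders):

kubectl get pods --all-namespaces | grep ContainerCreating
kubectl describe pod <pod-name> -n <namespace>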

In the logs I can see the calico CNI plugin getting installed:

time="2021-02-17T10:14:40Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-02-17T10:14:40Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-02-17T10:14:41Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-02-17T10:14:41Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-02-17T10:14:41Z" level=info msg="CNI plugin version: v3.17.1\n"
time="2021-02-17T10:14:41Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-02-17T10:14:41Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG environment variable." source="install.go:319"
time="2021-02-17T10:14:41Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-02-17T10:14:41Z" level=info msg="Done configuring CNI.  Sleep= false"

But according to the Slack discussion, something must have removed it again later, which would explain the error.
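One way to verify this on an affected node would be to check the CNI directories on the host directly (the paths below correspond to the /host-prefixed paths in the log above):

ls -l /opt/cni/bin/       # should contain calico and calico-ipam
ls -l /etc/cni/net.d/     # should contain 10-calico.conflist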

What you expected to happen:
A regularly provisioned node on which pods can run.

How to reproduce it (as minimally and precisely as possible):
Unfortunately, I do not know how to reproduce this. It is sporadic.

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version): 1.17.14
  • Cloud provider or hardware configuration: Azure
  • Others:
@DockToFuture
Member

I tried to reproduce the problem by creating a new cluster, creating a new node, and restarting an underlying VM, and so far I have had no success. The logs show that the CNI plugin was installed, but since I could not reproduce the issue and check, I am not sure whether the plugin was indeed there or was somehow deleted afterwards.

Any help to reproduce the problem would be appreciated.

The install-cni init container is now also run in privileged mode: #68

@DockToFuture
Member

DockToFuture commented May 6, 2021

We saw the problem on another cluster on one node. The install-cni container logs show that all binaries were installed.

g ks logs calico-node-wb6fh -c install-cni
time="2021-05-06T05:06:07Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-05-06T05:06:07Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-05-06T05:06:08Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-05-06T05:06:08Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-05-06T05:06:08Z" level=warning msg="Failed getting CNI plugin version" error="fork/exec /host/opt/cni/bin/calico: no such file or directory"
time="2021-05-06T05:06:08Z" level=info msg="CNI plugin version: "
time="2021-05-06T05:06:08Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-05-06T05:06:08Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG environment variable." source="install.go:319"
time="2021-05-06T05:06:08Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-05-06T05:06:08Z" level=info msg="Done configuring CNI.  Sleep= false"

However, if we look at the install-cni-check container, we see that the calico and calico-ipam binaries are missing.

drwxr-xr-x 2 0 0     4096 May  6 05:06 .
drwxr-xr-x 3 0 0     4096 May  6 05:06 ..
-rwxr-xr-x 1 0 0  4159518 May  6 05:06 bandwidth
-rwxr-xr-x 1 0 0  3069556 May  6 05:06 flannel
-rwxr-xr-x 1 0 0  3614480 May  6 05:06 host-local
-rwxr-xr-x 1 0 0 36564992 May  6 05:06 install
-rwxr-xr-x 1 0 0  3209463 May  6 05:06 loopback
-rwxr-xr-x 1 0 0  3939867 May  6 05:06 portmap
-rwxr-xr-x 1 0 0  3356587 May  6 05:06 tuning

They somehow get removed directly after being copied. So far we could not determine what in the install function is causing this.

@DockToFuture
Member

A separate PR was opened by @ScheererJ to ease the installation of the calico CNI plugins.
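For illustration only (this is not necessarily what the PR does): a common way to make such an installation step more robust is to copy each binary under a temporary name and then rename it, since a rename within the same filesystem is atomic and readers never observe a partially written file:

# Hypothetical sketch of an atomic binary install; paths match the logs above.
set -euo pipefail
src=/opt/cni/bin
dst=/host/opt/cni/bin
for bin in calico calico-ipam; do
  cp "$src/$bin" "$dst/.$bin.tmp"    # write under a temporary name first
  chmod 755 "$dst/.$bin.tmp"
  mv "$dst/.$bin.tmp" "$dst/$bin"    # atomic rename into the final name
done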

@rfranzke
Member

rfranzke commented Jun 7, 2021

/reopen

@gardener-robot gardener-robot reopened this Jun 7, 2021
@rfranzke
Member

rfranzke commented Jun 7, 2021

I reopened because it's not yet clear whether #88 indeed fixes the problem.

Two alternative approaches could be considered as follow-ups (just mitigations to ease the lives of human operators):

  1. Under the assumption that the issue only occurs during the initialization phase of calico (i.e., only directly after the install-cni init container ran): enhance the install-cni init container with a check whether the binaries exist on the host. This would make the kubelet restart the container automatically if the binaries are gone for whatever reason (see the sketch after this list).
  2. Otherwise, i.e., if the issue can also occur during runtime: add another sidecar container to the calico-node pod which regularly checks whether the binaries still exist. If not, it could delete its own pod, forcing the install-cni init container to run again.
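
A minimal sketch of the check from approach 1, assuming the usual mount of the host bin directory at /host/opt/cni/bin (the binary list and paths are illustrative):

# Hypothetical post-install check appended to the install-cni init container.
for bin in calico calico-ipam; do
  if [ ! -x "/host/opt/cni/bin/$bin" ]; then
    echo "CNI binary $bin missing after install" >&2
    exit 1   # a non-zero exit makes the kubelet restart the init container
  fi
done

Approach 2 would wrap the same check in a loop (e.g. while true; do ...; sleep 60; done) inside a sidecar container that deletes its own pod when a binary is missing.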

/assign @ScheererJ @DockToFuture

@marwinski
Contributor

Well, the way this might happen is the following:

@ScheererJ
Member

Well, the way this might happen is the following:

Indeed, the calico CNI copy logic is not fool-proof. However, in all occurrences of this issue we always saw the log output of the copy loop in the init container logs. Therefore, the assumption was that the copy loop executed completely and did not abort early.

@stoyanr
Contributor

stoyanr commented Sep 8, 2021

While running integration tests, I see errors similar to this one on multiple shoots (both Azure and AWS):

{"level":"info","msg":"At 2021-09-08 05:22:14 +0000 UTC - event for dashboard-metrics-scraper-5c8446f57c-n8plt: {kubelet ip-10-250-0-82.eu-west-1.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container \"46633225d2eee66d31735c451a927ecee9b926bd8371584a6e6f6f17f5683c52\" network for pod \"dashboard-metrics-scraper-5c8446f57c-n8plt\": networkPlugin cni failed to set up pod \"dashboard-metrics-scraper-5c8446f57c-n8plt_kubernetes-dashboard\" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/","time":"2021-09-08T06:06:23Z"}

All shoot pods are affected, and this seems to lead to the test eventually failing (in a different phase with a different message). The error /var/lib/calico/nodename: no such file or directory looks like it might be related to the one described in this ticket.
The calico extension version is v1.19.0.
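
The nodename file is written by the calico-node container on startup, so a quick check on an affected node and cluster would be (the label selector is an assumption based on the default calico manifests):

cat /var/lib/calico/nodename                                    # on the node; should contain the node name
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide  # is calico-node running on that node?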

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Mar 8, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 4, 2022