Calico cni plugin renders new node unusable #70
I tried to reproduce the problem by creating a new cluster, creating a new node, or by restarting an underlying VM, and so far I had no success. The logs show that the CNI plugin was installed, but since I couldn't reproduce and check it, I'm not sure whether it was indeed there or somehow got deleted afterwards. Any help to reproduce the problem would be appreciated.
We saw the problem on another cluster on one node. The install-cni container logs show that all binaries were installed.
However, if we look at the install-cni-check container, we see that the calico and calico-ipam binaries are missing.
They somehow get removed directly after copying. So far we couldn't conclude what in the install function is causing this.
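For illustration only (this is not necessarily how the install-cni-check container is implemented), a check along the following lines would detect the situation described above, where the binaries vanish after the copy. The /opt/cni/bin path and the binary names are assumptions based on the conventional CNI layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkCNIBinaries reports which of the expected Calico CNI binaries are
// missing from the host CNI directory; it returns nil when all are present.
func checkCNIBinaries(cniBinDir string, binaries []string) error {
	var missing []string
	for _, name := range binaries {
		if _, err := os.Stat(filepath.Join(cniBinDir, name)); err != nil {
			missing = append(missing, name)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing CNI binaries in %s: %v", cniBinDir, missing)
	}
	return nil
}

func main() {
	// /opt/cni/bin is the conventional host path mounted into the install container.
	if err := checkCNIBinaries("/opt/cni/bin", []string{"calico", "calico-ipam"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("all expected CNI binaries are present")
}
```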
A separate PR was opened by @ScheererJ to ease the installation of the Calico CNI plugins.
/reopen
I reopened because it's not yet clear whether #88 indeed fixes the problem. Two alternative approaches that could be considered as follow-ups (just mitigations to ease the lives of human operators):
/assign @ScheererJ @DockToFuture
Well, the way this might happen is the following:
Indeed, the Calico CNI copy logic is not fool-proof. However, in all occurrences of this issue we always saw the log output of the copy loop in the init container logs. Therefore, the assumption was that the copy loop executed and did not abort earlier.
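To sketch what a more fool-proof copy could look like (this is not the actual Calico install logic, and the paths below are assumptions), one option is to write each binary to a temporary file in the target directory, sync and size-check it, and only then rename it into place, so a partial or interrupted copy never shows up under the final name:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// installBinary copies src into dstDir atomically: the data is first written
// to a temporary file in the same directory, synced to disk and size-checked,
// and only then renamed to its final name.
func installBinary(src, dstDir, name string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	info, err := in.Stat()
	if err != nil {
		return err
	}

	tmp, err := os.CreateTemp(dstDir, name+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best effort; fails harmlessly after a successful rename

	written, err := io.Copy(tmp, in)
	if err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if written != info.Size() {
		return fmt.Errorf("short copy of %s: %d of %d bytes", name, written, info.Size())
	}
	if err := os.Chmod(tmp.Name(), 0o755); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dstDir, name))
}

func main() {
	// Hypothetical source and destination paths, purely for illustration.
	if err := installBinary("/calico/bin/calico", "/opt/cni/bin", "calico"); err != nil {
		fmt.Fprintln(os.Stderr, "install failed:", err)
		os.Exit(1)
	}
}
```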
While running integration tests, I see errors similar to this one on multiple shoots (both Azure and AWS):
All shoot pods are affected, and this seems to lead to the test eventually failing (in a different phase with a different message).
How to categorize this issue?
If multiple identifiers make sense, you can also state the commands multiple times.
/area networking
/kind bug
/priority normal
What happened:
This ticket originates from a Slack discussion.
When we provision a new cluster, sporadically the pods on a node are stuck in the state ContainerCreating. There are events saying "Pod sandbox changed, it will be killed and re-created." over and over. In the logs I can see the Calico CNI plugin getting installed:
But according to the Slack discussion, something must have removed it later, so that the error appears.
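Not part of the original report, but to help narrow down when the plugin disappears, a small poller like the following hypothetical sketch could be run on an affected node to log the moment the binary goes missing; the path and interval are assumptions:

```go
package main

import (
	"log"
	"os"
	"time"
)

// Poll for the calico CNI binary and log transitions between present and
// missing, to correlate the disappearance with other node events
// (kubelet restarts, DaemonSet rollouts, etc.).
func main() {
	const path = "/opt/cni/bin/calico" // assumed host path of the plugin
	present := false
	for {
		_, err := os.Stat(path)
		now := err == nil
		if now != present {
			if now {
				log.Printf("%s appeared", path)
			} else {
				log.Printf("%s went missing", path)
			}
			present = now
		}
		time.Sleep(5 * time.Second)
	}
}
```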
What you expected to happen:
A regular node being provisioned where pods can run.
How to reproduce it (as minimally and precisely as possible):
Unfortunately, I do not know how to reproduce this. It is sporadic.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): 1.17.14