Server is not restarted after restarting kubelet #79
Comments
What K8s version are you using? Can you provide the device plugin logs? |
Hi @adrianchiris, Q: does the following path exist on your system:
Q: can you provide the device plugin logs?
Then I manually restarted kubelet.
The logs are as follows. The server is not restarted, and
|
Ack, so it will use the old way to register with kubelet and write resource sockets. Please check #82; it should solve the issue. |
Hi @adrianchiris, thank you very much for helping fix this issue. The PR is merged; could we have a new release? |
v1.4.0 is out, please check :) |
Hi, we deploy k8s-rdma-shared-dev-plugin (artprod.dev.bloomberg.com/ds/yweng14/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.3.2) on our clusters and find that the socket (ib.sock) is not recreated after kubelet is restarted. We printed some debugging messages in the Restart() function and found that it is blocked at rs.stop <- true
In our log we also see
I think it is blocking here
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/master/pkg/resources/server.go#L294-L299
When kubelet is restarted, the context is closed. ListAndWatch then prints a message and returns nil, so nothing is left to receive from the stop channel and the send in Restart() blocks.
After we removed the context check (lines 294-299), the restart was no longer blocked and ib.sock was recreated. We also checked how the NVIDIA GPU device plugin implements ListAndWatch; it does not check context.Done().
I think issue #74 may also be related to this issue.