gfd gpu feature discovery appears to be broken with nvdp 0.14.1 #438
Thanks for the quick response! Yes, I also tried with nfd.enabled=false, and
then nothing gets created; no pods, nothing. I also tried removing nfd and
only using nvdp without enabled=false, and I get a similar connection
error as noted previously.
On Sat, Sep 16, 2023 at 02:51, Kevin Klues wrote:
If you pre-deploy node-feature-discovery, then you will need to disable
its auto-deployment via the device-plugin (otherwise it will likely cause
conflicts):
i.e.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.14.1 \
--debug --reset-values \
--namespace nvidia-device-plugin \
--create-namespace \
--set gfd.enabled=true \
--set nfd.enabled=false
That said, the version of node-feature-discovery that is launched
automatically by the device-plugin is the only one it is tested against,
so pre-deploying your own may cause other unforeseen issues.
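The two `--set` flags in the quoted command can also be kept in a values file, which is easier to reuse across repeated `helm upgrade` runs. A minimal sketch, mirroring only the toggles from the command above (the `gfd.enabled` / `nfd.enabled` keys come straight from it; everything else in the chart is left at defaults):

```yaml
# values.yaml -- toggles for the nvidia-device-plugin chart,
# mirroring the --set flags in the quoted helm command
gfd:
  enabled: true    # deploy gpu-feature-discovery alongside the device plugin
nfd:
  enabled: false   # skip the bundled node-feature-discovery (only if you pre-deploy your own)
```

Usage would then be `helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.14.1 -f values.yaml ...` with the same namespace flags as above.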
@ArangoGutierrez, as far as I was aware, the NodeFeature API was disabled by default. Do you have any ideas on what's going on here?
Alright, so upon further debugging, it looks like I had fallen into a relatively common networking/flannel issue with k8s 1.25 where pod IPs work but service IPs do not. The NVIDIA device plugin appears to set up node labels (e.g. GPU count) without using a service, perhaps by talking to the kube API directly, but GPU feature discovery needs a working service, which this cluster didn't have. Using k8s 1.28.x and NOT using a self-installed nfd made this work, without doing a separate install:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.14.1 \
--debug \
--namespace nvidia-device-plugin \
--create-namespace \
--set gfd.enabled=true
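To distinguish pod-IP from service-IP breakage like the above, a quick in-cluster connectivity check helps. This is a sketch under assumptions: the namespace, port, and the `<POD_IP>`/`<CLUSTER_IP>` placeholders are hypothetical and must be filled in from your own cluster (check with `kubectl get svc -A` first).

```shell
# List pod IPs and the service ClusterIP (namespace is an assumption; adjust to yours)
kubectl -n node-feature-discovery get pods -o wide   # note a pod IP
kubectl -n node-feature-discovery get svc            # note the ClusterIP

# From a throwaway pod, probe both addresses. With broken service routing
# (flannel/kube-proxy issues), the pod IP typically answers while the
# ClusterIP times out. Port and IPs are placeholders, not chart defaults.
kubectl run netcheck --rm -it --image=busybox:1.36 --restart=Never -- \
  sh -c 'nc -zv -w 3 <POD_IP> <PORT>; nc -zv -w 3 <CLUSTER_IP> <PORT>'
```

If the first probe succeeds and the second fails, the problem is in the service layer, not in the pods themselves, which matches the symptom described above.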
1. Issue or feature description
I'm trying to use gpu-feature-discovery with nvdp. It just won't work :(

2. Steps to reproduce the issue
I install nvdp with gfd enabled as described in the README in this repo:
But the pods fail with errors like:
When I instead deploy without gfd:
nvdp starts up fine (pods show up) and I get number of GPU annotations. But of course none of the labels that gfd provides.
Prior to installing nvdp I installed node discovery manually via
helm upgrade -i nfd nfd/node-feature-discovery --namespace nfd --create-namespace
from https://kubernetes-sigs.github.io/node-feature-discovery/charts