
gfd gpu feature discovery appears to be broken with nvdp 0.14.1 #438

Closed
pwais opened this issue Sep 16, 2023 · 4 comments

Comments

@pwais

pwais commented Sep 16, 2023

1. Issue or feature description

I'm trying to use gpu-feature-discovery with nvdp, but it just won't work :(

2. Steps to reproduce the issue

I install nvdp with gfd enabled as described in the README in this repo:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug --reset-values \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true

But the pods fail with errors like:

W0916 07:32:42.481179       1 component.go:41] [core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "nvdp-node-feature-discovery-master:8080",
  "ServerName": "nvdp-node-feature-discovery-master:8080",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.xx.xxx.xxx:8080: connect: connection refused"
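
A quick way to check whether the NFD master service behind that address even exists and has endpoints (a sketch; the service name and namespace are taken from the error and the helm command above):

# does the auto-deployed NFD master service exist and have ready endpoints?
kubectl -n nvidia-device-plugin get svc nvdp-node-feature-discovery-master
kubectl -n nvidia-device-plugin get endpoints nvdp-node-feature-discovery-master
# are the NFD pods themselves running?
kubectl -n nvidia-device-plugin get pods | grep node-feature-discovery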

When I instead deploy without gfd:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug \
    --namespace nvidia-device-plugin \
    --create-namespace 

nvdp starts up fine (the pods show up) and the nodes get the GPU-count annotations, but of course none of the labels that gfd provides.

Prior to installing nvdp, I installed node-feature-discovery manually via helm upgrade -i nfd nfd/node-feature-discovery --namespace nfd --create-namespace from https://kubernetes-sigs.github.io/node-feature-discovery/charts.

@klueska
Contributor

klueska commented Sep 16, 2023

If you pre-deploy node-feature-discovery, then you will need to disable its auto-deployment via the device-plugin (otherwise it will likely cause conflicts):

i.e.

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug --reset-values \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set nfd.enabled=false

That said, the version of node-feature-discovery that is launched automatically by the device-plugin is the only one it is tested against, so pre-deploying your own may cause other unforeseen issues.
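
One way to spot such a clash (a sketch; it only assumes the release names used in this thread) is to list every node-feature-discovery pod across namespaces and confirm that only one master is running:

kubectl get pods -A | grep node-feature-discovery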

@pwais
Author

pwais commented Sep 16, 2023 via email

@elezar
Member

elezar commented Sep 18, 2023

@ArangoGutierrez as far as I was aware, the NodeFeature API was disabled by default. Do you have any ideas on what's going on here?

@pwais
Author

pwais commented Sep 20, 2023

Alright, so upon further debugging it looks like I had fallen into a relatively common networking/flannel issue with k8s 1.25 where pod IPs work but service IPs do not. The NVIDIA device plugin appears to set node labels (e.g. GPU count) without using a service, perhaps by writing to the kube API directly, but gpu-feature-discovery needs the NFD master service, which wasn't reachable.
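
For reference, a rough way to confirm the broken service path from inside the cluster (a sketch; the service name comes from the error above, and busybox's wget is only used as a TCP probe since the real endpoint speaks gRPC):

# "connection refused" here, while the pod IP connects, points at the service / kube-proxy layer
kubectl run nettest --rm -it --restart=Never --image=busybox -- \
    wget -q -O- -T 3 http://nvdp-node-feature-discovery-master.nvidia-device-plugin.svc:8080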

Using k8s 1.28.x and NOT using self-installed node-feature-discovery appears to work as expected.

So this works, WITHOUT doing a separate nfd/node-feature-discovery install (which still seems to break gpu-feature-discovery, but I'm happy not doing the separate install):

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true
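
To confirm gfd is actually labeling the nodes (a sketch; nvidia.com/... is the prefix gfd uses for its labels):

kubectl get nodes --show-labels | tr ',' '\n' | grep nvidia.com

Labels such as nvidia.com/gpu.product and nvidia.com/cuda.driver.major should now show up on the GPU nodes.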

pwais closed this as completed Sep 20, 2023