
gfd gpu feature discovery appears to be broken with nvdp 0.14.1 #438

Closed
pwais opened this issue Sep 16, 2023 · 4 comments

Comments

@pwais

pwais commented Sep 16, 2023

1. Issue or feature description

I'm trying to use gpu-feature-discovery with nvdp, but it just won't work :(

2. Steps to reproduce the issue

I install nvdp with gfd enabled as described in the README in this repo:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug --reset-values \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true

But the pods fail with errors like:

W0916 07:32:42.481179       1 component.go:41] [core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "nvdp-node-feature-discovery-master:8080",
  "ServerName": "nvdp-node-feature-discovery-master:8080",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.xx.xxx.xxx:8080: connect: connection refused"
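
A quick way to check whether the NFD master service behind that address even exists and has endpoints (a sketch; the service name and namespace are taken from the error and the helm command above):

# does the auto-deployed NFD master service exist and have ready endpoints?
kubectl -n nvidia-device-plugin get svc nvdp-node-feature-discovery-master
kubectl -n nvidia-device-plugin get endpoints nvdp-node-feature-discovery-master
# are the NFD pods themselves running?
kubectl -n nvidia-device-plugin get pods | grep node-feature-discovery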

When I instead deploy without gfd:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug \
    --namespace nvidia-device-plugin \
    --create-namespace 

nvdp starts up fine (the pods show up) and the nodes get the GPU-count annotations, but of course none of the labels that gfd provides.

Prior to installing nvdp, I installed node-feature-discovery manually via helm upgrade -i nfd nfd/node-feature-discovery --namespace nfd --create-namespace from https://kubernetes-sigs.github.io/node-feature-discovery/charts.

@klueska
Contributor

klueska commented Sep 16, 2023

If you pre-deploy node-feature-discovery, then you will need to disable its auto-deployment via the device-plugin (otherwise it will likely cause conflicts):

i.e.

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug --reset-values \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set nfd.enabled=false

That said, the version of node-feature-discovery that is launched automatically by the device-plugin is the only one it is tested against, so pre-deploying your own may cause other unforeseen issues.
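
One way to spot such a clash (a sketch; it only assumes the release names used in this thread) is to list every node-feature-discovery pod across namespaces and confirm that only one master is running:

kubectl get pods -A | grep node-feature-discovery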

@pwais
Author

pwais commented Sep 16, 2023 via email

@elezar
Member

elezar commented Sep 18, 2023

@ArangoGutierrez as far as I was aware, the NodeFeature API was disabled by default. Do you have any ideas on what's going on here?

@pwais
Author

pwais commented Sep 20, 2023

Alright, so upon further debugging it looks like I had fallen into a relatively common networking/flannel issue with k8s 1.25 where pod IPs work but service IPs do not. The NVIDIA device plugin appears to set node labels (e.g. GPU count) without using a service, perhaps by writing to the kube API directly, but gpu-feature-discovery needs the NFD master service, which wasn't reachable.
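
For reference, a rough way to confirm the broken service path from inside the cluster (a sketch; the service name comes from the error above, and busybox's wget is only used as a TCP probe since the real endpoint speaks gRPC):

# "connection refused" here, while the pod IP connects, points at the service / kube-proxy layer
kubectl run nettest --rm -it --restart=Never --image=busybox -- \
    wget -q -O- -T 3 http://nvdp-node-feature-discovery-master.nvidia-device-plugin.svc:8080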

Using k8s 1.28.x and NOT using self-installed node-feature-discovery appears to work as expected.

So this works, WITHOUT doing a separate nfd/node-feature-discovery install (which still seems to break gpu-feature-discovery, but I'm happy not doing the separate install):

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.14.1 \
    --debug \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true
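
To confirm gfd is actually labeling the nodes (a sketch; nvidia.com/... is the prefix gfd uses for its labels):

kubectl get nodes --show-labels | tr ',' '\n' | grep nvidia.com

Labels such as nvidia.com/gpu.product and nvidia.com/cuda.driver.major should now show up on the GPU nodes.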

pwais closed this as completed Sep 20, 2023