Failed to communicate with NVIDIA driver. See more details in the logs #362

Open
d-kazantsev opened this issue Dec 4, 2024 · 5 comments · May be fixed by #363

I deployed hardware-observer on a node with an NVIDIA GPU, but the GPU is going to be used in PCI passthrough mode and attached directly to a VM. When I check the hardware-observer status I see this message: "Failed to communicate with NVIDIA driver. See more details in the logs"

The logs indicate an error when starting snap.dcgm.dcgm-exporter.service:

systemctl status snap.dcgm.dcgm-exporter.service
○ snap.dcgm.dcgm-exporter.service - Service for snap application dcgm.dcgm-exporter
Loaded: loaded (/etc/systemd/system/snap.dcgm.dcgm-exporter.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2024-12-02 23:09:03 UTC; 1 day 15h ago
Main PID: 2265007 (code=exited, status=0/SUCCESS)
CPU: 768ms

Dec 02 23:08:57 ps7-r1-n1 systemd[1]: Started Service for snap application dcgm.dcgm-exporter.
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Starting dcgm-exporter"
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Dec 02 23:09:03 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:09:03Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n>
Dec 02 23:09:03 ps7-r1-n1 systemd[1]: snap.dcgm.dcgm-exporter.service: Deactivated successfully.

The journalctl logs for dcgm-exporter are here: https://pastebin.canonical.com/p/wwWsrcZCtt/

Note that the NVIDIA driver is blacklisted at boot time to allow successful PCI passthrough.

snap list dcgm
Name  Version  Rev  Tracking       Publisher   Notes
dcgm  3.3.8    31   latest/stable  canonical✓  -

Ubuntu Jammy 22.04

*-display UNCLAIMED
description: 3D controller
product: GA100GL [A30 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:a1:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm cap_list
configuration: latency=0
resources: iomemory:af00-aeff iomemory:b000-afff memory:b0000000-b0ffffff memory:af000000000-af7ffffffff memory:b0010000000-b0011ffffff memory:b1000000-b11fffff memory:af800000000-affffffffff memory:b0000000000-b000fffffff


Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-974.

This message was autogenerated


aieri commented Dec 4, 2024

Hi @d-kazantsev, the current charm logic installs dcgm if an NVIDIA GPU is found via lshw (see def nvidia_gpu_verifier() -> Set[HWTool]). I'm thinking we could make the logic more stringent by adding a "...and the driver isn't blacklisted" clause.
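
A rough sketch of what that stricter verifier could look like (hypothetical code, not the actual charm implementation — the helper names are illustrative, and the real function returns a Set[HWTool] rather than a bool):

import glob
import re
import subprocess

def _nvidia_gpu_present() -> bool:
    # An NVIDIA GPU shows up as a display-class device in lshw output.
    out = subprocess.check_output(["lshw", "-c", "display"], text=True)
    return "NVIDIA" in out

def _nvidia_driver_blacklisted() -> bool:
    # Look for "blacklist nouveau" / "blacklist nvidia*" entries in the
    # modprobe configuration files.
    pattern = re.compile(r"^\s*blacklist\s+(nouveau|nvidia\S*)", re.MULTILINE)
    for conf in ["/etc/modprobe.conf", *glob.glob("/etc/modprobe.d/*.conf")]:
        try:
            with open(conf) as f:
                if pattern.search(f.read()):
                    return True
        except FileNotFoundError:
            continue
    return False

def nvidia_gpu_verifier() -> bool:
    # Only consider DCGM usable when a GPU is present *and* the driver
    # has not been blacklisted for passthrough.
    return _nvidia_gpu_present() and not _nvidia_driver_blacklisted()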

Could you please verify a couple of things:

  • lshw -c display still shows the vendor as NVIDIA
  • you blacklisted the driver by adding a blacklist nvidia* line to /etc/modprobe.conf or to a conffile under /etc/modprobe.d/

thanks

aieri added a commit to aieri/hardware-observer-operator that referenced this issue Dec 4, 2024
If the sysadmin wants to pass the gpu to a virtual instance via pci
passthrough, they will need to make the gpu unavailable to the host
system by blacklisting[0] the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking
DCGM as an available tool if both an NVIDIA gpu is detected *and* the
kernel module is not blacklisted.

Fixes: canonical#362

[0] https://wiki.debian.org/KernelModuleBlacklisting
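
For reference, the conffile-based blacklisting described in [0] would look something like this (the filename is just an example):

# /etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidia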

aieri commented Dec 4, 2024

@d-kazantsev I have created a draft PR with a possible fix, but I would like to make sure it solves your specific use case before completing and merging it. Please take a look.


d-kazantsev commented Dec 5, 2024

Hi Andrea, thanks for taking care of this bug. Answering your questions:

  1. Yes, lshw still returns the NVIDIA GPU:

sudo lshw -c display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:66:00.0
logical name: /dev/fb0
version: 52
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller bus_master cap_list fb
configuration: depth=32 driver=ast latency=0 resolution=1024,768
resources: irq:310 memory:ce000000-ceffffff memory:cf240000-cf27ffff ioport:7000(size=128)
*-display UNCLAIMED
description: 3D controller
product: GA100GL [A30 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:a1:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm cap_list
configuration: latency=0
resources: iomemory:af00-aeff iomemory:b000-afff memory:b0000000-b0ffffff memory:af000000000-af7ffffffff memory:b0010000000-b0011ffffff memory:b1000000-b11fffff memory:af800000000-affffffffff memory:b0000000000-b000fffffff

  2. I added the nvidia blacklist via kernel parameters at boot time, e.g.:
    BOOT_IMAGE=/boot/vmlinuz-6.8.0-48-generic root=/dev/mapper/vg0-lvroot ro console=tty0 console=ttyS0,115200n8 nvme_core.multipath=0 amd_iommu=on iommu=pt probe_vf=0 transparent_hugepage=never hugepagesz=1G hugepages=2000 default_hugepagesz=1G vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=nouveau,nvidiafb


aieri commented Dec 5, 2024

I see; then my current proposal would not be sufficient: if you're modifying the kernel parameters via the GRUB config, I also need to look into /proc/cmdline.
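
A minimal sketch of such a check, assuming we only need to parse the modprobe.blacklist= boot parameter (hypothetical helpers, not necessarily what the linked PR implements):

from typing import Set

def _cmdline_blacklisted_modules() -> Set[str]:
    # Collect module names from any "modprobe.blacklist=a,b,c" parameter
    # on the kernel command line.
    with open("/proc/cmdline") as f:
        params = f.read().split()
    blacklisted: Set[str] = set()
    for param in params:
        if param.startswith("modprobe.blacklist="):
            blacklisted.update(param.split("=", 1)[1].split(","))
    return blacklisted

def nvidia_blacklisted_on_cmdline() -> bool:
    # Matches nouveau as well as nvidia, nvidiafb, etc.
    return any(m.startswith(("nvidia", "nouveau"))
               for m in _cmdline_blacklisted_modules())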

aieri linked a pull request Dec 6, 2024 that will close this issue