Failed to communicate with NVIDIA driver. See more details in the logs #362

Open
d-kazantsev opened this issue Dec 4, 2024 · 5 comments · May be fixed by #363

I deployed hardware-observer on a node with an NVIDIA GPU, but the GPU is going to be used in PCI passthrough mode and attached directly to a VM. When I check the hardware-observer status I see this message: "Failed to communicate with NVIDIA driver. See more details in the logs"

The logs indicate an error when starting snap.dcgm.dcgm-exporter.service:

systemctl status snap.dcgm.dcgm-exporter.service
○ snap.dcgm.dcgm-exporter.service - Service for snap application dcgm.dcgm-exporter
Loaded: loaded (/etc/systemd/system/snap.dcgm.dcgm-exporter.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2024-12-02 23:09:03 UTC; 1 day 15h ago
Main PID: 2265007 (code=exited, status=0/SUCCESS)
CPU: 768ms

Dec 02 23:08:57 ps7-r1-n1 systemd[1]: Started Service for snap application dcgm.dcgm-exporter.
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Starting dcgm-exporter"
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Dec 02 23:09:03 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:09:03Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n>
Dec 02 23:09:03 ps7-r1-n1 systemd[1]: snap.dcgm.dcgm-exporter.service: Deactivated successfully.

The journalctl logs for dcgm-exporter are here: https://pastebin.canonical.com/p/wwWsrcZCtt/

Note that the NVIDIA driver is blacklisted at boot time to allow successful PCI passthrough.

snap list dcgm
Name  Version  Rev  Tracking       Publisher   Notes
dcgm  3.3.8    31   latest/stable  canonical✓  -

Ubuntu Jammy 22.04

*-display UNCLAIMED
description: 3D controller
product: GA100GL [A30 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:a1:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm cap_list
configuration: latency=0
resources: iomemory:af00-aeff iomemory:b000-afff memory:b0000000-b0ffffff memory:af000000000-af7ffffffff memory:b0010000000-b0011ffffff memory:b1000000-b11fffff memory:af800000000-affffffffff memory:b0000000000-b000fffffff


Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-974.

This message was autogenerated


aieri commented Dec 4, 2024

Hi @d-kazantsev, the current charm logic installs dcgm if an NVIDIA GPU is found via lshw (see def nvidia_gpu_verifier() -> Set[HWTool]). I'm thinking we could make the logic more stringent by adding a "...and the driver isn't blacklisted" clause.
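
A rough sketch of what that stricter verifier could look like (hypothetical code, not the actual charm implementation — the helper names are illustrative, and the real function returns a Set[HWTool] rather than a bool):

import glob
import re
import subprocess

def _nvidia_gpu_present() -> bool:
    # An NVIDIA GPU shows up as a display-class device in lshw output.
    out = subprocess.check_output(["lshw", "-c", "display"], text=True)
    return "NVIDIA" in out

def _nvidia_driver_blacklisted() -> bool:
    # Look for "blacklist nouveau" / "blacklist nvidia*" entries in the
    # modprobe configuration files.
    pattern = re.compile(r"^\s*blacklist\s+(nouveau|nvidia\S*)", re.MULTILINE)
    for conf in ["/etc/modprobe.conf", *glob.glob("/etc/modprobe.d/*.conf")]:
        try:
            with open(conf) as f:
                if pattern.search(f.read()):
                    return True
        except FileNotFoundError:
            continue
    return False

def nvidia_gpu_verifier() -> bool:
    # Only consider DCGM usable when a GPU is present *and* the driver
    # has not been blacklisted for passthrough.
    return _nvidia_gpu_present() and not _nvidia_driver_blacklisted()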

Could you please verify a couple of things:

  • lshw -c display still shows the vendor as NVIDIA
  • you blacklisted the driver by adding a blacklist nvidia* line to /etc/modprobe.conf or to a conffile under /etc/modprobe.d/

thanks

aieri added a commit to aieri/hardware-observer-operator that referenced this issue Dec 4, 2024
If the sysadmin wants to pass the gpu to a virtual instance via pci
passthrough, they will need to make the gpu unavailable to the host
system by blacklisting[0] the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking
DCGM as an available tool if both an NVIDIA gpu is detected *and* the
kernel module is not blacklisted.

Fixes: canonical#362

[0] https://wiki.debian.org/KernelModuleBlacklisting
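
For reference, the conffile-based blacklisting described in [0] would look something like this (the filename is just an example):

# /etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidia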

aieri commented Dec 4, 2024

@d-kazantsev I have created a draft PR with a possible fix, but I would like to make sure it solves your specific use case before completing and merging it. Please take a look.


d-kazantsev commented Dec 5, 2024

Hi Andrea, thanks for taking care of this bug. Answering your questions:

  1. Yes, lshw still returns the NVIDIA GPU:

sudo lshw -c display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:66:00.0
logical name: /dev/fb0
version: 52
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller bus_master cap_list fb
configuration: depth=32 driver=ast latency=0 resolution=1024,768
resources: irq:310 memory:ce000000-ceffffff memory:cf240000-cf27ffff ioport:7000(size=128)
*-display UNCLAIMED
description: 3D controller
product: GA100GL [A30 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:a1:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm cap_list
configuration: latency=0
resources: iomemory:af00-aeff iomemory:b000-afff memory:b0000000-b0ffffff memory:af000000000-af7ffffffff memory:b0010000000-b0011ffffff memory:b1000000-b11fffff memory:af800000000-affffffffff memory:b0000000000-b000fffffff

  2. I added the nvidia blacklist via kernel parameters at boot time, e.g.:
    BOOT_IMAGE=/boot/vmlinuz-6.8.0-48-generic root=/dev/mapper/vg0-lvroot ro console=tty0 console=ttyS0,115200n8 nvme_core.multipath=0 amd_iommu=on iommu=pt probe_vf=0 transparent_hugepage=never hugepagesz=1G hugepages=2000 default_hugepagesz=1G vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=nouveau,nvidiafb


aieri commented Dec 5, 2024

I see; then my current proposal would not be sufficient: if you're modifying the kernel parameters via the GRUB config, I also need to look into /proc/cmdline.
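
A minimal sketch of such a check, assuming we only need to parse the modprobe.blacklist= boot parameter (hypothetical helpers, not necessarily what the linked PR implements):

from typing import Set

def _cmdline_blacklisted_modules() -> Set[str]:
    # Collect module names from any "modprobe.blacklist=a,b,c" parameter
    # on the kernel command line.
    with open("/proc/cmdline") as f:
        params = f.read().split()
    blacklisted: Set[str] = set()
    for param in params:
        if param.startswith("modprobe.blacklist="):
            blacklisted.update(param.split("=", 1)[1].split(","))
    return blacklisted

def nvidia_blacklisted_on_cmdline() -> bool:
    # Matches nouveau as well as nvidia, nvidiafb, etc.
    return any(m.startswith(("nvidia", "nouveau"))
               for m in _cmdline_blacklisted_modules())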

aieri linked a pull request Dec 6, 2024 that will close this issue