Failed to communicate with NVIDIA driver. See more details in the logs #362
Comments
Thank you for your feedback! An internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-974.
Hi @d-kazantsev, the current charm logic installs DCGM if an NVIDIA GPU is found via lshw (see hardware-observer-operator/src/hw_tools.py, line 671 in 174389c).
Could you please verify a couple of things:
Thanks
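The detection step described in this comment could be sketched roughly as follows. This is not the charm's actual code: the function name `has_nvidia_gpu` is hypothetical, it assumes the JSON form of the lshw output (`lshw -c display -json`) with a `vendor` field on each device, and the sample string is hand-written from the `lshw -c display` text output quoted in this issue rather than captured from a real machine.

```python
import json


def has_nvidia_gpu(lshw_json: str) -> bool:
    """Return True if any display device in the lshw JSON output is from NVIDIA.

    Hypothetical helper for illustration; not taken from hw_tools.py.
    """
    data = json.loads(lshw_json)
    # Assumption: depending on the lshw version, the output may be a single
    # object or a list of objects, so both shapes are accepted here.
    devices = data if isinstance(data, list) else [data]
    return any("nvidia" in str(dev.get("vendor", "")).lower() for dev in devices)


# Hand-written sample modelled on the device reported in this issue.
sample = (
    '[{"id": "display", "class": "display", "claimed": false, '
    '"vendor": "NVIDIA Corporation", "product": "GA100GL [A30 PCIe]"}]'
)
print(has_nvidia_gpu(sample))
```

Note that a check like this only looks at the vendor string, so it reports a GPU even when no host driver is bound to it, which is the situation described in this issue.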
If the sysadmin wants to pass the gpu to a virtual instance via pci passthrough, they will need to make the gpu unavailable to the host system by blacklisting[0] the kernel driver. On such a system DCGM would not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking DCGM as an available tool if both an NVIDIA gpu is detected *and* the kernel module is not blacklisted.

Fixes: canonical#362

[0] https://wiki.debian.org/KernelModuleBlacklisting
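A minimal sketch of the blacklist check this commit message describes, under stated assumptions: it is not the code from the PR, the helper names are hypothetical, the module name `nvidia` is assumed, and it only looks for `blacklist <module>` lines in the usual modprobe.d directories.

```python
from pathlib import Path

# Assumption: the standard modprobe.d search locations.
MODPROBE_DIRS = ("/etc/modprobe.d", "/run/modprobe.d", "/usr/lib/modprobe.d", "/lib/modprobe.d")


def blacklisted_modules(conf_text: str) -> set:
    """Collect module names from 'blacklist <module>' lines of a modprobe.d file."""
    modules = set()
    for line in conf_text.splitlines():
        # Drop comments, then split the remaining directive into words.
        parts = line.split("#", 1)[0].split()
        if len(parts) == 2 and parts[0] == "blacklist":
            # modprobe treats '-' and '_' in module names as equivalent.
            modules.add(parts[1].replace("-", "_"))
    return modules


def is_module_blacklisted(module: str, conf_dirs=MODPROBE_DIRS) -> bool:
    """Return True if any *.conf file in conf_dirs blacklists the module."""
    wanted = module.replace("-", "_")
    for directory in conf_dirs:
        path = Path(directory)
        if not path.is_dir():
            continue
        for conf in sorted(path.glob("*.conf")):
            try:
                text = conf.read_text(errors="replace")
            except OSError:
                continue  # unreadable file: skip it rather than fail
            if wanted in blacklisted_modules(text):
                return True
    return False


if __name__ == "__main__":
    # Assumed module name; the commit message does not name it.
    print(is_module_blacklisted("nvidia"))
```

With helpers like these, DCGM would be marked as available only when a GPU is detected and `is_module_blacklisted("nvidia")` returns False.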
@d-kazantsev I have created a draft PR with a possible fix, but I would like to make sure it solves your specific use case before completing and merging it. Please take a look.
Hi Andrea, thanks for taking care of this bug. Answering your questions:
sudo lshw -c display
I see. In that case my current proposal would not be sufficient: if you're modifying the kernel parameters via the GRUB config, I also need to look into
I deployed hardware-observer on a node with an NVIDIA GPU that is going to be used in PCI passthrough mode, attached directly to a VM. When I check the hardware-observer status I see this message: "Failed to communicate with NVIDIA driver. See more details in the logs"
The logs show an error when snap.dcgm.dcgm-exporter.service starts:
systemctl status snap.dcgm.dcgm-exporter.service
○ snap.dcgm.dcgm-exporter.service - Service for snap application dcgm.dcgm-exporter
Loaded: loaded (/etc/systemd/system/snap.dcgm.dcgm-exporter.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2024-12-02 23:09:03 UTC; 1 day 15h ago
Main PID: 2265007 (code=exited, status=0/SUCCESS)
CPU: 768ms
Dec 02 23:08:57 ps7-r1-n1 systemd[1]: Started Service for snap application dcgm.dcgm-exporter.
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Starting dcgm-exporter"
Dec 02 23:08:58 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:08:58Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Dec 02 23:09:03 ps7-r1-n1 dcgm.dcgm-exporter[2265007]: time="2024-12-02T23:09:03Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n>
Dec 02 23:09:03 ps7-r1-n1 systemd[1]: snap.dcgm.dcgm-exporter.service: Deactivated successfully.
The journalctl logs for dcgm-exporter are here: https://pastebin.canonical.com/p/wwWsrcZCtt/
Note that the NVIDIA driver is blacklisted at boot time to allow successful PCI passthrough.
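For reference, a boot-time blacklist of this kind could be detected along the following lines. This is a sketch only: it assumes the blacklist is passed on the kernel command line through the `modprobe.blacklist=` or `module_blacklist=` parameters, the function name is hypothetical, and the sample command line is made up for illustration rather than taken from the affected node.

```python
from pathlib import Path


def cmdline_blacklisted_modules(cmdline: str) -> set:
    """Return module names blacklisted via kernel command-line parameters."""
    modules = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("modprobe.blacklist", "module_blacklist"):
            # Both parameters take a comma-separated list of module names;
            # '-' and '_' are treated as equivalent.
            modules.update(m.replace("-", "_") for m in value.split(",") if m)
    return modules


# Made-up command line for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro modprobe.blacklist=nvidia,nouveau"
print(sorted(cmdline_blacklisted_modules(sample)))

if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        # On a live Linux system the running kernel's parameters are here.
        print(sorted(cmdline_blacklisted_modules(proc_cmdline.read_text())))
```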
snap list dcgm
Name Version Rev Tracking Publisher Notes
dcgm 3.3.8 31 latest/stable canonical✓ -
Ubuntu Jammy 22.04
*-display UNCLAIMED
description: 3D controller
product: GA100GL [A30 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:a1:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm cap_list
configuration: latency=0
resources: iomemory:af00-aeff iomemory:b000-afff memory:b0000000-b0ffffff memory:af000000000-af7ffffffff memory:b0010000000-b0011ffffff memory:b1000000-b11fffff memory:af800000000-affffffffff memory:b0000000000-b000fffffff