ESXi passthrough of NPU to VM - failed with error -110 #46

Open
lamw opened this issue Sep 8, 2024 · 6 comments

lamw commented Sep 8, 2024

Are there additional debug/verbose logs from the NPU Linux driver? I've been able to successfully do PCIe passthrough of the NPU from an Intel 14th Gen system, but the driver fails to boot the firmware (-110) and gives no further details. I'm trying to understand whether the cause is the ESXi hypervisor and passthrough, or something else.

Here's a snippet from dmesg (taken after installing the required drivers on Ubuntu 24.04):

[    2.036821] intel_vpu 0000:02:05.0: enabling device (0000 -> 0002)
[    2.052504] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[    3.078146] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    3.078427] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    3.078889] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    3.093452] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[    3.095388] intel_vpu: probe of 0000:02:05.0 failed with error -110

kwachows commented Sep 9, 2024

Could you please try loading the NPU kernel driver with the force_snoop=1 module parameter set? (That is: rmmod intel_vpu; modprobe intel_vpu force_snoop=1)

lamw commented Sep 9, 2024

Using the 1.6.0 instructions, it looks like force_snoop=1 isn't recognized by the driver?

root@ubuntu:~# rmmod intel_vpu; modprobe intel_vpu force_snoop=1
root@ubuntu:~# dmesg|grep vpu
[    1.911597] intel_vpu 0000:02:05.0: enabling device (0000 -> 0002)
[    1.921169] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[    2.980059] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    2.980328] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    2.980786] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    2.987821] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[    2.987995] intel_vpu: probe of 0000:02:05.0 failed with error -110
[  160.695061] intel_vpu: unknown parameter 'force_snoop' ignored
[  160.697978] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[  161.722348] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[  161.722367] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[  161.722387] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[  161.728714] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[  161.728995] intel_vpu: probe of 0000:02:05.0 failed with error -110

kwachows commented

The issue you are observing may be related to the hypervisor cache configuration.
There is a patch that adds the force_snoop module parameter to the intel_vpu driver.
You could try applying this patch, or updating the kernel to 6.11, which already contains it, and then retry with the parameter set.
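
Before retrying, it can help to confirm that the installed intel_vpu module actually knows the parameter, since an unsupported one is only reported as "unknown parameter 'force_snoop' ignored" in dmesg. A minimal sketch, assuming `modinfo -p` prints one `name:description (type)` line per parameter; `has_module_param` is a hypothetical helper name, not part of the driver tooling:

```shell
#!/bin/sh
# Hypothetical helper: reads `modinfo -p <module>` output on stdin and
# succeeds (exit 0) if the named parameter is listed.
has_module_param() {
    grep -q "^$1:"
}

# Example usage on the guest (left commented out so the helper can be sourced safely):
# modinfo -p intel_vpu | has_module_param force_snoop \
#     && echo "force_snoop is supported" \
#     || echo "force_snoop is not supported by this module build"
```

If the check fails, the running kernel's intel_vpu build most likely predates the patch.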

lamw commented Sep 10, 2024

Hm ... so I just installed the 6.11 kernel:

# uname -r
6.11.0-061100rc6-generic

When I run rmmod intel_vpu; modprobe intel_vpu force_snoop=1, I see the following in dmesg:

[    4.187932] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    4.188097] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    4.188361] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    4.198809] intel_vpu 0000:13:00.0: [drm] ivpu_hw_power_down(): NPU not idle during power down
[    4.199002] intel_vpu 0000:13:00.0: probe with driver intel_vpu failed with error -110
[  386.779470] intel_vpu 0000:13:00.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[  386.904593] [drm] Initialized intel_vpu 1.0.0 for 0000:13:00.0 on minor 0

Interestingly, even though the earlier errors are still in the log, I now see the VPU initialized. Does this mean it's good?

I'm able to see the accel0 device :D

# ls /dev/accel/accel0
/dev/accel/accel0

If I reboot the system without force_snoop=1, it fails as before.

kwachows commented Sep 11, 2024

The force_snoop=1 parameter only takes effect when it is explicitly specified on the modprobe command line. To have it applied by default whenever the module is loaded, you can create a configuration file in the /etc/modprobe.d/ directory. Please follow these steps:

  1. Create a file /etc/modprobe.d/intel_vpu.conf
  2. Add the following line to the file: options intel_vpu force_snoop=1

After you reboot your system, the force_snoop=1 parameter should be automatically applied when the intel_vpu module is loaded.
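
The two steps above can be sketched as a small script. This is only an illustration: `write_vpu_conf` is a hypothetical helper name, and the optional directory argument exists solely so the function can be tried against a scratch directory before touching /etc/modprobe.d as root.

```shell
#!/bin/sh
# Hypothetical helper: writes the modprobe options file for intel_vpu.
# $1 is the target directory (defaults to /etc/modprobe.d).
write_vpu_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options intel_vpu force_snoop=1\n' > "$conf_dir/intel_vpu.conf"
}

# Example usage as root (left commented out so the helper can be sourced safely):
# write_vpu_conf
# rmmod intel_vpu; modprobe intel_vpu    # or reboot
```

If the parameter is exported through sysfs on your kernel (an assumption, not confirmed in this thread), reading /sys/module/intel_vpu/parameters/force_snoop after the reload should show whether it took effect. If the module is loaded from the initramfs, the initramfs may also need to be regenerated for the option to apply at boot.
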
As for the log message "[ 386.904593] [drm] Initialized intel_vpu 1.0.0 for 0000:13:00.0 on minor 0" and the presence of /dev/accel/accel0: these do indicate that the driver initialized successfully and the device is set up correctly.
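
Both indications can be checked together after each reload. A rough sketch, assuming the kernel log is piped in on stdin and the device node path is the one shown in this thread; `npu_initialized` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) only if the accel device node exists
# AND the kernel log read from stdin contains the driver's init message.
# $1 is the device node to check (defaults to /dev/accel/accel0).
npu_initialized() {
    node="${1:-/dev/accel/accel0}"
    [ -e "$node" ] && grep -q 'Initialized intel_vpu'
}

# Example usage (dmesg may require root; left commented out so the helper can be sourced safely):
# dmesg | npu_initialized && echo "NPU driver is up" || echo "NPU driver is not initialized"
```

Note that dmesg keeps earlier failed probe attempts, so old "Failed to boot the firmware" lines can appear alongside a later successful initialization.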

lamw commented Sep 11, 2024

Thanks for the steps to persist the parameter. I'm still trying to understand why it is needed. Is something missing in how the device is presented to the guest that prevents it from initializing by default?

I'm also seeing an issue with device passthrough to a Windows system, which throws the classic Error Code 43. Is there a similar parameter for the Windows driver?
