Xvnc crashes with SIGBUS on cross-GPU DRI usage #1772

Open
CendioOssman opened this issue Jun 21, 2024 · 11 comments
Labels
bug (Something isn't working)

Comments

@CendioOssman
Member

Describe the bug
If I start Xvnc with -renderNode set to my integrated AMD GPU and then start an application that uses my discrete Nvidia GPU, Xvnc crashes with SIGBUS:

(EE) 
(EE) Backtrace:
(EE) 0: Xvnc (xorg_backtrace+0x82) [0x557530197d42]
(EE) 1: Xvnc (0x55752ffe1000+0x1b7f4c) [0x557530198f4c]
(EE) 2: /lib64/libc.so.6 (0x7f475db30000+0x40710) [0x7f475db70710]
(EE) 3: /lib64/libpixman-1.so.0 (0x7f475e151000+0x8a2d0) [0x7f475e1db2d0]
(EE) 4: /lib64/libpixman-1.so.0 (pixman_blt+0x81) [0x7f475e15f8d1]
(EE) 5: Xvnc (vncDRI3SyncPixmapFromGPU+0x10e) [0x55753004303e]
(EE) 6: Xvnc (0x55752ffe1000+0x622c3) [0x5575300432c3]
(EE) 7: Xvnc (dri3_pixmap_from_fds+0xcf) [0x5575300cfdaf]
(EE) 8: Xvnc (0x55752ffe1000+0xf1309) [0x5575300d2309]
(EE) 9: Xvnc (Dispatch+0x426) [0x557530133f56]
(EE) 10: Xvnc (dix_main+0x46a) [0x557530142d4a]
(EE) 11: /lib64/libc.so.6 (0x7f475db30000+0x2a088) [0x7f475db5a088]
(EE) 12: /lib64/libc.so.6 (__libc_start_main+0x8b) [0x7f475db5a14b]
(EE) 13: Xvnc (_start+0x25) [0x55753003ed75]
(EE) 
(EE) Bus error at address 0x7f4753011000
(EE) 
Fatal server error:
(EE) Caught signal 7 (Bus error). Server aborting
(EE) 

To Reproduce
Steps to reproduce the behavior:

  1. Xvnc -renderNode /dev/dri/renderD128 :2 (assuming renderD128 is the AMD iGPU)
  2. DISPLAY=:2 vkcube --gpu-number 1 (assuming GPU 1 is the Nvidia dGPU)

Expected behavior
vkcube renders normally on the Xvnc display.

Client (please complete the following information):
No client needed.

Server (please complete the following information):

  • OS: Fedora 40
  • VNC server: TigerVNC
  • VNC server version: 1.14.0 beta
  • Server downloaded from: Built from contrib spec file
  • Server was started using: See above

Additional context
Also crashes with an Intel ARC discrete GPU instead of the Nvidia one.

Does not crash if Xvnc is started with the discrete GPU and the application uses the integrated GPU. Possible bug in AMD driver?

@CendioOssman
Member Author

More details available in this thread:

https://lists.freedesktop.org/archives/mesa-dev/2024-June/226245.html

CendioOssman added the bug label on Jun 21, 2024
@CendioHalim
Contributor

A bug has been reported to the kernel: https://bugzilla.kernel.org/show_bug.cgi?id=218993

@dcommander
Contributor

I observe a bus error when attempting to start a VMware virtual machine with 3D acceleration. VMware uses Vulkan, and the failure seems to occur at exactly the same place as the failure described in this issue. (The symptoms are identical when I start a VMware virtual machine with 3D acceleration vs. when I run vkcube --gpu_number 1.) Symptomatically, a pixmap is allocated from a file descriptor, and a buffer object is successfully imported. However, when attempting to synchronize the buffer object and the pixmap, the pointer obtained from gbm_bo_map() appears to be invalid, so the pixel copy crashes.
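
For illustration only, here is a minimal, hypothetical C sketch of the sync path described above (it is not the actual Xvnc code, and the function and parameter names are made up): the client's dma-buf fd is imported into a GBM buffer object on the server's render node, the bo is mapped with gbm_bo_map(), and the pixels are copied out. In the cross-GPU case both the import and the map appear to succeed, and it is the copy from the mapped pointer that raises SIGBUS.

/* Hypothetical sketch of a DRI3 pixmap-from-fd sync path; compile
 * against libgbm (-lgbm).  Not the actual Xvnc implementation. */
#include <stdint.h>
#include <string.h>
#include <gbm.h>

/* Copy the contents of a client-supplied dma-buf into a CPU-side
 * pixel buffer (dst), row by row. */
static int sync_pixels_from_fd(struct gbm_device *gbm, int dmabuf_fd,
                               uint32_t width, uint32_t height,
                               uint32_t stride, uint8_t *dst,
                               uint32_t dst_stride)
{
    struct gbm_import_fd_data data = {
        .fd = dmabuf_fd,
        .width = width,
        .height = height,
        .stride = stride,
        .format = GBM_FORMAT_ARGB8888,
    };
    struct gbm_bo *bo;
    void *map_data = NULL;
    uint32_t map_stride = 0;
    uint8_t *src;

    /* The import succeeds even when the fd comes from another GPU. */
    bo = gbm_bo_import(gbm, GBM_BO_IMPORT_FD, &data, GBM_BO_USE_RENDERING);
    if (!bo)
        return -1;

    /* gbm_bo_map() also returns a non-NULL pointer... */
    src = gbm_bo_map(bo, 0, 0, width, height, GBM_BO_TRANSFER_READ,
                     &map_stride, &map_data);
    if (!src) {
        gbm_bo_destroy(bo);
        return -1;
    }

    /* ...but reading through that mapping is what triggers the SIGBUS
     * in the cross-GPU case (Xvnc does the equivalent copy with
     * pixman_blt() in vncDRI3SyncPixmapFromGPU()). */
    for (uint32_t y = 0; y < height; y++)
        memcpy(dst + (size_t)y * dst_stride,
               src + (size_t)y * map_stride,
               (size_t)width * 4);

    gbm_bo_unmap(bo, map_data);
    gbm_bo_destroy(bo);
    return 0;
}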

@dcommander
Contributor

It does appear to be the same issue. If I set VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json to force VMware to use the AMD Vulkan driver, then all is well.

dcommander added a commit to TurboVNC/turbovnc that referenced this issue Jul 12, 2024
(based on the implementation in TigerVNC 1.14 beta)

- Synchronize pixels between DRI3 pixmaps and their corresponding GBM
  buffer objects on an as-needed basis, in response to specific X11
  operations rather than on a schedule.

- Implement the simpler DRI3 v1 interface rather than DRI3 v2.  This
  avoids the need to implement the get_formats(), get_modifiers(), and
  get_drawable_modifiers() methods.

- Use Pixman (which is SIMD-accelerated) to synchronize pixels.

- Hook the DestroyPixmap() screen method to clean up a pixmap's
  corresponding GBM buffer object if there are no more references to the
  pixmap.

- Hook the CloseScreen() screen method to clean up the GBM device and
  close the DRM render node.

To do:

- Synchronize only the pixels that have changed.

Known issues:

TigerVNC/tigervnc#1772
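
As a side note, the DestroyPixmap() hook mentioned in the commit message above normally follows the standard X server screen-function wrapping idiom. The sketch below is purely illustrative (built against the xorg-server SDK headers, with a hypothetical dri3ReleasePixmapBO() helper); it is not the actual TurboVNC or TigerVNC code.

/* Illustrative sketch of wrapping the DestroyPixmap screen method so a
 * pixmap's GBM buffer object can be released with it; names such as
 * dri3ReleasePixmapBO() are hypothetical. */
#include <scrnintstr.h>
#include <pixmapstr.h>

static DestroyPixmapProcPtr wrappedDestroyPixmap;

/* Hypothetical helper that frees the GBM bo attached to this pixmap. */
extern void dri3ReleasePixmapBO(PixmapPtr pixmap);

static Bool myDestroyPixmap(PixmapPtr pixmap)
{
    ScreenPtr screen = pixmap->drawable.pScreen;
    Bool ret;

    /* Clean up only when the last reference to the pixmap goes away. */
    if (pixmap->refcnt == 1)
        dri3ReleasePixmapBO(pixmap);

    /* Unwrap, call the original method, then re-wrap. */
    screen->DestroyPixmap = wrappedDestroyPixmap;
    ret = screen->DestroyPixmap(pixmap);
    wrappedDestroyPixmap = screen->DestroyPixmap;
    screen->DestroyPixmap = myDestroyPixmap;

    return ret;
}

static void hookDestroyPixmap(ScreenPtr screen)
{
    wrappedDestroyPixmap = screen->DestroyPixmap;
    screen->DestroyPixmap = myDestroyPixmap;
}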
CendioOssman marked this as a duplicate of #1913 on Feb 19, 2025
@seacat17

seacat17 commented Feb 19, 2025

Has any fix been found for this issue yet?

EDIT: I have an idea, but I don't know how to implement it.

How can I run the server entirely on the dGPU without using the AMD driver? I have two GPUs: the iGPU is an AMD Radeon Vega and the dGPU is an RTX 3050. Can I run the server on the RTX 3050 only?

@CendioOssman
Member Author

Has any fix been found for this issue yet?

Please see the upstream bug reports linked above. But no, we haven't seen any update from upstream with a fix yet.

How can I run the server entirely on the dGPU without using the AMD driver? I have two GPUs: the iGPU is an AMD Radeon Vega and the dGPU is an RTX 3050. Can I run the server on the RTX 3050 only?

Yes, with the renderNode setting mentioned above. I would guess your dGPU is at /dev/dri/renderD129. Note #1773, though.

You could also see if you can completely disable the iGPU in UEFI, if it's not being used.

@seacat17

seacat17 commented Feb 20, 2025

Yes, with the renderNode setting mentioned above. I would guess your dGPU is at /dev/dri/renderD129. Note #1773, though.

How do I set renderNode for vncsession or vncserver?

EDIT: I figured it out. I needed to use the user config file and add the parameter like this:

rendernode=/dev/dri/renderD129

However, I have a weird performance issue. The game runs fine and nvidia-smi reports load on the GPU, but the game seems unable to fully utilize it: GPU load peaks at about 15%, judging by the MangoHud stats.

@CendioOssman
Member Author

Indeed. Nvidia's driver is incompatible with TigerVNC, so you're not getting the same acceleration as with other drivers. Their driver appears to provide some basic acceleration, since it is faster than pure CPU rendering, but it is still far slower than what the GPU is actually capable of.

We can't do much about this until Nvidia either becomes more compatible with the open-source driver model or documents their proprietary magic.

@dcommander
Contributor

My understanding from their driver devs is that their proprietary magic is based on DRI2, which allocates GPU buffers on the X server. DRI3 instead allocates GPU buffers in the X client, at the expense of GLX conformance. (Multiple processes cannot render to the same GLX drawable with DRI3, but fortunately few applications need to do that.) nVidia's drivers also make heavy use of their proprietary and undocumented NV-GLX extension. Thus, even if there were documentation for the proprietary magic, we still wouldn't be able to make it work outside of a physical X server.

I strongly suspect that the hack described in #1773 (setting __GLX_VENDOR_LIBRARY_NAME=nvidia) causes the nVidia front end to be used with an unaccelerated back end. In my testing, not only is OpenGL performance sluggish with that hack, but Xvnc becomes sluggish and unresponsive as well. If you recall, prior to the introduction of GLVND, direct rendering with llvmpipe didn't work out of the box in Xvnc if nVidia's proprietary drivers were installed. You had to do a similar environment variable hack to enable Mesa's front end. With nVidia's front end, indirect rendering was used, which monopolized the X server. In my testing, Xvnc's behavior with __GLX_VENDOR_LIBRARY_NAME=nvidia and DRI3 is very reminiscent of that old indirect rendering environment. Maybe nVidia's front end has an unaccelerated fallback mode that allows it to talk to X servers that don't have nVidia's drivers installed, and that mode is activated because the X server doesn't have DRI2 or NV-GLX. That's just a wild guess, though.
