Weird crashes on system with dual NVIDIA dGPUs under GNOME 44 Wayland #82

Open
KaleidonKep99 opened this issue May 28, 2023 · 8 comments

@KaleidonKep99

Hello. I am having an issue that is closely related to issue #78.

I am using GNOME 44 under Fedora 38, with the latest NVIDIA drivers from RPMFusion.
My computer has two GPUs; the first one is a 1660 Super, which is connected to the first PCIe slot and handles all of my screens, while the second one is a 750 Ti, which I mainly use for small CUDA workloads and for encoding on OBS on Windows.

Since I was thinking about moving from Windows to Linux, I decided to give Fedora a try. I installed it, got the NVIDIA drivers installed through RPMFusion, and it restarted fine.
I noticed, though, that most of the apps wouldn't start up, instead showing the edges of their windows for a split second before disappearing. I tried switching to X11 and that did fix the issue, but since my main screen runs at a high refresh rate, staying on X11 would mean having the UI locked at 60Hz.
I switched back to Wayland, and following the log output from journalctl -f while running one of the applications that crash, I see this error:

...
May 28 15:03:27 [redacted] kernel: [drm:nv_drm_prime_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002300] Failed to allocate fence signaling event
...

Firefox seems to give out more info, claiming that more than one GPU from the same vendor was detected via PCI.

...
May 28 15:08:52 [redacted] kernel: [drm:nv_drm_prime_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002300] Failed to allocate fence signaling event
May 28 15:08:52 [redacted] firefox.desktop[16966]: Crash Annotation GraphicsCriticalError: |[0][GFX1-]: More than 1 GPU from same vendor detected via PCI, cannot deduce device (t=0.215424) |[1][GFX1-]: Wayland protocol error: [destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a
May 28 15:08:52 [redacted] firefox.desktop[16966]:  (t=0.634323) [GFX1-]: Wayland protocol error: [destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a
May 28 15:08:52 [redacted] firefox[16966]: Error flushing display: Protocol error
May 28 15:08:52 [redacted] firefox.desktop[17072]: Exiting due to channel error.
...

Checking inxi -Fzx, I see that Wayland is rendering on the GPU that has no displays connected to it (the 750 Ti).

...
Graphics:
  Device-1: NVIDIA GM107 [GeForce GTX 750 Ti] vendor: Gigabyte driver: nvidia
    v: 530.41.03 arch: Maxwell bus-ID: 23:00.0
  Device-2: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Micro-Star MSI
    driver: nvidia v: 530.41.03 arch: Turing bus-ID: 2d:00.0
...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: X: loaded: N/A
    unloaded: fbdev,modesetting,nvidia,vesa gpu: nvidia,nvidia-nvswitch
    resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz 3: 2560x1440~180Hz
    4: 1920x1080~60Hz
  API: OpenGL v: 4.6.0 NVIDIA 530.41.03 renderer: NVIDIA GeForce GTX 750
    Ti/PCIe/SSE2 direct-render: Yes
...

I then proceeded to disable the 750 Ti manually, by doing sudo nvidia-smi drain -p 0000:23:00.0 -m 1, and the output from inxi changed to this:

...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: X: loaded: N/A
    unloaded: fbdev,modesetting,nvidia,vesa gpu: nvidia,nvidia-nvswitch
    resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz 3: 2560x1440~180Hz
    4: 1920x1080~60Hz
  API: OpenGL v: N/A renderer: N/A direct-render: N/A
...

Weirdly enough, though, all the applications that kept crashing earlier now work fine. Checking with nvidia-smi, they also seem to be rendering on the right GPU, the one with all the screens connected to it:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 S...    Off| 00000000:2D:00.0  On |                  N/A |
| 45%   49C    P0               30W / 125W|   1902MiB /  6144MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2413      G   /usr/bin/gnome-shell                        569MiB |
|    0   N/A  N/A      2960      G   /usr/bin/gnome-software                       4MiB |
|    0   N/A  N/A      3316      G   /usr/libexec/xdg-desktop-portal-gnome         4MiB |
|    0   N/A  N/A      4037      G   /usr/bin/Xwayland                            72MiB |
|    0   N/A  N/A     12703      G   discord-screenaudio                         705MiB |
|    0   N/A  N/A     14059      G   /app/bin/discord-screenaudio                  1MiB |
|    0   N/A  N/A     17193      G   /usr/lib64/firefox/firefox                  126MiB |
+---------------------------------------------------------------------------------------+

My question is, is there a way to force Wayland to use a specific GPU as the main one? Having to disable the 750 Ti means losing my secondary device for CUDA/encoding, which I need for specific workloads.

Full specs of my computer:
AMD Ryzen 5900X @ 5GHz
MSI MPG X570 Gaming Edge WiFi
NVIDIA GeForce GTX 1660 Super
NVIDIA GeForce GTX 750 Ti
NVIDIA driver 3:530.41.03-1.fc38
Fedora 38 Workstation w/ GNOME 44

Installed Wayland packages:

egl-wayland.x86_64                                              1.1.11-3.fc38
libxcb.i686                                                     1.13.1-11.fc38  
libxcb.x86_64                                                   1.13.1-11.fc38
xorg-x11-server-Xwayland.x86_64                                 22.1.9-2.fc38

@erik-kz

erik-kz commented May 30, 2023

Looking at the choose_primary_gpu_unchecked function in the mutter code-base, it seems that it will use the boot VGA device by default, or an arbitrary device if none of them have that attribute.

However, it also looks like you can add a "mutter-device-preferred-primary" udev tag to force it to use a particular device. See https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1562
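
For reference, a rule along these lines is one way to attach that tag. This is a minimal sketch, not taken from the merge request: the helper name mutter_primary_rule, the rules file name, and the choice to match on ENV{DEVNAME} are all assumptions, and /dev/dri/card1 is only a placeholder for whichever node belongs to the GPU you want.

```shell
#!/bin/sh
# Print a udev rule that tags one DRM node with mutter's
# "preferred primary" tag.
# $1: device node path, e.g. /dev/dri/card1 (placeholder; check which
#     card is which on your own system first).
mutter_primary_rule() {
    printf 'ENV{DEVNAME}=="%s", TAG+="mutter-device-preferred-primary"\n' "$1"
}

# Example (needs root; the file name is an arbitrary choice):
#   mutter_primary_rule /dev/dri/card1 | sudo tee /etc/udev/rules.d/61-mutter-primary-gpu.rules
#   sudo udevadm control --reload
#   sudo udevadm trigger --subsystem-match=drm
```

cardN numbering is not guaranteed to stay the same across boots, so it is worth confirming the node with udevadm info before relying on it. mutter presumably only looks at the tag when the session starts, so log out or reboot after changing the rule.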

@KaleidonKep99
Author

Hi. Thank you for your response.
I tried adding a udev tag, but it does not seem to make a difference; the system still tries to do everything on the 750 Ti first.
The primary boot VGA device is indeed the 1660 Super: no displays are attached to the 750 Ti, and I see the boot screen on the 1660 Super. However, I did notice that the UEFI firmware reports the GOP from the 750 Ti and not from the 1660 Super.

Looking at the lspci output, the 750 Ti seems to be on bus 23:00.0, while the 1660 Super is on bus 2d:00.0. This means that the 750 Ti gets priority when loading the firmware, since it is connected to the chipset, which is the first thing that gets initialized on boot. Could that be the issue?
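
One way to see which GPU the kernel actually flagged as the boot VGA device is to read the boot_vga attribute in sysfs. A minimal sketch, assuming the usual /sys/class/drm layout; the function name list_drm_cards and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# List each DRM primary node with its PCI address and boot_vga flag.
# $1: sysfs root (defaults to /sys; a parameter only so the function
#     can be pointed at a test tree).
list_drm_cards() {
    root="${1:-/sys}"
    for card in "$root"/class/drm/card[0-9]*; do
        # Skip connector entries such as card1-DP-1.
        case "${card##*/}" in *-*) continue ;; esac
        # Skip unmatched globs and nodes without a backing device.
        [ -e "$card/device" ] || continue
        # The "device" symlink points at the PCI device directory.
        pci=$(basename "$(readlink -f "$card/device")")
        # boot_vga is 1 for the boot VGA device, 0 otherwise.
        flag=$(cat "$card/device/boot_vga" 2>/dev/null || echo '?')
        echo "${card##*/} $pci boot_vga=$flag"
    done
}

# Example: list_drm_cards
```

If the card holding the displays shows boot_vga=0, that would at least be consistent with the theory above.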

@erik-kz

erik-kz commented May 31, 2023

Could we perhaps test that theory by simply swapping the two cards?

@KaleidonKep99
Author

That indeed fixes the issue.

Now the issue is GNOME ignoring the mutter primary setting…

I’ll try some stuff in the meantime. Maybe I missed a crucial step while making the udev rule.

@KaleidonKep99
Author

I don't know what's wrong. It seems like I'm doing everything properly, yet my setting is ignored.
I am now trying to force the rendering to be on the 750 Ti, and I moved my screens to it as well, but it still renders on the 1660 Super, which is connected to the PCIe x16 slot of the chipset.

The udev rule is being applied at boot. I checked with udevadm and it reports the right values:

P: /devices/pci0000:00/0000:00:03.1/0000:2d:00.0/drm/card1
M: card1
R: 1
U: drm
T: drm_minor
D: c 226:1
N: dri/card1
L: 0
S: dri/by-path/pci-0000:2d:00.0-card
E: DEVPATH=/devices/pci0000:00/0000:00:03.1/0000:2d:00.0/drm/card1
E: DEVNAME=/dev/dri/card1
E: DEVTYPE=drm_minor
E: MAJOR=226
E: MINOR=1
E: SUBSYSTEM=drm
E: USEC_INITIALIZED=8723126
E: ID_PATH=pci-0000:2d:00.0
E: ID_PATH_TAG=pci-0000_2d_00_0
E: NVME_HOST_IFACE=none
E: ID_FOR_SEAT=drm-pci-0000_2d_00_0
E: DEVLINKS=/dev/dri/by-path/pci-0000:2d:00.0-card
E: TAGS=:mutter-device-preferred-primary:uaccess:seat:master-of-seat:
E: CURRENT_TAGS=:mutter-device-preferred-primary:uaccess:seat:master-of-seat:

Yet inxi -Fzx still reports the 1660 Super as the main renderer, even with no displays attached to it.

Graphics:
  Device-1: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Micro-Star MSI
    driver: nvidia v: 530.41.03 arch: Turing bus-ID: 23:00.0
  Device-2: NVIDIA GM107 [GeForce GTX 750 Ti] vendor: Gigabyte
    driver: nvidia v: 530.41.03 arch: Maxwell bus-ID: 2d:00.0
...
  Display: wayland server: X.Org v: 22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: gpu: nvidia,nvidia-nvswitch
    resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6.0 NVIDIA 530.41.03 renderer: NVIDIA GeForce GTX 1660
    SUPER/PCIe/SSE2 direct-render: Yes

@erik-kz

erik-kz commented Jun 1, 2023

The only other thing I can think of would be to apply the tag to the render node (/dev/dri/renderDXXX) instead of or in addition to the primary node (/dev/dri/card1).
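
To find which render node belongs to a given GPU, one option is to compare the PCI device behind each /sys/class/drm/renderD* entry. A minimal sketch; render_node_for and its optional sysfs root argument are hypothetical names, and whether mutter looks at tags on render nodes at all is not confirmed here.

```shell
#!/bin/sh
# Print the /dev/dri render node backed by a given PCI device.
# $1: PCI address, e.g. 0000:2d:00.0
# $2: sysfs root (defaults to /sys; a parameter only for testing)
# Returns non-zero if no render node matches.
render_node_for() {
    root="${2:-/sys}"
    for node in "$root"/class/drm/renderD*; do
        [ -e "$node/device" ] || continue
        # Compare the PCI address behind this render node with $1.
        if [ "$(basename "$(readlink -f "$node/device")")" = "$1" ]; then
            echo "/dev/dri/${node##*/}"
            return 0
        fi
    done
    return 1
}

# Example: render_node_for 0000:2d:00.0
# A rule for the node it prints could then look like this, with
# renderDXXX replaced by the real name:
#   ENV{DEVNAME}=="/dev/dri/renderDXXX", TAG+="mutter-device-preferred-primary"
```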

If that doesn't work, it might be worth bringing this up with the GNOME devs. They would probably be able to provide more informed guidance.

Oh yeah, I should also mention that the Failed to allocate fence signaling event error message is safe to ignore. Also it should be gone with the latest 535 driver.

@KaleidonKep99
Author

I'll try applying it to renderD129 instead, then. I'll get back with the results ASAP.

@KaleidonKep99
Author

KaleidonKep99 commented Jun 1, 2023

Nothing, same error:

Jun 01 16:37:08 [redacted] systemd[2148]: Started dbus-:[email protected].
Jun 01 16:37:08 [redacted] nautilus[4838]: Connecting to org.freedesktop.Tracker3.Miner.Files
Jun 01 16:37:09 [redacted] gnome-shell[2329]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
Jun 01 16:37:09 [redacted] gnome-shell[2329]: WL: error in client communication (pid 4838)
Jun 01 16:37:09 [redacted] nautilus[4838]: Error flushing display: Protocol error
Jun 01 16:37:09 [redacted] systemd[2148]: Started dbus-:[email protected].
Jun 01 16:37:09 [redacted] systemd[2148]: dbus-:[email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 01 16:37:09 [redacted] systemd[2148]: dbus-:[email protected]: Failed with result 'exit-code'.

I'll forward the issue to the GNOME devs.

EDIT: https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/6734
