Application crashes when using explicit sync #110

Molytho · 2024-05-22T13:11:09Z

While I don't use KDE, I have similar issues with sway, although the only applications in which I have encountered such crashes are Firefox and Thunderbird.

The issue here is a race between sending requests to the compositor:

[3880880.548] -> [email protected]_region(new id wl_region@83)
[3880880.573] -> [email protected](0, 0, 1916, 1078)
[3880880.584] -> [email protected]_opaque_region(wl_region@83)
[3880880.591] -> [email protected]()
[3880880.603] -> [email protected](wl_buffer@71, 0, 0)
[3880880.631] -> [email protected]()
[3880880.687] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@53, 0, 8)
[3880880.698] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@82, 0, 2)
[3880880.704] -> [email protected](0, 0, 1916, 1078)
[3880880.712] -> [email protected]()
[3880880.732] -> [email protected](new id wl_callback@79)

full log

This is a portion of the WAYLAND_DEBUG log when such a crash occurs (from thunderbird).
What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer.
The commit that follows the set_opaque_region call happens to land between attaching the buffer and setting the corresponding syncobj points. This is by definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

So the real issue is that there is no way to send multiple requests (such as attach, set_acquire_point, and set_release_point) to the compositor atomically.
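
To make the ordering problem concrete, here is a minimal Python sketch of the commit-time check as I understand it from the protocol text. It models a single surface only, and `check_requests` and the plain-string request names are hypothetical; nothing here comes from libwayland or from any compositor's code.

```python
def check_requests(requests):
    """Return the index of the first commit that would trigger
    no_acquire_point (error 4), or None if no commit does.

    Simplified model: one surface, request names as plain strings.
    """
    buffer_pending = False
    acquire_pending = False
    for index, name in enumerate(requests):
        if name == "attach":
            buffer_pending = True
        elif name == "set_acquire_point":
            acquire_pending = True
        elif name == "commit":
            if buffer_pending and not acquire_pending:
                return index
            # A commit applies the pending state and clears it.
            buffer_pending = False
            acquire_pending = False
    return None


# Order seen in the log above: a commit lands between attach and the points.
racy = ["set_opaque_region", "attach", "commit",
        "set_acquire_point", "set_release_point", "damage_buffer", "commit"]

# Order the rendering thread presumably intended.
intended = ["set_opaque_region", "commit", "attach",
            "set_acquire_point", "set_release_point", "damage_buffer", "commit"]

print(check_requests(racy))      # 2
print(check_requests(intended))  # None
```

The sketch stops at the first offending commit and ignores the release point, so it only illustrates why the interleaved order is rejected.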

MaLoLHD · 2024-05-22T17:12:58Z

I have also seen this issue on GNOME. I used WAYLAND_DEBUG to capture a log as well; here are the last few lines:

[3561462.592] -> [email protected]_region(new id wl_region@77)
[3561462.607] -> [email protected](0, 0, 1920, 1048)
[3561462.613] -> [email protected]_opaque_region(wl_region@77)
[3561462.617] -> [email protected]()
[3561462.625] -> [email protected](wl_buffer@74, 0, 0)
[3561462.638] -> [email protected]()
[3561462.646] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@55, 0, 13)
[3561462.652] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@72, 0, 5)
[3561462.658] -> [email protected](0, 0, 1920, 1048)
[3561462.663] -> [email protected]()
[3561462.668] -> [email protected](new id wl_callback@49)
[3561462.792] [email protected]_id(77)
[3561462.801] [email protected](wp_linux_drm_syncobj_surface_v1@68, 4, "No Acquire point provided")
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided
(t=1.03728) [GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided

Full log

System:
GNOME 46.1 | Mutter 46.1
Fedora 40 | Kernel 6.8.10-300.fc40.x86_64
NVIDIA Driver 555.42.02 installed from NVIDIA's website (not from RPMFusion's packages)
NVIDIA Geforce GTX 1060 6GB

Note that I had to add the following line to /etc/modprobe.d/nvidia.conf to get the Wayland session to show up in GDM. I had to do this for other driver versions as well, so I don't think it has influenced the bug:

options nvidia "NVreg_PreserveVideoMemoryAllocations=1"

Arcitec · 2024-05-25T23:38:39Z

What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer.
The commit that follows the set_opaque_region call happens to land between attaching the buffer and setting the corresponding syncobj points. This is by definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

If I understand the log correctly, does it mean that applications such as Firefox are violating the Wayland explicit sync protocol by committing contents to an explicit sync surface before it has been allocated?

And if that's the case, this is something that needs fixes elsewhere (in GUI toolkits?), not in the NVIDIA driver.

Molytho · 2024-05-25T23:54:15Z

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads making Wayland compositor calls in parallel (which is perfectly fine, since the functions are thread safe), leading to an invalid sequence of individual calls.
It's also not a bug in NVIDIA's code.
Sadly, I don't think it is easily fixable. It likely needs some changes in the wayland-client library and should probably be discussed in Wayland's bug tracker.

TsunamiMommy · 2024-05-26T03:16:58Z

I specifically remember this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the Firefox behavior would be marked as a protocol violation, so it would be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

MaLoLHD · 2024-05-26T08:15:11Z

I have found that this issue also happens on KDE Plasma 6.0.5 with GTK4/libadwaita applications when a hamburger menu tries to close. This does not happen on GNOME. I have tested it with Curtail, Paper Clip, Foliate and the libadwaita demo.

Log from the adwaita demo:

[3772322.309] [email protected](2361, 1261758, 272, 1)
[3772322.327] [email protected]()
[3772322.392] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772322.408] -> [email protected](wl_buffer@55, 0, 0)
[3772322.422] -> [email protected]_buffer_scale(1)
[3772322.434] -> [email protected](0, 0, 32, 32)
[3772322.449] -> [email protected]()
[3772322.482] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772322.503] -> [email protected](wl_buffer@55, 0, 0)
[3772322.516] -> [email protected]_buffer_scale(1)
[3772322.525] -> [email protected](0, 0, 32, 32)
[3772322.538] -> [email protected]()
[3772322.559] -> [email protected]()
[3772322.583] -> [email protected]()
[3772322.595] -> [email protected](nil, 0, 0)
[3772322.609] -> [email protected]()
[3772323.809] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772323.836] -> [email protected](wl_buffer@55, 0, 0)
[3772323.845] -> [email protected]_buffer_scale(1)
[3772323.853] -> [email protected](0, 0, 32, 32)
[3772323.874] -> [email protected]()
[3772328.831] -> [email protected](new id wl_callback@79)
[3772328.851] -> [email protected](wl_surface@36, new id wp_presentation_feedback@77)
[3772328.854] -> [email protected](0, 0)
[3772329.031] -> [email protected](wl_buffer@63, 0, 0)
[3772329.043] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@51, 0, 24)
[3772329.046] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@56, 0, 8)
[3772329.063] -> [email protected](0, 0, 922, 698)
[3772329.066] -> [email protected]()
[3772329.068] -> [email protected](new id wl_callback@72)
[3772329.177] [email protected]_id(67)
[3772329.183] [email protected]_id(66)
[3772329.186] [email protected](wp_linux_drm_syncobj_surface_v1@68, 4, "explicit sync is used, but no acquire point is set")
Gdk-Message: 08:54:55.679: Error flushing display: Protocol error

Molytho · 2024-05-26T11:25:02Z

I specifically remember this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the Firefox behavior would be marked as a protocol violation, so it would be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

@MaLoLHD This portion of the log is not very useful. The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached.
Could you send the full log?

MaLoLHD · 2024-05-26T11:39:35Z

The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached.

Here's the full log

Molytho · 2024-05-26T11:48:26Z

The object is attached to wl_surface@62, which got a null buffer attached.
This is not a protocol violation, so it's a bug in KDE's implementation.
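
For reference, a rough Python sketch of the commit-time rules as I read them from the protocol description. The error codes are the ones I recall from the protocol's error enum (4 matches the logs above), `commit_error` is a hypothetical helper, and treating a null attach as "no buffer pending" is my assumption.

```python
from typing import Optional

# Error codes as I recall them from the protocol's error enum.
NO_BUFFER = 3
NO_ACQUIRE_POINT = 4
NO_RELEASE_POINT = 5


def commit_error(buffer: Optional[str], acquire_set: bool,
                 release_set: bool) -> Optional[int]:
    """Return the error a commit should raise, or None if it is valid.

    `buffer` is None when no buffer, or a null buffer, is pending.
    """
    if buffer is None:
        # Points without a buffer are an error; no points at all is fine.
        return NO_BUFFER if (acquire_set or release_set) else None
    if not acquire_set:
        return NO_ACQUIRE_POINT
    if not release_set:
        return NO_RELEASE_POINT
    return None


print(commit_error(None, False, False))           # None
print(commit_error("wl_buffer@63", True, True))   # None
print(commit_error("wl_buffer@71", False, False)) # 4
```

Under this reading, a null attach with no sync points set gives `None`, so the compositor should not raise an error for it. The order in which the acquire and release checks run is arbitrary in this sketch.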

Arcitec · 2024-05-26T14:25:46Z

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

That was a really important find. Three people from NVIDIA, the XWayland maintainer, one of the KDE Explicit Sync developers, and others are all talking about it there. I've read through every reply and can summarize it as follows:

Wayland is thread-safe, but that brings a risk of doing stupid, protocol-breaking things if you don't synchronize your calls into the proper order manually.
Firefox has two simultaneous rendering threads and is doing stupid, protocol-breaking things and is violating the Wayland protocol. They aren't waiting for the 1st thread's surface allocation before they start writing to it in the 2nd thread.
It's Firefox's bug, not Wayland or NVIDIA or GNOME or KDE etc.
There is no interest in modifying Wayland to allow or safeguard against such protocol violations.

I bet there's a Mozilla bug tracker thread about it somewhere too.

MaLoLHD · 2024-05-26T14:44:14Z

I bet there's a Mozilla bug tracker thread about it somewhere too.

It seems that it is being discussed here and here on Mozilla's bug tracker.

Zamundaaa · 2024-06-18T16:44:28Z

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads making Wayland compositor calls in parallel (which is perfectly fine, since the functions are thread safe), leading to an invalid sequence of individual calls.
It's also not a bug in NVIDIA's code.

Wayland as the messaging protocol is thread safe, but access to the wl_surface from different threads is not. What Firefox is doing has always been broken, and always had the potential to cause crashes and bugs. With explicit sync it just gets way more chances for that to actually cause visible problems.
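
As an illustration of what serializing access to a surface could look like, here is a small Python simulation. It does not use libwayland; `SurfaceLog`, `send_group`, and the two thread functions are hypothetical stand-ins, and the actual Firefox fix may look quite different. The idea is only that each thread holds one lock for its whole group of requests, up to and including the commit.

```python
import threading


class SurfaceLog:
    """Stand-in for a wl_surface: records requests in the order they are sent."""

    def __init__(self):
        self.lock = threading.Lock()
        self.requests = []

    def send_group(self, names):
        # Hold the lock for the whole group so that no other thread's
        # commit can land between attach and the sync points.
        with self.lock:
            for name in names:
                self.requests.append(name)


def region_thread(surface, rounds):
    for _ in range(rounds):
        surface.send_group(["set_opaque_region", "commit"])


def render_thread(surface, rounds):
    for _ in range(rounds):
        surface.send_group(["attach", "set_acquire_point",
                            "set_release_point", "commit"])


def commit_between_attach_and_points(requests):
    """True if any commit follows an attach before set_acquire_point."""
    awaiting_points = False
    for name in requests:
        if name == "attach":
            awaiting_points = True
        elif name == "set_acquire_point":
            awaiting_points = False
        elif name == "commit" and awaiting_points:
            return True
    return False


surface = SurfaceLog()
threads = [threading.Thread(target=region_thread, args=(surface, 1000)),
           threading.Thread(target=render_thread, args=(surface, 1000))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(commit_between_attach_and_points(surface.requests))  # False
```

In a real client the same lock would presumably have to cover both the wl_surface and its wp_linux_drm_syncobj_surface_v1 object, since requests on either one affect the same pending state.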

amshafer · 2024-07-19T01:36:22Z

Closing as this is a Firefox bug. Thanks everyone for following the discussion about it in the protocol MR. It seems that a Firefox fix is on the way.

https://bugzilla.mozilla.org/show_bug.cgi?id=1898476

joebonrichie · 2024-07-23T17:31:12Z

Possibly needs to be reopened; see: https://bugzilla.mozilla.org/show_bug.cgi?id=1908825

Still crashing after backporting the following patches to 128.0
https://hg.mozilla.org/mozilla-central/rev/f9323daf7abe
https://hg.mozilla.org/mozilla-central/rev/a264ff9e9f6f
https://hg.mozilla.org/mozilla-central/rev/eb230ecdf8eb

LazarusCat59 · 2024-08-26T07:38:02Z

With egl-wayland versions 1.1.14 or later (currently I am on 1.1.16), kitty crashes on launch with these error messages:

wp_linux_drm_syncobj_surface_v1#38: error 4: Buffer attached but no acquire point set
[0.320] The output buffer does not support sRGB color encoding, colors will be incorrect.
[0.347] [glfw error 65544]: Wayland: fatal display error: Protocol error

Using __NV_DISABLE_EXPLICIT_SYNC=1 or falling back to egl-wayland-1.1.13.1 fixes kitty completely.

There is a closed issue in the kitty bug tracker for this exact issue: kitty#7767

Not sure where else to report this, so here I am.

racsuline mentioned this issue Jun 8, 2024

App Freezing/Crashing on Opening It. flattool/warehouse#117

Closed

yodatak mentioned this issue Jun 23, 2024

Firefox on flatpak on bazzite nvidia 555 crash sometimes ublue-os/bazzite#1268

Closed

Riteo mentioned this issue Jun 27, 2024

Wayland event polling crash when VSync is disabled, with message "explicit sync is used, but no release point is set" godotengine/godot#93669

Closed

amshafer closed this as completed Jul 19, 2024

amshafer mentioned this issue Jul 19, 2024

Version 1.1.14 is killing my firefox and brave #117

Closed

ptr1337 mentioned this issue Jul 20, 2024

OBS Studio began to crash after egl-wayland 1.1.14 was released #118

Open

amshafer mentioned this issue Aug 2, 2024

egl-wayland: Send explicit sync points before attaching a surface #123

Merged

tgxn mentioned this issue Aug 31, 2024

Crashes on startup with Wayland with nvidia-dkms 555.58.02-1 on Arch Soundux/Soundux#709

Open

atomflunder mentioned this issue Sep 27, 2024

GUI crashes: "The Wayland connection experienced a fatal error: Protocol error" ckb-next/ckb-next#1097

Closed

hawkeye116477 mentioned this issue Oct 30, 2024

Waterfox crashes with GNOME BrowserWorks/Waterfox#3588

Open

bew mentioned this issue Jan 29, 2025

Wezterm closes after monitor turns off and screen locks wezterm/wezterm#6598

Open
