Application crashes when using explicit sync #110

Closed
Molytho opened this issue May 22, 2024 · 14 comments

Comments

@Molytho
Contributor

Molytho commented May 22, 2024

Reference: #104 (comment)

While I don't use KDE, I have similar issues with sway, although the only applications in which I have encountered such crashes were Firefox and Thunderbird.

The issue here is a race between sending requests to the compositor:

[3880880.548] -> [email protected]_region(new id wl_region@83)
[3880880.573] -> [email protected](0, 0, 1916, 1078)
[3880880.584] -> [email protected]_opaque_region(wl_region@83)
[3880880.591] -> [email protected]()
[3880880.603] -> [email protected](wl_buffer@71, 0, 0)
[3880880.631] -> [email protected]()
[3880880.687] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@53, 0, 8)
[3880880.698] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@82, 0, 2)
[3880880.704] -> [email protected](0, 0, 1916, 1078)
[3880880.712] -> [email protected]()
[3880880.732] -> [email protected](new id wl_callback@79)

full log

This is a portion of the WAYLAND_DEBUG log captured when such a crash occurs (from Thunderbird).
What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer.
The commit after the set_opaque_region call happens to land between attaching the buffer and setting the corresponding syncobj points. This is by definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

So the real issue is that there is no way to send multiple requests (like attach, set_acquire_point, set_release_point) atomically to the compositor.
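
To make the race concrete, here is a minimal Python sketch (not taken from any compositor) that replays a request sequence and reports the first commit that would leave a pending buffer without its syncobj points. The request names mirror the log above. The function name `first_bad_commit` is hypothetical, and the rule it checks is a simplified reading of the `no_acquire_point` and `no_release_point` errors of wp_linux_drm_syncobj_surface_v1, so treat it as an illustration only.

```python
def first_bad_commit(requests):
    """Return (index, error) for the first invalid commit, or None.

    Simplified model (an assumption, not compositor code): a commit with
    a pending buffer needs both an acquire and a release point.
    """
    buffer_pending = acquire_set = release_set = False
    for index, name in enumerate(requests):
        if name == "attach":
            buffer_pending = True
        elif name == "set_acquire_point":
            acquire_set = True
        elif name == "set_release_point":
            release_set = True
        elif name == "commit":
            if buffer_pending and not acquire_set:
                return index, "no_acquire_point"
            if buffer_pending and not release_set:
                return index, "no_release_point"
            # A valid commit applies the pending state and clears it.
            buffer_pending = acquire_set = release_set = False
    return None


# Order seen in the log: the commit that belongs to set_opaque_region
# lands between attach and the two syncobj requests.
racy = ["set_opaque_region", "attach", "commit",
        "set_acquire_point", "set_release_point", "commit"]

# Order the two threads presumably intended.
intended = ["set_opaque_region", "commit", "attach",
            "set_acquire_point", "set_release_point", "commit"]

print(first_bad_commit(racy))      # (2, 'no_acquire_point')
print(first_bad_commit(intended))  # None
```

Both sequences contain exactly the same requests; only the position of the first commit differs.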

@MaLoLHD

MaLoLHD commented May 22, 2024

I have also seen this issue on GNOME. I also used WAYLAND_DEBUG to capture a log for it; here are the last few lines:

[3561462.592] -> [email protected]_region(new id wl_region@77)
[3561462.607] -> [email protected](0, 0, 1920, 1048)
[3561462.613] -> [email protected]_opaque_region(wl_region@77)
[3561462.617] -> [email protected]()
[3561462.625] -> [email protected](wl_buffer@74, 0, 0)
[3561462.638] -> [email protected]()
[3561462.646] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@55, 0, 13)
[3561462.652] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@72, 0, 5)
[3561462.658] -> [email protected](0, 0, 1920, 1048)
[3561462.663] -> [email protected]()
[3561462.668] -> [email protected](new id wl_callback@49)
[3561462.792] [email protected]_id(77)
[3561462.801] [email protected](wp_linux_drm_syncobj_surface_v1@68, 4, "No Acquire point provided")
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided
(t=1.03728) [GFX1-]: Wayland protocol error: wp_linux_drm_syncobj_surface_v1@68: error 4: No Acquire point provided

Full log

System:
GNOME 46.1 | Mutter 46.1
Fedora 40 | Kernel 6.8.10-300.fc40.x86_64
NVIDIA Driver 555.42.02 installed from NVIDIA's website (not from RPMFusion's packages)
NVIDIA GeForce GTX 1060 6GB

Note that I had to add this line to /etc/modprobe.d/nvidia.conf to get the Wayland session to show up in GDM. Regardless, I had to do this for other driver versions as well, and I don't think it has influenced the bug:

options nvidia "NVreg_PreserveVideoMemoryAllocations=1"

@Arcitec

Arcitec commented May 25, 2024

What I imagine happens here is that one thread calls set_opaque_region on the surface while another attaches a new buffer.
The commit after the set_opaque_region call happens to land between attaching the buffer and setting the corresponding syncobj points. This is by definition a protocol violation of wp_linux_drm_syncobj_surface_v1.

If I understand the log correctly, does this mean that applications such as Firefox are violating the Wayland explicit sync protocol by committing contents to an explicit sync surface before it's been allocated?

And if that's the case, this is something that needs fixes elsewhere (in GUI toolkits?), not in the NVIDIA driver.

@Molytho
Contributor Author

Molytho commented May 25, 2024

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads making Wayland compositor calls in parallel (which is perfectly fine; the functions are thread-safe), leading to an invalid sequence of individual calls.
It's also not a bug in NVIDIA's code.
Sadly, I don't think it is easily fixable. It likely needs some changes in the wayland-client library and should probably be discussed in Wayland's bug tracker.

@TsunamiMommy

I specifically remember this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the Firefox behavior would be marked as a protocol violation. So it'd be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

@MaLoLHD

MaLoLHD commented May 26, 2024

I have found that this issue also happens on KDE Plasma 6.0.5 with GTK4/libadwaita applications when a hamburger menu tries to close. This does not happen on GNOME. I have tested it with Curtail, Paper Clip, Foliate and the libadwaita demo.

Log from the adwaita demo:

[3772322.309] [email protected](2361, 1261758, 272, 1)
[3772322.327] [email protected]()
[3772322.392] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772322.408] -> [email protected](wl_buffer@55, 0, 0)
[3772322.422] -> [email protected]_buffer_scale(1)
[3772322.434] -> [email protected](0, 0, 32, 32)
[3772322.449] -> [email protected]()
[3772322.482] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772322.503] -> [email protected](wl_buffer@55, 0, 0)
[3772322.516] -> [email protected]_buffer_scale(1)
[3772322.525] -> [email protected](0, 0, 32, 32)
[3772322.538] -> [email protected]()
[3772322.559] -> [email protected]()
[3772322.583] -> [email protected]()
[3772322.595] -> [email protected](nil, 0, 0)
[3772322.609] -> [email protected]()
[3772323.809] -> [email protected]_cursor(2356, wl_surface@31, 4, 4)
[3772323.836] -> [email protected](wl_buffer@55, 0, 0)
[3772323.845] -> [email protected]_buffer_scale(1)
[3772323.853] -> [email protected](0, 0, 32, 32)
[3772323.874] -> [email protected]()
[3772328.831] -> [email protected](new id wl_callback@79)
[3772328.851] -> [email protected](wl_surface@36, new id wp_presentation_feedback@77)
[3772328.854] -> [email protected](0, 0)
[3772329.031] -> [email protected](wl_buffer@63, 0, 0)
[3772329.043] -> [email protected]_acquire_point(wp_linux_drm_syncobj_timeline_v1@51, 0, 24)
[3772329.046] -> [email protected]_release_point(wp_linux_drm_syncobj_timeline_v1@56, 0, 8)
[3772329.063] -> [email protected](0, 0, 922, 698)
[3772329.066] -> [email protected]()
[3772329.068] -> [email protected](new id wl_callback@72)
[3772329.177] [email protected]_id(67)
[3772329.183] [email protected]_id(66)
[3772329.186] [email protected](wp_linux_drm_syncobj_surface_v1@68, 4, "explicit sync is used, but no acquire point is set")
Gdk-Message: 08:54:55.679: Error flushing display: Protocol error

@Molytho
Contributor Author

Molytho commented May 26, 2024

I specifically remember this exact issue being discussed on the MR for explicit sync. I think the conclusion of that discussion was that the Firefox behavior would be marked as a protocol violation. So it'd be up to Mozilla and Thunderbird to fix it. The protocol is working as designed.

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

@MaLoLHD This portion of the log is not very useful. The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached.
Could you send the full log?

@MaLoLHD

MaLoLHD commented May 26, 2024

The object (wp_linux_drm_syncobj_surface_v1@68) that violates the protocol is never referenced and we don't know to which surface it is attached.

Here's the full log

@Molytho
Contributor Author

Molytho commented May 26, 2024

The object is attached to wl_surface@62, which got a null buffer attached.
Attaching a null buffer without an acquire point is not a protocol violation, so this is a bug in KDE's implementation.
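
The null-buffer case can be written down as a small rule. The sketch below assumes that an acquire point is only required when a non-null buffer is pending at commit time, and that setting an acquire point without a buffer is itself an error (`no_buffer`). That is a simplified reading of the protocol text, and `commit_error` is a hypothetical helper, not a compositor API.

```python
def commit_error(buffer, acquire_point_set):
    """Return the error a commit would raise in this simplified model.

    `buffer` is None for attach(nil); any other value stands for a real
    wl_buffer. Release points are left out to keep the example short.
    """
    if buffer is not None and not acquire_point_set:
        return "no_acquire_point"
    if buffer is None and acquire_point_set:
        return "no_buffer"
    return None


# attach(nil) followed by commit, as in the libadwaita log: no error expected.
print(commit_error(None, False))            # None
# A real buffer without an acquire point is the actual violation.
print(commit_error("wl_buffer@63", False))  # no_acquire_point
```

Under this reading, a compositor that raises error 4 for the first case is being stricter than the protocol requires.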

@Arcitec

Arcitec commented May 26, 2024

Thanks for the note. Found it: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

That was a really important find. Three people from NVIDIA, the XWayland maintainer, one of the KDE explicit sync developers, and others are all discussing it there. I've read through every reply and can summarize it as follows:

  1. Wayland is thread-safe, but that brings a risk of doing stupid, protocol-breaking things if you don't synchronize your calls into the proper order manually.
  2. Firefox has two simultaneous rendering threads and is doing stupid, protocol-breaking things and is violating the Wayland protocol. They aren't waiting for the 1st thread's surface allocation before they start writing to it in the 2nd thread.
  3. It's Firefox's bug, not Wayland or NVIDIA or GNOME or KDE etc.
  4. There is no interest in modifying Wayland to allow or safeguard against such protocol violations.

I bet there's a Mozilla bug tracker thread about it somewhere too.

@MaLoLHD

MaLoLHD commented May 26, 2024

I bet there's a Mozilla bug tracker thread about it somewhere too.

It seems that it is being discussed here and here on Mozilla's bug tracker.

@Zamundaaa

No, Firefox doesn't knowingly violate the explicit sync protocol. There are likely two threads making Wayland compositor calls in parallel (which is perfectly fine; the functions are thread-safe), leading to an invalid sequence of individual calls.
It's also not a bug in NVIDIA's code.

Wayland as the messaging protocol is thread-safe, but access to the wl_surface from different threads is not. What Firefox is doing has always been broken, and always had the potential to cause crashes and bugs. With explicit sync there are just far more chances for that to actually cause visible problems.
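
One way a client could avoid this is to serialize all access to a given surface, so that each group of requests ending in a commit is sent as a unit. The Python sketch below only models the idea with a per-surface lock and a list standing in for the wire. The names `Surface`, `send_transaction`, and `frames_intact` are hypothetical, and this is not a description of how Firefox actually fixed the bug.

```python
import threading

FRAME = ["attach", "set_acquire_point", "set_release_point", "commit"]
REGION = ["set_opaque_region", "commit"]


class Surface:
    """Hypothetical stand-in for a wl_surface and its syncobj surface."""

    def __init__(self):
        self.lock = threading.Lock()
        self.requests = []  # models the order in which requests hit the wire


def send_transaction(surface, names):
    # Hold the lock for the whole group so no other thread can slip a
    # request (in particular a commit) into the middle of it.
    with surface.lock:
        for name in names:
            surface.requests.append(name)


def frames_intact(requests):
    """True if every attach is directly followed by the rest of FRAME."""
    for i, name in enumerate(requests):
        if name == "attach" and requests[i:i + len(FRAME)] != FRAME:
            return False
    return True


def run(iterations=500):
    surface = Surface()

    def worker(names):
        for _ in range(iterations):
            send_transaction(surface, names)

    threads = [threading.Thread(target=worker, args=(names,))
               for names in (FRAME, REGION)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return surface.requests


print(frames_intact(run()))  # True
```

The lock has to cover every thread that touches the surface, including the one that only updates the opaque region, because its commit is what splits the other thread's sequence.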

@amshafer
Collaborator

Closing as this is a Firefox bug. Thanks everyone for following the discussion about it in the protocol MR. It seems that a Firefox fix is on the way.

https://bugzilla.mozilla.org/show_bug.cgi?id=1898476

@joebonrichie

@LazarusCat59

With egl-wayland versions 1.1.14 or later (currently I am on 1.1.16), kitty crashes on launch with these error messages:

wp_linux_drm_syncobj_surface_v1#38: error 4: Buffer attached but no acquire point set
[0.320] The output buffer does not support sRGB color encoding, colors will be incorrect.
[0.347] [glfw error 65544]: Wayland: fatal display error: Protocol error

Using __NV_DISABLE_EXPLICIT_SYNC=1 or falling back to egl-wayland-1.1.13.1 does fix kitty completely.
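
For anyone scripting this workaround, a minimal Python sketch of launching a program with the variable set might look like the following. Only `__NV_DISABLE_EXPLICIT_SYNC=1` comes from the report above; the helper name `env_without_explicit_sync` is hypothetical.

```python
import os


def env_without_explicit_sync(base=None):
    """Return a copy of the environment with the workaround variable set."""
    env = dict(os.environ if base is None else base)
    env["__NV_DISABLE_EXPLICIT_SYNC"] = "1"
    return env


# Example launch (not executed here; assumes kitty is on PATH):
#   import subprocess
#   subprocess.run(["kitty"], env=env_without_explicit_sync())
```

Copying the environment keeps the change local to the launched process rather than altering the whole session.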

There is a closed issue in the kitty bug tracker describing this exact problem: kitty#7767

Not sure where else to report this, so here I am.
