nvidia 545.29.06 broken #221

Closed
skygrango opened this issue Nov 10, 2023 · 37 comments

@skygrango
Contributor

skygrango commented Nov 10, 2023

I wanted to try the keyboard IM support, but I can't even launch the desktop properly.

The desktop shows up, but I can't move my cursor; it seems to be frozen.

Did I miss something?

distro : arch up-to-date
kernel: linux-cachyos 6.6.1-1, boot with nvidia_drm.modeset=1
graphic card : gtx 1080
driver : nvidia 545.29.02-4 / nvidia-utils 545.29.02-2
pkg : cosmic-epoch-git r101.a83f8dc-1 / cosmic-comp 9a04fa2
env :
EGL_PLATFORM=wayland
LIBVA_DRIVER_NAME=nvidia
GBM_BACKEND=nvidia-drm
__GLX_VENDOR_LIBRARY_NAME=nvidia

dmesg shows some errors:

NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
@Drakulix
Member

  1. Is that the only GPU in your system?
  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.
  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?

@skygrango
Contributor Author

skygrango commented Nov 11, 2023

  1. Is that the only GPU in your system?

I have an iGPU too, but I have never used it. I can check again on Monday.

  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.

I followed the Arch wiki guidelines for the setup, and it works for me on KDE.

I'm surprised by your answer. Do you mean cosmic-comp supports EGLStream?
I never thought it would work on Wayland, but I can switch to EGLStream and test again.

  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?

Sure, I will do it next Monday.

@skygrango
Contributor Author

skygrango commented Nov 13, 2023

glxinfo | grep "OpenGL renderer"
OpenGL renderer string: NVIDIA GeForce GTX 1080/PCIe/SSE2

Regarding 2 and 3:
It seems that the new version of the nvidia driver breaks some compatibility,

but even after going back to version r93 of cosmic-epoch, it still cannot start normally...
with env : https://gist.github.com/skygrango/5925679c41db053eebbaddf3ea075dea
without env : https://gist.github.com/skygrango/f6d685bec781edb44937cf59d88513bd

@skygrango
Contributor Author

Unable to become drm master, assuming unprivileged mode is interesting..

skygrango changed the title from "nvidia 545.29.02 broken after submodule update" to "nvidia 545.29.02 broken" Nov 13, 2023
@skygrango
Contributor Author

skygrango commented Nov 13, 2023

[EGL] 0x300c (BAD_PARAMETER) eglQueryDmaBufModifiersEXT: EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format

I think that is an nvidia driver bug,

although KDE still works for me.

@Drakulix
Member

Drakulix commented Nov 13, 2023

  1. Is that the only GPU in your system?

I have an iGPU too, but I have never used it. I can check again on Monday.

That doesn't mean some application might not use it.

Can you do ls -l /dev/dri/by-path, figure out which of those is your nvidia gpu (e.g. together with lspci) and then set COSMIC_RENDER_DEVICE=/dev/dri/renderD12X in your environment (with the nvidia gpu as a render device) to make sure cosmic-comp will not use the iGPU.
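The lookup described above can be sketched in Rust. This is only an illustration under assumptions: the helper name `render_node` is made up here, and it relies on the `pci-<address>-render` naming seen in `/dev/dri/by-path` listings; it is not part of cosmic-comp.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Given a by-path entry name such as "pci-0000:01:00.0-render" and its
/// symlink target such as "../renderD128", return the PCI address and the
/// device path under /dev/dri. Entries that are not render nodes give None.
fn render_node(entry_name: &str, link_target: &str) -> Option<(String, PathBuf)> {
    let pci = entry_name.strip_prefix("pci-")?.strip_suffix("-render")?;
    let node = Path::new(link_target).file_name()?;
    Some((pci.to_string(), Path::new("/dev/dri").join(node)))
}

fn main() {
    // Walk /dev/dri/by-path and print one candidate setting per render node.
    let Ok(entries) = fs::read_dir("/dev/dri/by-path") else {
        eprintln!("/dev/dri/by-path not found");
        return;
    };
    for entry in entries.flatten() {
        let name = entry.file_name().to_string_lossy().into_owned();
        let Ok(target) = fs::read_link(entry.path()) else { continue };
        if let Some((pci, node)) = render_node(&name, &target.to_string_lossy()) {
            println!("{pci} -> COSMIC_RENDER_DEVICE={}", node.display());
        }
    }
}
```

The PCI address printed on the left can then be matched against the lspci output to pick the nvidia GPU.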

but even after going back to version r93 of cosmic-epoch, it still cannot start normally...
with env : https://gist.github.com/skygrango/5925679c41db053eebbaddf3ea075dea
without env : https://gist.github.com/skygrango/f6d685bec781edb44937cf59d88513bd

Older versions have a bug that prevents them from working with the 545 driver; you will need the latest master.

  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.

I followed the Arch wiki guidelines for the setup, and it works for me on KDE.

As I said, those are settings for Applications. cosmic-comp uses for example the egl-device and egl-gbm platforms (not the wayland platform as it by itself isn't a wayland-client) and thus these settings don't need to be set for compositors (just for the applications running on it).
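As a rough illustration, a check like the following could warn when those client-side variables are present in the environment the compositor is started from. The function name `client_only_vars_set` is hypothetical, and the list contains only the three variables from the original report that match the EGL, GBM and _GLX advice.

```rust
/// Variables from the original report that are meant for client applications.
const CLIENT_ONLY_VARS: [&str; 3] = [
    "EGL_PLATFORM",
    "GBM_BACKEND",
    "__GLX_VENDOR_LIBRARY_NAME",
];

/// Return "NAME=value" for every client-only variable found in `env`.
fn client_only_vars_set<I>(env: I) -> Vec<String>
where
    I: IntoIterator<Item = (String, String)>,
{
    env.into_iter()
        .filter(|(key, _)| CLIENT_ONLY_VARS.contains(&key.as_str()))
        .map(|(key, value)| format!("{key}={value}"))
        .collect()
}

fn main() {
    // Inspect the current process environment and report offenders on stderr.
    for assignment in client_only_vars_set(std::env::vars()) {
        eprintln!("warning: {assignment} is set in the compositor's environment");
    }
}
```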

I'm surprised by your answer. Do you mean cosmic-comp supports EGLStream? I never thought it would work on Wayland, but I can switch to EGLStream and test again.

No, we don't use EGLstreams, which is also why you have to run with nvidia-drm.modeset=1 and the egl-gbm library installed.

  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?

Unable to become drm master, assuming unprivileged mode is interesting..

[EGL] 0x300c (BAD_PARAMETER) eglQueryDmaBufModifiersEXT: EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format

I think that is an nvidia driver bug

Not interesting at all, these are normal on nvidia and don't cause any issues. Can you additionally set RUST_LOG=info please and re-run? The only interesting error is Error rendering, but it is sadly lacking some info.

@skygrango
Contributor Author

Thank you for your detailed explanation !

ls -l /dev/dri/by-path
lrwxrwxrwx 1 root root  8 11月 14 11:52 pci-0000:01:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 11月 14 11:52 pci-0000:01:00.0-render -> ../renderD128
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

I only have one render node, which is good.

I updated all submodules again:

cosmic-epoch-git r103.6c000aa-1

log : https://gist.githubusercontent.com/skygrango/0b2bea3b050852bb6e7b56e236a60c28/raw/dc290cb58fa34eca64459e6b47a17ba5884d1042/cosmic-comp-no-env-info-3.log

@skygrango
Contributor Author

It shows "New screen configuration invalid!:". What can I do about this?

@Drakulix
Member

11月 14 11:48:16 cosmic-comp[1516]: thread 'main' panicked at 'Malformed config file: SpannedError { code: MissingStructField { field: "data_control_enabled", outer: Some("StaticConfig") }, position: Position { line: 85, col: 1 } }': src/config/mod.rs:184

Your config file is outdated. Please grab the latest one from master: https://github.com/pop-os/cosmic-comp/raw/master_jammy/config.ron

It shows "New screen configuration invalid!:". What can I do about this?

Now we are getting somewhere! That this is an atomic configuration error and that we have the configuration cosmic is trying to set is really helpful to narrow this down.

I just need one additional piece of information. Can you run drm_info on any working setup (e.g. KDE) and post the result? It should be on the aur.

@skygrango
Contributor Author

OK !

here is my drm_info : https://gist.github.com/skygrango/2d6ebb4bbcd23fac3600a7eeff9dc094

If you need me to test again, please let me know

@Drakulix
Member

Drakulix commented Nov 15, 2023

Ok great, so let me give you a quick rundown of what happens.

We are building an atomic request to set up the screen via the KMS API with a bunch of parameters (so-called properties).
Building up this list happens in smithay in this function: https://github.com/Smithay/smithay/blob/master/src/backend/drm/surface/atomic.rs#L683

We take a bunch of properties as a given, because they are mandatory by the spec. So we can safely ignore those, and in fact we can see that they are set to sensible values in the log:

AtomicModeReq {
    objects: [
        35,
        41,
        85,
    ],
    count_props_per_object: [
        12,
        2,
        1,
    ],
    props: [
        ...
    ],
    values: [
        0,
        0,
        167772160,
        94371840,
        0,
        0,
        2560,
        1440,
        103,
        79,
        41,
        1,
        1,
        98,
        41,
    ],
}

Object 85 is your DisplayPort connector, which gets one property, so it's the last one in the list: "41". That is the ID of the CRTC (the value of the CRTC_ID property), which is the second object we will be looking at.

Object 41 is the CRTC and it gets two properties. The property ACTIVE is set to 1 and the MODE_ID is set to 98. (The latter isn't really important and differs from the value KDE is setting in your drm-log, because it is a pointer. They likely point to the same 2560x1440 mode. We don't see the data in the log, but I am pretty certain that this is correct.)

Which leaves us with Object 35, which is Plane 0. A bunch of these values are pretty obvious, e.g. the first 8 are SRC_X, SRC_Y, SRC_W, SRC_H, CRTC_X, CRTC_Y, CRTC_W, CRTC_H. The 41 points the plane to our CRTC, so that is again CRTC_ID, leaving us with 103, 79 and 1.

Looking at smithay's code and drm_info, only three possible candidates remain: rotation, FB_ID and IN_FENCE_FD, because the plane has no other properties that smithay is setting.

Rotation is easy, that is the 1, as Plane 0 just accepts a single value here. FB_ID could be either and is again a pointer, IN_FENCE_FD is a file descriptor and could also be either. But the bad thing here is, that IN_FENCE_FD should never be set, because although the driver exposes this property, it doesn't support any other values than -1 (or unset), because it is lacking the capability DRM_CAP_SYNCOBJ (as seen at the top of your drm_info log).

Which is why the atomic request is rejected by the driver and cosmic fails to put anything on the screen.
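The reading of the log above can be reproduced with a short Rust sketch. The helper names `split_by_object` and `fixed_16_16_to_pixels` are made up for illustration and are not smithay functions. The one fact added here is that KMS expresses the SRC_* plane properties in 16.16 fixed point, which is why SRC_W and SRC_H look so large next to CRTC_W and CRTC_H.

```rust
/// Split the flat `values` list into one slice per object, using the
/// per-object property counts. Panics if the counts exceed the value list.
fn split_by_object<'a>(
    objects: &[u32],
    counts: &[usize],
    values: &'a [u64],
) -> Vec<(u32, &'a [u64])> {
    let mut rest = values;
    let mut out = Vec::new();
    for (&object, &count) in objects.iter().zip(counts) {
        let (head, tail) = rest.split_at(count);
        out.push((object, head));
        rest = tail;
    }
    out
}

/// SRC_* values are 16.16 fixed point; the upper bits hold whole pixels.
fn fixed_16_16_to_pixels(value: u64) -> u64 {
    value >> 16
}

fn main() {
    // Numbers copied from the AtomicModeReq in the log.
    let objects: [u32; 3] = [35, 41, 85];
    let counts: [usize; 3] = [12, 2, 1];
    let values: [u64; 15] = [
        0, 0, 167772160, 94371840, 0, 0, 2560, 1440, 103, 79, 41, 1, 1, 98, 41,
    ];

    for (object, props) in split_by_object(&objects, &counts, &values) {
        println!("object {object}: {props:?}");
    }
    println!(
        "SRC_W = {} px, SRC_H = {} px",
        fixed_16_16_to_pixels(values[2]),
        fixed_16_16_to_pixels(values[3])
    );
}
```

Object 35 receives the first twelve values, object 41 the next two (1 and 98), and object 85 the final 41, matching the walkthrough.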

Now onto the weird part: I fixed this issue with the nvidia 545 driver weeks ago in smithay: Smithay/smithay@dfa75ea

So it should look for the syncobj capability, figure out fencing is not supported and never try to send a value to the driver. And on my systems, that works, somehow on yours we still end up with a value here.
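The intended behaviour can be sketched as a small gate in Rust. This is not the code from the smithay commit; `PlaneSupport` and `in_fence_fd_value` are hypothetical names used only to show the idea that a fence is sent only when both the property and the syncobj capability are present.

```rust
/// Hypothetical summary of what the driver reports for a plane.
/// These names are illustrative and are not smithay types.
struct PlaneSupport {
    /// The plane exposes an IN_FENCE_FD property.
    has_in_fence_fd: bool,
    /// The device reports the DRM_CAP_SYNCOBJ capability.
    has_syncobj: bool,
}

/// Decide which value, if any, goes into the atomic request for IN_FENCE_FD.
/// `None` means the property is left out of the request entirely.
fn in_fence_fd_value(support: &PlaneSupport, fence_fd: Option<i32>) -> Option<i64> {
    if support.has_in_fence_fd && support.has_syncobj {
        fence_fd.map(i64::from)
    } else {
        None
    }
}

fn main() {
    // The situation described in this thread: property exposed, capability missing.
    let nvidia_545 = PlaneSupport { has_in_fence_fd: true, has_syncobj: false };
    // The fence fd 7 is an arbitrary example value.
    println!("{:?}", in_fence_fd_value(&nvidia_545, Some(7))); // prints None
}
```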

Which leaves two options:

  1. Either you are running an outdated version of cosmic
  2. It somehow still ends up with this value

So first off, are you sure you are running the right version?
cosmic-epoch-git r103.6c000aa-1

This seems suspicious to me, as the AUR package lists r99-4a6621a-1.

Also 6c000aa doesn't even resolve to a known commit of that repository: pop-os/cosmic-epoch@6c000aa

If you did update the submodules locally, note that just changing their commits doesn't check out the new state automatically.

4a6621a does, but is indeed too old. I'll update the cosmic-epoch repository to fix that.

If it turns out to be option 2, how is your rust experience? Could I ask you to debug this with a few more hints? Or would it be better, if I just clutter the log with more details to hopefully figure out remotely, how we end up in this state?

@Drakulix
Member

cosmic-epoch updated.

@skygrango
Contributor Author

skygrango commented Nov 15, 2023

  1. here is my fork : https://github.com/skygrango/cosmic-epoch
    I can rebase and update the submodules again. Can you help me check?

  2. I have some experience in Rust development, but I'm not familiar with DRM.
    What can I do for you? Change the log level? Add some debug prints? You may have to tell me where I should insert a print.

@skygrango
Contributor Author

I rebased my fork here : https://github.com/skygrango/cosmic-epoch/commits/master

and left the old one (cosmic-epoch-git r103.6c000aa-1) here for you to check : https://github.com/skygrango/cosmic-epoch/commits/master_old

@Drakulix
Member

I rebased my fork here : https://github.com/skygrango/cosmic-epoch/commits/master

and left the old one (cosmic-epoch-git r103.6c000aa-1) here for you to check : https://github.com/skygrango/cosmic-epoch/commits/master_old

They both look fine, the question is how are you building that? With the AUR package? Or by manually building? If it's the latter, you need to make sure to not just git pull, but also update your submodules with git submodule update --init --recursive

@skygrango
Contributor Author

skygrango commented Nov 15, 2023

I cloned the AUR package, modified the PKGBUILD to point to my fork, then ran makepkg. Done.

I use git submodule update --remote to update the submodules; this seems to work well :)

I will modify the AUR package tomorrow so that the new submodules can be compiled.

I left the drm_info from my 7900xtx here, where it works : https://gist.github.com/skygrango/168a042c39b8a1740bf93507290375be

@skygrango
Contributor Author

hey, I found that in https://github.com/pop-os/cosmic-comp/blob/master_jammy/Cargo.toml

[dependencies.smithay]
version = "0.3"
git = "https://github.com/smithay/smithay.git"
rev = "74ef59a3f"

maybe we just need to update since this is older than the fix you mentioned Smithay/smithay@dfa75ea

@skygrango
Contributor Author

oh sorry, just found that

[patch."https://github.com/Smithay/smithay.git"]
smithay = { git = "https://github.com/smithay//smithay", rev = "d5b352b" }

@andyczerwonka

I logged alacritty/alacritty#7372 and obsproject/obs-studio#9870 this morning when the new 545 driver came through. I reverted back to 535 and both are now back to a working state.

@skygrango
Contributor Author

I think so. It's an nvidia problem, even though KDE still works.

@Drakulix maybe instead of wasting your energy, let's close this issue first ?

If you still want to know some error messages, I can still help provide information

@skygrango
Contributor Author

I saw this commit : elFarto/nvidia-vaapi-driver@9888709

nvidia made stupid design changes ...

//NVIDIA driver v545.29.02 changed the devInfo struct, and partly broke it in the process
//...who adds a field to the middle of an existing struct....

@Drakulix
Member

I think so. It's an nvidia problem, even though KDE still works.

It's a problem specific to the nvidia driver, but not a problem of the driver. smithay sends a fence when it shouldn't, but I am not convinced yet that you are indeed using a recent enough build of cosmic.

@Drakulix maybe instead of wasting your energy, let's close this issue first ?

Feel free to close this issue at any time, I am just trying to help you with your problem.

If you still want to know some error messages, I can still help provide information

Sure, let's do that.

Try changing this line please to surface.surface = Some(dbg!(target)); and make a debug build of cosmic-comp (cargo build, not cargo build --release !). Then try that and please post the logs again. :)
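For readers unfamiliar with `dbg!`: the macro prints the file, line, expression and value to stderr and then returns the value, so it can be wrapped around an expression without changing the assignment. A minimal sketch with made-up `Target` and `Surface` types (the real types in cosmic-comp differ):

```rust
/// Hypothetical stand-in for whatever is being assigned to the surface.
#[derive(Debug, Clone, PartialEq)]
struct Target {
    width: u32,
    height: u32,
}

/// Hypothetical stand-in for the struct holding the surface.
struct Surface {
    surface: Option<Target>,
}

/// Store the target, logging it to stderr on the way through.
fn assign(target: Target) -> Surface {
    let mut surface = Surface { surface: None };
    // dbg! takes ownership of `target`, prints it, and hands it back.
    surface.surface = Some(dbg!(target));
    surface
}

fn main() {
    let surface = assign(Target { width: 2560, height: 1440 });
    assert!(surface.surface.is_some());
}
```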

I saw this commit : elFarto/nvidia-vaapi-driver@9888709

//NVIDIA driver v545.29.02 changed the devInfo struct, and partly broke it in the process //...who adds a field to the middle of an existing struct....

nvidia-vaapi-driver is directly using the unstable nvapi, so there is no "stupid" decision here, they never committed to a stable api in the first place. So changes like these for the 545 driver are absolutely expected.
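Why a field added in the middle of a struct breaks consumers of an unstable interface can be shown with a small Rust sketch. The structs below are invented for illustration and do not reflect the real devInfo layout; the point is only that every field after the insertion moves to a new offset.

```rust
/// Hypothetical "old" layout; the field names are made up.
#[repr(C)]
#[derive(Default)]
struct InfoOld {
    first: u32,
    last: u32,
}

/// Hypothetical "new" layout with a field inserted in the middle.
#[repr(C)]
#[derive(Default)]
struct InfoNew {
    first: u32,
    inserted: u32,
    last: u32,
}

/// Byte offset of `last` in the old layout.
fn offset_of_last_old() -> usize {
    let v = InfoOld::default();
    (&v.last as *const u32 as usize) - (&v as *const InfoOld as usize)
}

/// Byte offset of `last` in the new layout.
fn offset_of_last_new() -> usize {
    let v = InfoNew::default();
    (&v.last as *const u32 as usize) - (&v as *const InfoNew as usize)
}

fn main() {
    // Code compiled against the old layout would read `last` at the wrong place.
    println!("old offset of `last`: {}", offset_of_last_old());
    println!("new offset of `last`: {}", offset_of_last_new());
}
```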

@skygrango
Contributor Author

I made a fork of cosmic-comp

log : https://gist.github.com/skygrango/e183a2f1b386a9c7d5a4ac1dd06cb184

@skygrango
Contributor Author

I tried running cosmic-comp from a tty

log: https://gist.github.com/skygrango/2260f78894ed260bacfb2c2deff92a25

@Drakulix
Member

I tried running cosmic-comp from a tty

log: https://gist.github.com/skygrango/2260f78894ed260bacfb2c2deff92a25

Looks completely fine. Seems like you let it run for 5 seconds, before switching tty again.

@skygrango
Contributor Author

my mouse can't move, what could be the reason?

@skygrango
Contributor Author

skygrango commented Nov 20, 2023

The situation is not good: the desktop is slow to show up.
It can take 30 seconds for the desktop to appear, and I can't move my mouse.
Even switching to a different tty takes more than 10 seconds. What should I do to improve it?
Any suggestions for environment variables?

The previous driver version, 535, did not have such slowness, and I could use the mouse normally.

@Drakulix
Member

The situation is not good: the desktop is slow to show up. It can take 30 seconds for the desktop to appear, and I can't move my mouse. Even switching to a different tty takes more than 10 seconds. What should I do to improve it? Any suggestions for environment variables?

No environment variables. I honestly have no idea, as you don't have any errors in your log and I don't have a machine that replicates this issue.

The previous driver version, 535, did not have such slowness, and I could use the mouse normally.

I would suggest downgrading for the time being in that case. Possibly open an issue with nvidia, I would hope future updates will fix this on your system.

@skygrango
Contributor Author

That sounds very reasonable, let's move on.
Thank you for your support!

@Drakulix
Member

Drakulix commented Nov 21, 2023

That sounds very reasonable, let's move on. Thank you for your support!

Thank you for being so patient with this bug.

There are other reports of problems around the new synchronization mechanism of the 545 driver. I am hopeful that later versions will resolve this, but feel free to re-open once the next driver version lands if this is still not fixed.

@skygrango
Contributor Author

skygrango commented Nov 24, 2023

I updated cosmic-comp and tried the new nvidia 545.29.06 driver.

log : https://gist.github.com/skygrango/a14fb376ca51be273bef8000a481b99a

It shows:

Compositor bug: Server ignored ImportNotifier for ZwpLinuxBufferParamsV
 { id: ObjectId(zwp_linux_buffer_params_v1@51), version: 4, data: Some(Any { .. }),
handle: WeakHandle { handle: WeakInnerHandle[sys] { .. } } }

The 545.29.06 driver does not work properly either.
If there is no useful information, we can close the issue again.

skygrango reopened this Nov 24, 2023
skygrango changed the title from "nvidia 545.29.02 broken" to "nvidia 545.29.06 broken" Nov 24, 2023
@Drakulix
Member

log : https://gist.github.com/skygrango/a14fb376ca51be273bef8000a481b99a

Not a debug log, but the error is again "Error rendering", which hints at the same drm/fence issue as the previous driver version... :/

@skygrango
Contributor Author

I'm sorry for forgetting to change log level

here is new one : https://gist.github.com/skygrango/2256086a36e3ee6c7e5deb4b206bdd81

started from tty : https://gist.githubusercontent.com/skygrango/b47770839bb1dfc3b187c802679eb9a7/raw/1f73599355d27c2d03f9d1ee1bd532d70188ffcd/cosmic-comp-dbg-tty.log

tty log has DrmCompositor info if you need it

@Drakulix
Member

I'm sorry for forgetting to change log level

here is new one : https://gist.github.com/skygrango/2256086a36e3ee6c7e5deb4b206bdd81

started from tty : https://gist.githubusercontent.com/skygrango/b47770839bb1dfc3b187c802679eb9a7/raw/1f73599355d27c2d03f9d1ee1bd532d70188ffcd/cosmic-comp-dbg-tty.log

tty log has DrmCompositor info if you need it

Both logs look perfectly fine, not even a rendering error, all good until the tty-switch. What results were you seeing exactly here? Still a rendered, but otherwise unresponsive desktop?

@skygrango
Contributor Author

Both logs look perfectly fine, not even a rendering error, all good until the tty-switch. What results were you seeing exactly here? Still a rendered, but otherwise unresponsive desktop?

Yes, but I probably need to make a slight correction: the mouse is responsive, but may only move once every 30 seconds. :)

@skygrango
Contributor Author

we should wait for the next nvidia driver update

@skygrango
Contributor Author

nvidia 550.54.14 works!
