
Linux sets incorrect maximum thread count when memory hotplug is enabled #8960

Closed
noskb opened this issue Feb 22, 2024 · 50 comments · Fixed by QubesOS/qubes-core-agent-linux#532
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: core C: kernel diagnosed Technical diagnosis has been performed (see issue comments). P: major Priority: major. Between "default" and "critical" in severity. pr submitted A pull request has been submitted for this issue. r4.2-vm-bookworm-cur-test r4.2-vm-fc39-cur-test r4.2-vm-fc40-cur-test r4.2-vm-fc41-cur-test r4.2-vm-trixie-cur-test r4.3-vm-bookworm-cur-test r4.3-vm-fc39-cur-test r4.3-vm-fc40-cur-test r4.3-vm-fc41-cur-test r4.3-vm-trixie-cur-test regression A bug in which a supported feature that worked in a prior Qubes OS release has stopped working. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@noskb

noskb commented Feb 22, 2024

How to file a helpful issue

Qubes OS release

r4.2

Brief summary

As the title says. In my case, Firefox crashes when opening more than 30 tabs from a bookmark folder at once. If memory-hotplug is disabled, this does not occur.

The following message appears in dmesg:

[  126.238281] Sandbox Forked[4155]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 2 (core 2, socket 0)
[  126.238317] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  126.366966] Sandbox Forked[4159]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 2 (core 2, socket 0)
[  126.367021] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  130.989734] clipped [mem 0xfee01000-0xffdfffff] to [mem 0xff000000-0xffdfffff] for e820 entry [mem 0xfeff8000-0xfeffffff]
[  131.983851] Sandbox Forked[4216]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 2 (core 2, socket 0)
[  131.983891] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  132.665206] Sandbox Forked[4224]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 0 (core 0, socket 0)
[  132.665263] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  133.071101] Sandbox Forked[4227]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 3 (core 3, socket 0)
[  133.071149] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  134.368914] Sandbox Forked[4230]: segfault at 0 ip 00007d7ce21c5dc6 sp 00007d7cd4e83400 error 6 in libxul.so[7d7ce2077000+5e99000] likely on CPU 1 (core 1, socket 0)
[  134.368949] Code: d4 05 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b e8 dd 28 eb ff 48 8b 05 26 c8 ae 08 48 8d 15 9f ff d4 05 48 89 10 31 c0 <89> 04 25 00 00 00 00 0f 0b 90 8b 47 08 85 c0 75 09 e9 24 62 eb ff
[  134.608815] clipped [mem 0xfee01000-0xffdfffff] to [mem 0xff000000-0xffdfffff] for e820 entry [mem 0xfeff8000-0xfeffffff]

Steps to reproduce

In r4.2 with the latest updates, the memory hotplug feature is enabled by default, so no additional configuration is needed.

Create an AppVM with sufficient RAM by running the following in a dom0 terminal:

qvm-create ff-crash -l red --prop memory=800 --prop maxmem=8000

Then run the following in a ff-crash terminal:

firefox -- google.com facebook.com youtube.com baidu.com yahoo.com amazon.com wikipedia.org qq.com twitter.com slashdot.org google.co.in taobao.com live.com sina.com.cn yahoo.co.jp linkedin.com weibo.com ebay.com google.co.jp yandex.ru bing.com vk.com hao123.com google.de instagram.com t.co msn.com amazon.co.jp tmall.com google.co.uk pinterest.com ask.com reddit.com wordpress.com mail.ru google.fr blogspot.com paypal.com onclickads.net google.com.br

To disable memory-hotplug, run the following in dom0, then restart ff-crash:
qvm-features ff-crash memory-hotplug ''

Expected behavior

No segfaults occur.

Actual behavior

Firefox crashes.

@noskb noskb added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Feb 22, 2024
@DemiMarie

Ouch.

Can you provide a way to reproduce this with a completely fresh AppVM?

@renehoj

renehoj commented Feb 22, 2024

I have the same issue. It started happening recently, maybe a week ago; I'm having multiple Firefox crashes daily.

Isolated Web Co[1205]: segfault at 1a39be38a0d8 ip 00001a39be38a0d8 sp 00007ffc3eea83b8 error 15 likely on CPU 2 (core 2, socket 0)
Code: fe ff 00 0d 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b 00 00 e8 91 38 be 39 1a fe ff e8 91 38 be 39 1a fe ff <c0> 6f 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b

It happens with a single tab open, when streaming video or using JS heavy sites, and it seems to happen randomly.

@DemiMarie

@renehoj Are you using the Fedora Firefox package? I suspect this is a Fedora packaging bug.

@renehoj

renehoj commented Feb 22, 2024

No, I'm using Debian 12 minimal with Firefox ESR. I tried giving my browser qubes 6 GB of memory; it didn't stop the crashes.

@andrewdavidwong andrewdavidwong added C: other needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. affects-4.2 This issue affects Qubes OS 4.2. labels Feb 22, 2024
@noskb
Author

noskb commented Feb 22, 2024

> Ouch.
>
> Can you provide a way to reproduce this with a completely fresh AppVM?

I updated the steps to reproduce section.

@noskb
Author

noskb commented Feb 22, 2024

> I have the same issue. It started happening recently, maybe a week ago; I'm having multiple Firefox crashes daily.
>
> Isolated Web Co[1205]: segfault at 1a39be38a0d8 ip 00001a39be38a0d8 sp 00007ffc3eea83b8 error 15 likely on CPU 2 (core 2, socket 0)
> Code: fe ff 00 0d 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b 00 00 e8 91 38 be 39 1a fe ff e8 91 38 be 39 1a fe ff <c0> 6f 3c be 39 1a 00 00 90 d0 59 1a 92 7b 00 00 c0 d1 59 1a 92 7b
>
> It happens with a single tab open, when streaming video or using JS-heavy sites, and it seems to happen randomly.

I too noticed it at first with random crashes. I then recreated Firefox profiles, tried older versions and Flatpaks, and suspected hardware failure, but what I finally ended up with was memory hotplug.

Disabling memory hotplug fixes the problem completely, and I can reproduce the crash on another laptop with R4.2 installed, which is why I'm reporting it here.

@DemiMarie

Could this be due to a memory allocation failure?

@renehoj

renehoj commented Feb 22, 2024

@noskb Is disabling hotplug the same as memory balancing?

Your test passes on my system with 8 GB initial memory and balancing enabled, but fails with low values like 800 MB.

@noskb
Author

noskb commented Feb 22, 2024

@renehoj Does that mean that even with the memory hotplug feature disabled, it still fails if the initial memory value is low?

@renehoj

renehoj commented Feb 22, 2024

No, disabling hotplug also solves the issue with low initial memory settings; the system seems fully stable with the feature turned off.

I just didn't know whether memory hotplug and memory balancing were doing something similar. Turning off memory balancing and/or increasing the initial memory also seems to improve stability when running your test.

@renehoj

renehoj commented Feb 25, 2024

I'm still having crashes, even with qvm-features memory-hotplug ''

During the weekend, I've had Firefox fully crash 3 times, not just a single tab.

It doesn't leave any information in the logs except for

Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::GenerateDump cloned child 2312
Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::WaitForContinueSignal waiting for continue signal...

@noskb
Author

noskb commented Feb 25, 2024

> I'm still having crashes, even with qvm-features memory-hotplug ''
>
> During the weekend, I've had Firefox fully crash 3 times, not just a single tab.
>
> It doesn't leave any information in the logs except for
>
> Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
> Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::GenerateDump cloned child 2312
> Feb 25 09:49:08 browser-streaming qubes.StartApp+firefox-esr-dom0[882]: ExceptionHandler::WaitForContinueSignal waiting for continue signal...

Even with memory balancing disabled and memory allocated statically to the AppVM, does Firefox still crash during normal use? If so, the most likely cause is a Firefox problem or a hardware failure.

@renehoj

renehoj commented Feb 25, 2024

I only had memory-hotplug disabled, now I'm trying with memory balancing disabled as well.

My guess is that it started after an update this month. I didn't use to have any issues with Firefox, and suddenly it became noticeably unstable. The problem is specific to Firefox; no other application is crashing, but it could have started after updates to the Linux kernel or Xen.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Feb 25, 2024
@renehoj

renehoj commented Mar 2, 2024

Disabling both memory-hotplug and memory balancing didn't stop the crashes.

I ended up downgrading the kernel in my browser qubes to 6.6.2-1, and now the crashes seem to have stopped.

@DemiMarie

@renehoj Okay, so a kernel regression.

Can you (in a test standalone VM) try doing a kernel bisection to see which upstream commit broke things?

@andrewdavidwong andrewdavidwong added C: kernel regression A bug in which a supported feature that worked in a prior Qubes OS release has stopped working. and removed C: other labels Mar 2, 2024
@renehoj

renehoj commented Mar 3, 2024

@DemiMarie I spoke too soon, I just had libxul.so crash again.

Changing the kernel, just like disabling memory-hotplug, allows the browser to pass noskb's test, but it doesn't stop the crashes.

@DemiMarie

@renehoj Ouch.

The usual advice for this kind of problem is “record an rr trace” but that:

  1. Is insecure (it breaks the Firefox sandbox), so I don’t recommend it outside of a disposable VM, if at all.
  2. Does not work on Xen (in any VM) because vPMU is unsupported.

@noskb
Author

noskb commented May 27, 2024

The cause seems to be that the domU kernel detects only the initial memory, instead of maxmem, when memory hotplug is enabled.

A domU with an initial memory of 800 and max memory of 8000:

hotplug enabled

[    0.242574] Memory: 731012K/818812K available (18432K kernel code, 3241K rwdata, 8924K rodata, 5132K init, 6172K bss, 87544K reserved, 0K cma-reserved)

hotplug disabled

[    4.565233] Memory: 7974216K/8191612K available (18432K kernel code, 3241K rwdata, 8924K rodata, 5132K init, 6172K bss, 217140K reserved, 0K cma-reserved)

This makes a difference in kernel parameters whose default values are calculated from the amount of memory.

Comparison of kernel parameters with hotplug enabled and disabled
[user@disp8758]$ diff -u0 hotplug_enabled.txt hotplug_disabled.txt 
--- hotplug_enabled.txt	2024-05-25 07:38:59.305651095 +0000
+++ hotplug_disabled.txt	2024-05-25 07:40:30.884363729 +0000
@@ -236 +236 @@
-fs.epoll.max_user_watches = 162793
+fs.epoll.max_user_watches = 1775193
@@ -239 +239 @@
-fs.fanotify.max_user_marks = 8192
+fs.fanotify.max_user_marks = 64751
@@ -246 +246 @@
-fs.inotify.max_user_watches = 8192
+fs.inotify.max_user_watches = 60757
@@ -392 +392 @@
-kernel.threads-max = 5713
+kernel.threads-max = 62300
@@ -1191 +1191 @@
-user.max_cgroup_namespaces = 2856
+user.max_cgroup_namespaces = 31150
@@ -1193 +1193 @@
-user.max_fanotify_marks = 8192
+user.max_fanotify_marks = 64751
@@ -1195,8 +1195,8 @@
-user.max_inotify_watches = 8192
-user.max_ipc_namespaces = 2856
-user.max_mnt_namespaces = 2856
-user.max_net_namespaces = 2856
-user.max_pid_namespaces = 2856
-user.max_time_namespaces = 2856
-user.max_user_namespaces = 2856
-user.max_uts_namespaces = 2856
+user.max_inotify_watches = 60757
+user.max_ipc_namespaces = 31150
+user.max_mnt_namespaces = 31150
+user.max_net_namespaces = 31150
+user.max_pid_namespaces = 31150
+user.max_time_namespaces = 31150
+user.max_user_namespaces = 31150
+user.max_uts_namespaces = 31150
@@ -1223 +1223 @@
-vm.min_free_kbytes = 57252
+vm.min_free_kbytes = 67584
@@ -1246 +1246 @@
-vm.user_reserve_kbytes = 17578
+vm.user_reserve_kbytes = 131072

It seems that the low value of kernel.threads-max when hotplug is enabled causes resource exhaustion, and Firefox crashes.
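
The threads-max values above line up with how the kernel sizes this limit: at boot, fork_init()/set_max_threads() in kernel/fork.c derives the default from the memory present, roughly memory / (8 × THREAD_SIZE). A minimal Python sketch of that calculation, assuming x86_64 constants (4 KiB pages, 16 KiB kernel stacks), reproduces both observed values from the "Memory: ... available" lines:

```python
# Approximation of the kernel's default threads-max calculation
# (set_max_threads() in kernel/fork.c); constants assume x86_64.
THREAD_SIZE = 16 * 1024   # kernel stack size per thread on x86_64, bytes

def approx_threads_max(available_kib: int) -> int:
    """Estimate kernel.threads-max from available memory in KiB."""
    return available_kib * 1024 // (THREAD_SIZE * 8)

# Hotplug enabled: the kernel sees only the ~800 MB initial allocation
print(approx_threads_max(731012))   # -> 5711, close to the observed 5713
# Hotplug disabled: the kernel sees the full ~8000 MB maxmem
print(approx_threads_max(7974216))  # -> 62298, close to the observed 62300
```

The tenfold gap between the two results matches the diff above, which supports the diagnosis that the limit is computed from the initial allocation rather than maxmem.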

@dylangerdaly

dylangerdaly commented Jul 11, 2024

Whoops, I initially blamed this on Mozilla and ended up migrating to a Chromium-based browser. Was there an actual fix in the end?

@marmarek
Member

And not fs.epoll.max_user_watches

marmarek added a commit to marmarek/qubes-core-agent-linux that referenced this issue Nov 1, 2024
When a VM is started with memory hotplug, the initial memory size is quite small. It is used for calculating the default thread limit, and that in turn is used to calculate the default per-user process count limit. For a VM started with 400MB (the default), both limits are too low for some thread/process-heavy applications like Firefox.

Adjust the limits to a static higher value, based on the defaults when memory hotplug is disabled (and rounded to a nice number).

Fixes QubesOS/qubes-issues#8960
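
The approach in that commit amounts to pinning the memory-derived limits statically instead of letting the kernel compute them from the small initial allocation. A sysctl.d fragment along these lines illustrates the idea; the file name and values here are hypothetical, not the actual change (see QubesOS/qubes-core-agent-linux#532 for the real one):

```ini
# /usr/lib/sysctl.d/30-qubes-limits.conf -- illustrative name and value only
# Pin the thread limit to roughly what a hotplug-disabled boot would yield,
# instead of the ~5700 derived from a 400-800 MB initial allocation.
kernel.threads-max = 60000
```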
@qubesos-bot

Automated announcement from builder-github

The component core-agent-linux (including package core-agent-linux) has been pushed to the r4.3 testing repository for the Fedora template.
To test this update, please install it with the following command:

sudo dnf update --enablerepo=qubes-vm-r4.3-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package core-agent-linux has been pushed to the r4.3 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing bookworm-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package core-agent-linux has been pushed to the r4.3 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing trixie-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@andrewdavidwong andrewdavidwong added C: core pr submitted A pull request has been submitted for this issue. labels Nov 5, 2024
marmarek added a commit to QubesOS/qubes-core-agent-linux that referenced this issue Nov 5, 2024
When a VM is started with memory hotplug, the initial memory size is quite small. It is used for calculating the default thread limit, and that in turn is used to calculate the default per-user process count limit. For a VM started with 400MB (the default), both limits are too low for some thread/process-heavy applications like Firefox.

Adjust the limits to a static higher value, based on the defaults when memory hotplug is disabled (and rounded to a nice number).

Fixes QubesOS/qubes-issues#8960

(cherry picked from commit cefe875)
@qubesos-bot

Automated announcement from builder-github

The package core-agent-linux has been pushed to the r4.2 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing trixie-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component core-agent-linux (including package core-agent-linux) has been pushed to the r4.2 testing repository for the Fedora template.
To test this update, please install it with the following command:

sudo dnf update --enablerepo=qubes-vm-r4.2-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package core-agent-linux has been pushed to the r4.2 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing bookworm-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update
