"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871
Comments
https://elixir.bootlin.com/linux/v6.11.6/source/kernel/signal.c#L2486 It looks like when ptracing with `PTRACE_SEIZE` (which rr does to its children), SIGSTOP generates a ptrace stop on each child. So I think when handling SIGSTOP, we should wait on every thread for the child so that rr's state syncs up with the kernel's ptrace-stop state. I'll try and whip up a patch for that...
Ah. We don't actually allow the SIGSTOP to be delivered. Instead we emulate the group-stop completely in the scheduler. So that is not relevant.
Well, making some progress. This is a reproduction that crashes with the assertion failure 100% of the time, without chaos mode: https://gist.github.com/KJTsanaktsidis/5fa224d043f0e10d365fa01f2f4b2519. This should form the basis of a good test case, at least. Run with
Output: https://gist.github.com/KJTsanaktsidis/77fd4bf4b35cf7c59da5423917758211 As a side note, if you run the repro on a single core by itself, like so:
you see that the kernel will happily schedule the thread to resume before the thread-group-leader after SIGCONT. So the order of rr scheduling the second thread before the thread-group-leader is perfectly valid, and so "When a process is in an emulated GROUP_STOP, make all of the threads other than the thread-group-leader ineligible for scheduling" is not the correct way to fix this.
Why not just do the naive thing and only call

It makes sense to me that a) threads in blocked syscalls aren't really stopped when we emulate SIGSTOP, and b) therefore, when we emulate SIGCONT, we need to not resume those threads.
Yup, this actually does work. I was a bit hesitant to propose that because I didn't understand why it worked, but I think I've figured that out and wrote it down in the commit message: #3874. Thanks as ever for your help!
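To make the reasoning in the last two comments concrete, here is a toy model of it. This is only a sketch and not rr's code: `ToyTask`, `emulate_sigstop`, `resume` and `emulate_sigcont` are hypothetical names invented for illustration, and the real change is the one in #3874. The model assumes that a thread blocked in a syscall is never ptrace-stopped by the emulated SIGSTOP, so the emulated SIGCONT must skip it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model of a traced thread; not rr's Task class.
struct ToyTask {
  bool ptrace_stopped;  // the tracer has observed a stop for this thread
  bool in_group_stop;   // emulated group-stop flag kept by the scheduler
  bool resumed;         // set once the tracer has resumed the thread
};

// Emulated SIGSTOP: flag every thread as group-stopped. A thread that is
// blocked in a syscall (ptrace_stopped == false) stays blocked in the kernel.
void emulate_sigstop(std::vector<ToyTask>& threads) {
  for (auto& t : threads) {
    t.in_group_stop = true;
  }
}

// Resuming is only legal from a ptrace-stop. This mirrors the kind of
// invariant that the assertion in the issue title checks.
void resume(ToyTask& t) {
  assert(t.ptrace_stopped);
  t.ptrace_stopped = false;
  t.resumed = true;
}

// Emulated SIGCONT with the guard: clear the flag on every thread, but only
// resume threads that are actually ptrace-stopped. Returns how many were resumed.
int emulate_sigcont(std::vector<ToyTask>& threads) {
  int resumed_count = 0;
  for (auto& t : threads) {
    t.in_group_stop = false;
    if (t.ptrace_stopped) {
      resume(t);
      ++resumed_count;
    }
  }
  return resumed_count;
}
```

With one ptrace-stopped main thread and one thread blocked in `pause()`, `emulate_sigcont` resumes only the main thread. Without the `if (t.ptrace_stopped)` guard, the call to `resume` on the blocked thread would trip the assertion.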
This test program:
when recorded under chaos mode, sometimes produces the following assertion failure during recording (about 30% of the time on my computer)
n.b. - I know the above program will have nondeterministic behaviour under chaos mode - the SIGCONT might arrive before the SIGSTOP. I reduced this testcase out of a (buggy) test from the Ruby test suite. But, during recording it should either exit or hang forever - not crash with an assertion failure!
By looking at the `--log all:debug` output and the `rr dump` output, I have surmised that the order of events is the following:

1. The `do_nothing` thread spawns and is blocked in `pause()`.
2. The `kill` syscall returns, and then the child process calls `exit_group`.
3. `emulate_SIGCONT` is called on the do_nothing thread and the parent main thread here: rr/src/Scheduler.cc, line 325 in 8b78784.
4. `resume_execution` is called on the do_nothing thread here: rr/src/Scheduler.cc, line 328 in 8b78784.
5. `is_stopped_` is not true, so it crashes.

I guess the problem is that the main thread is actually ptrace-stopped (and we know about it) because we got a real SIGSTOP from the kernel out of waitpid for it, but the do_nothing thread is still blocked in the real `pause` syscall and not ptrace-stopped?

I'm not entirely sure how to fix this, which is why this is an issue and not a PR :) But I can think of two options:

- In `apply_group_stop`, send a thread-directed real SIGSTOP signal to each thread, and waitpid it, so that we really ptrace-stop all threads in the process.

What would you recommend as the way to go?
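For illustration only, the "thread-directed SIGSTOP plus waitpid" option could look roughly like the sketch below. `stop_all_threads` is a hypothetical helper, not something that exists in rr, and it assumes the caller is already the ptrace tracer of every thread listed in `tids`. It uses the Linux `tgkill` syscall and `waitpid` with `__WALL`.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <vector>

// Hypothetical helper, not part of rr. Assumes the caller is the ptrace
// tracer of every thread in `tids`, all belonging to thread group `tgid`.
bool stop_all_threads(pid_t tgid, const std::vector<pid_t>& tids) {
  for (pid_t tid : tids) {
    // Thread-directed signal: only `tid` receives it, not the whole process.
    if (syscall(SYS_tgkill, tgid, tid, SIGSTOP) != 0) {
      return false;  // for example ESRCH if the thread no longer exists
    }
    int status = 0;
    // __WALL lets the caller wait on any thread, not just the group leader.
    if (waitpid(tid, &status, __WALL) != tid) {
      return false;
    }
    // Simplification: accept any stop. A real tracer would have to keep
    // waiting until the stop it sees is the one caused by this SIGSTOP.
    if (!WIFSTOPPED(status)) {
      return false;
    }
  }
  return true;
}
```

A tracer would call this with the thread group id and the ids of the threads it tracks. The sketch ignores stops that arrive for other reasons, which is one of the details a real implementation would have to handle.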