"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

KJTsanaktsidis · 2024-11-06T16:39:19Z

This test program:

#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

void *do_nothing(void *ctx) {
  while (1) { pause(); }
  return NULL;
}

int main(int argc, char **argv) {
  pthread_t th;
  pthread_create(&th, NULL, do_nothing, NULL);
  pid_t pid = fork();
  if (pid == 0) {
    sleep(1);
    kill(getppid(), SIGCONT);
  } else {
    kill(getpid(), SIGSTOP);
    waitpid(pid, NULL, 0);
  }
}

when recorded under chaos mode, sometimes produces the following assertion failure during recording (about 30% of the time on my computer)

[FATAL src/Task.cc:1562:resume_execution() errno: ECHILD] 
 (task 144975 (rec:144975) at time 281)
 -> Assertion `is_stopped_' failed to hold.

n.b. - I know the above program will have nondeterministic behaviour under chaos mode - the SIGCONT might arrive before the SIGSTOP. I reduced this testcase out of a (buggy) test from the Ruby test suite. But, during recording it should either exit or hang forever - not crash with an assertion failure!

By looking at the --log all:debug output and the rr dump output, I have surmised that the order of events is the following:

The do_nothing thread spawns and is blocked in pause()
The child process is forked
Then, the parent process sends SIGSTOP to itself
Then, the child process sleep finishes and sends SIGCONT to the parent
The kill syscall returns, and then the child process calls exit_group
Then, the scheduler decides to execute the do_nothing thread next.
emulate_SIGCONT is called on the do_nothing thread and the parent main thread here:

rr/src/Scheduler.cc

Line 325 in 8b78784

t->emulate_SIGCONT();
Then resume_execution is called on the do_nothing thread here

rr/src/Scheduler.cc

Line 328 in 8b78784

if (t->resume_execution(RESUME_SYSCALL, RESUME_WAIT_NO_EXIT, RESUME_NO_TICKS)) {
But is_stopped_ is not true, so it crashes.

I guess the problem is that the main thread is actually ptrace-stopped (and we know about it) because we got a real SIGSTOP from the kernel out of waitpid for it, but the do_nothing thread is still blocked in the real pause syscall and not ptrace-stopped?

I'm not entirely sure how to fix this, which is why this is an issue and not a PR :) But I can think of two options:

When a process is in an emulated GROUP_STOP, make all of the threads other than the thread-group-leader ineligible for scheduling. Then, we would always resume the main thread first, which is in a ptrace-stop, and can handle the SIGCONT signal. I suppose this might make some sense because a process-directed SIGSTOP/SIGCONT is actually always sent to the thread-group leader right?
When emulating a GROUP_STOP in apply_group_stop, send a thread-directed real SIGSTOP signal to each thread, and waitpid it, so that we really ptrace-stop all threads in the process.

What would you recommend as the way to go?

The text was updated successfully, but these errors were encountered:

KJTsanaktsidis · 2024-11-06T23:21:09Z

https://elixir.bootlin.com/linux/v6.11.6/source/kernel/signal.c#L2486

It looks like when ptracing with PTRACE_SEIZE (which rr does to its children), SIGSTOP generates a ptrace stop on each child. So I think when handling SIGSTOP, we should wait on every thread for the child so that rr's state syncs up with the kernel's ptrace-stop state.

I'll try and whip up a patch for that...

KJTsanaktsidis · 2024-11-07T00:25:23Z

Ah. We don't actually allow the SIGSTOP to be delivered - instead we emulate the group-stop completely in the scheduler. So that is not relevant.

KJTsanaktsidis · 2024-11-07T21:25:48Z

Well, making some progress. This is a reproduction that crashes with the assertion failure 100% of the time, without chaos mode: https://gist.github.com/KJTsanaktsidis/5fa224d043f0e10d365fa01f2f4b2519. This should form the basis of a good test case, at least.

Run with

gcc -o sched_repro sched_repro.c
setsid --wait rr record --wait ./sched_repro

Output: https://gist.github.com/KJTsanaktsidis/77fd4bf4b35cf7c59da5423917758211

As a side note, if you run the repro on a single core by itself, like so:

kjtsanaktsidis@kjtsanaktsidis-laptop ~ % taskset -c 1 setsid --wait ./sched_repro
Sending SIGSTOP to 310063
Sending SIGCONT to 310063
thread tid 310064 continuing
main pid 310063 continuing

you see that the kernel will happily schedule the thread to resume before the thread-group-leader after SIGCONT. So the order of rr scheduling the second thread before the thread-group-leader is perfectly valid, and so "When a process is in an emulated GROUP_STOP, make all of the threads other than the thread-group-leader ineligible for scheduling" is not the correct way to fix this.

rocallahan · 2024-11-08T19:42:01Z

Why not just do the naive thing and only call resume_execution if is_stopped_?

It makes sense to me that a) threads in blocked syscalls aren't really stopped when we emulate SIGSTOP and b) therefore when we emulate SIGCONT, we need to not resume those threads.

KJTsanaktsidis · 2024-11-09T04:01:48Z

Why not just do the naive thing and only call resume_execution if is_stopped_?

Yup this actually does work. I was a bit hesitant to actually propose that because I didn't understand why that worked, but I think I've figured that out and wrote it down in the commit message #3874

Thanks as ever for your help!

KJTsanaktsidis mentioned this issue Nov 9, 2024

Fix SIGCONT handling on threads blocked in syscalls #3874

Merged

rocallahan closed this as completed in #3874 Nov 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

KJTsanaktsidis commented Nov 6, 2024

KJTsanaktsidis commented Nov 6, 2024

KJTsanaktsidis commented Nov 7, 2024 •

edited

Loading

KJTsanaktsidis commented Nov 7, 2024

rocallahan commented Nov 8, 2024 •

edited

Loading

KJTsanaktsidis commented Nov 9, 2024

"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

Comments

KJTsanaktsidis commented Nov 6, 2024

KJTsanaktsidis commented Nov 6, 2024

KJTsanaktsidis commented Nov 7, 2024 • edited Loading

KJTsanaktsidis commented Nov 7, 2024

rocallahan commented Nov 8, 2024 • edited Loading

KJTsanaktsidis commented Nov 9, 2024

KJTsanaktsidis commented Nov 7, 2024 •

edited

Loading

rocallahan commented Nov 8, 2024 •

edited

Loading