Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Assertion `is_stopped_' failed to hold." with SIGSTOP/SIGCONT in multithreaded program #3871

Closed
KJTsanaktsidis opened this issue Nov 6, 2024 · 5 comments · Fixed by #3874

Comments

@KJTsanaktsidis
Copy link
Contributor

This test program:

#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

void *do_nothing(void *ctx) {
  while (1) { pause(); }
  return NULL;
}

int main(int argc, char **argv) {
  pthread_t th;
  pthread_create(&th, NULL, do_nothing, NULL);
  pid_t pid = fork();
  if (pid == 0) {
    sleep(1);
    kill(getppid(), SIGCONT);
  } else {
    kill(getpid(), SIGSTOP);
    waitpid(pid, NULL, 0);
  }
}

when recorded under chaos mode, sometimes produces the following assertion failure during recording (about 30% of the time on my computer)

[FATAL src/Task.cc:1562:resume_execution() errno: ECHILD] 
 (task 144975 (rec:144975) at time 281)
 -> Assertion `is_stopped_' failed to hold. 

n.b. - I know the above program will have nondeterministic behaviour under chaos mode - the SIGCONT might arrive before the SIGSTOP. I reduced this testcase out of a (buggy) test from the Ruby test suite. But, during recording it should either exit or hang forever - not crash with an assertion failure!

By looking at the --log all:debug output and the rr dump output, I have surmised that the order of events is the following:

  • The do_nothing thread spawns and is blocked in pause()
  • The child process is forked
  • Then, the parent process sends SIGSTOP to itself
  • Then, the child process sleep finishes and sends SIGCONT to the parent
  • The kill syscall returns, and then the child process calls exit_group
  • Then, the scheduler decides to execute the do_nothing thread next.
  • emulate_SIGCONT is called on the do_nothing thread and the parent main thread here:
    t->emulate_SIGCONT();
  • Then resume_execution is called on the do_nothing thread here
    if (t->resume_execution(RESUME_SYSCALL, RESUME_WAIT_NO_EXIT, RESUME_NO_TICKS)) {
  • But is_stopped_ is not true, so it crashes.

I guess the problem is that the main thread is actually ptrace-stopped (and we know about it) because we got a real SIGSTOP from the kernel out of waitpid for it, but the do_nothing thread is still blocked in the real pause syscall and not ptrace-stopped?

I'm not entirely sure how to fix this, which is why this is an issue and not a PR :) But I can think of two options:

  • When a process is in an emulated GROUP_STOP, make all of the threads other than the thread-group-leader ineligible for scheduling. Then, we would always resume the main thread first, which is in a ptrace-stop, and can handle the SIGCONT signal. I suppose this might make some sense because a process-directed SIGSTOP/SIGCONT is actually always sent to the thread-group leader right?
  • When emulating a GROUP_STOP in apply_group_stop, send a thread-directed real SIGSTOP signal to each thread, and waitpid it, so that we really ptrace-stop all threads in the process.

What would you recommend as the way to go?

@KJTsanaktsidis
Copy link
Contributor Author

https://elixir.bootlin.com/linux/v6.11.6/source/kernel/signal.c#L2486

It looks like when ptracing with PTRACE_SEIZE (which rr does to its children), SIGSTOP generates a ptrace stop on each child. So I think when handling SIGSTOP, we should wait on every thread for the child so that rr's state syncs up with the kernel's ptrace-stop state.

I'll try and whip up a patch for that...

@KJTsanaktsidis
Copy link
Contributor Author

KJTsanaktsidis commented Nov 7, 2024

Ah. We don't actually allow the SIGSTOP to be delivered - instead we emulate the group-stop completely in the scheduler. So that is not relevant.

@KJTsanaktsidis
Copy link
Contributor Author

Well, making some progress. This is a reproduction that crashes with the assertion failure 100% of the time, without chaos mode: https://gist.github.com/KJTsanaktsidis/5fa224d043f0e10d365fa01f2f4b2519. This should form the basis of a good test case, at least.

Run with

gcc -o sched_repro sched_repro.c
setsid --wait rr record --wait ./sched_repro

Output: https://gist.github.com/KJTsanaktsidis/77fd4bf4b35cf7c59da5423917758211

As a side note, if you run the repro on a single core by itself, like so:

kjtsanaktsidis@kjtsanaktsidis-laptop ~ % taskset -c 1 setsid --wait ./sched_repro
Sending SIGSTOP to 310063
Sending SIGCONT to 310063
thread tid 310064 continuing
main pid 310063 continuing

you see that the kernel will happily schedule the thread to resume before the thread-group-leader after SIGCONT. So the order of rr scheduling the second thread before the thread-group-leader is perfectly valid, and so "When a process is in an emulated GROUP_STOP, make all of the threads other than the thread-group-leader ineligible for scheduling" is not the correct way to fix this.

@rocallahan
Copy link
Collaborator

rocallahan commented Nov 8, 2024

Why not just do the naive thing and only call resume_execution if is_stopped_?

It makes sense to me that a) threads in blocked syscalls aren't really stopped when we emulate SIGSTOP and b) therefore when we emulate SIGCONT, we need to not resume those threads.

@KJTsanaktsidis
Copy link
Contributor Author

Why not just do the naive thing and only call resume_execution if is_stopped_?

Yup this actually does work. I was a bit hesitant to actually propose that because I didn't understand why that worked, but I think I've figured that out and wrote it down in the commit message #3874

Thanks as ever for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants