Linux signals
Each Linux task points at a sighandler table. For each signal, the table records whether the signal has the default disposition, is ignored, or has a userspace handler. The tricky bit is that tasks can share tables, according to the following rules:
- After `exec`, the task gets a copy of its previous table, except all handled signals are reset to default (ignored signals aren't changed).
- After `fork`, the new child gets a copy of its parent's table.
- After `clone`, the new child gets a copy of its parent's table, unless CLONE_SIGHAND is passed. In that case the table is shared.
To the best of my knowledge, only the `signal` and `sigaction` syscalls can alter sighandler tables. Parsing `struct sigaction` is a bit annoying, but we only need the first word from the struct (the sa_handler / sa_sigaction field).
To implement these semantics in rr, we can create a struct representing the sighandler table and give each task a (ref)pointer to one. When the first child is forked, we know what its table will be: a copy of the rr parent's. From there on we assign sighandler tables to tasks by the rules above. Upon a clone(CLONE_SIGHAND), we simply give the new task a pointer to its parent's handler table and bump the refcount.
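The sharing rules can be modeled in a few lines. This is a minimal sketch, not rr's actual implementation: `SighandlerTable`, `Task`, `post_exec`, and the `share_sighand` flag (standing in for CLONE_SIGHAND) are hypothetical names, and Python object references play the role of the refcounted pointer.

```python
from enum import Enum


class Disposition(Enum):
    DEFAULT = "default"
    IGNORE = "ignore"
    HANDLER = "handler"


class SighandlerTable:
    """Maps signal number -> Disposition; missing entries mean DEFAULT."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def get(self, sig):
        return self.entries.get(sig, Disposition.DEFAULT)

    def set(self, sig, disposition):
        self.entries[sig] = disposition

    def copy(self):
        return SighandlerTable(self.entries)

    def copy_for_exec(self):
        # Handled signals revert to default; ignored signals stay ignored.
        return SighandlerTable(
            {s: d for s, d in self.entries.items() if d is Disposition.IGNORE}
        )


class Task:
    def __init__(self, table):
        self.sighandlers = table

    def fork(self):
        # The child always gets its own copy.
        return Task(self.sighandlers.copy())

    def clone(self, share_sighand):
        # share_sighand=True models CLONE_SIGHAND: both tasks use one table.
        table = self.sighandlers if share_sighand else self.sighandlers.copy()
        return Task(table)

    def post_exec(self):
        self.sighandlers = self.sighandlers.copy_for_exec()
```

With this model, a change made through a task created with `clone(share_sighand=True)` is visible to its parent, while a forked child's table evolves independently.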
In my testing, ptrace didn't notify rr of signals that were queued because they were masked off. So it seems that we don't need to track each task's signal mask, which would add a lot of complication.
These are the kinds of signal "interruption" that are relevant to this discussion:
- Signal delivered, but ignored; i.e., no userspace handler invoked and default action not taken.
- Signal delivered and userspace handler invoked.
- Signal ready to deliver but ptrace tracer declines to deliver it.
With respect to those interruptions, syscalls fall into the following classes:
- Never interrupted, from userspace's perspective; never return -EINTR. Example: `clock_gettime()`.
- Interrupted when a delivered signal invokes a user sighandler; return -EINTR in that case, unless the signal handler is registered with SA_RESTART, in which case the syscall behaves as if the signal was ignored; i.e., the syscall is automatically restarted. Example: `read()`.
- Always interrupted when a delivered signal invokes a user sighandler; always return -EINTR. SA_RESTART has no effect. Automatically restarted on delivery of an ignored signal. Example: `nanosleep()`.
When a syscall is interrupted by a signal, the kernel first exits the original syscall with a special error code. Tasks themselves can never observe these error codes (except when there are kernel bugs), but ptrace tracers can observe them in tracees. The codes are listed below.
After exiting the original syscall, the user sighandler is invoked, if there was one. (The handler can itself be interrupted by signals while in other syscalls, and so on recursively.) Next the kernel checks whether the original syscall should be restarted per the rules above.
Different mechanisms are required to restart different syscalls. Some syscalls can be restarted simply by invoking the syscall again with the same arguments; `read()` is one example. Other syscalls need special treatment, for example if their arguments are time-varying. One example is nanosleep: if a 2-second sleep is interrupted by a signal 1 second into the wait, then restarting the sleep with the same arguments would result in a 3-second wait. Linux solves this by creating a "restart block" for the syscall, to which the syscall implementation writes updated arguments that can be used to restart it. Then to restart the syscall, instead of simply trapping to the original syscall, the kernel sets things up so that the task enters the kernel through a special SYS_restart_syscall entry point, which resumes by using the restart block.
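The nanosleep arithmetic can be checked with a toy model. This is only a sketch: `RestartBlock` here is a hypothetical stand-in that stores an absolute deadline, which is one way to get the "updated args" behavior described above; it does not mirror the kernel's data structure.

```python
def naive_restart_total(requested, elapsed_before_signal):
    """Total wall time if the sleep is re-issued with its original argument."""
    return elapsed_before_signal + requested


class RestartBlock:
    """Toy restart block: remembers when the sleep is supposed to end."""

    def __init__(self, start, requested):
        self.deadline = start + requested

    def remaining(self, now):
        return max(0.0, self.deadline - now)


def restart_block_total(requested, elapsed_before_signal, start=0.0):
    """Total wall time if the restart sleeps only for the time still owed."""
    block = RestartBlock(start, requested)
    now = start + elapsed_before_signal
    return elapsed_before_signal + block.remaining(now)


print(naive_restart_total(2.0, 1.0))   # 3.0: the sleep overshoots by 1 second
print(restart_block_total(2.0, 1.0))   # 2.0: the sleep ends when first requested
```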
The magic restart codes are
- ERESTARTNOHAND: restart the syscall only if no user sighandler is invoked; if a handler runs, always return EINTR, regardless of SA_RESTART.
- ERESTARTNOINTR: always restart the syscall with no errno.
- ERESTARTSYS: if the signal action includes SA_RESTART, use the ERESTARTNOINTR behavior. Otherwise, use the ERESTARTNOHAND behavior.
- ERESTART_RESTARTBLOCK: restart the syscall by using the restart block, through a call to `SYS_restart_syscall`.
Examples: an interrupted `read()` results in ERESTARTSYS. An interrupted `nanosleep()` results in ERESTART_RESTARTBLOCK. An interrupted `pselect()` results in ERESTARTNOHAND.
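As a rough summary of how the codes resolve without a tracer involved, here is a small decision function. It is a sketch under assumptions: `resolve_restart` and its string results are made-up names, the numeric values are the kernel-internal constants as I understand them, and for ERESTARTNOHAND and ERESTART_RESTARTBLOCK it assumes the syscall is restarted when no handler runs, matching the third syscall class described earlier.

```python
from enum import Enum


class RestartCode(Enum):
    ERESTARTSYS = 512
    ERESTARTNOINTR = 513
    ERESTARTNOHAND = 514
    ERESTART_RESTARTBLOCK = 516


def resolve_restart(code, handler_invoked, sa_restart=False):
    """Return "EINTR", "restart", or "restart_syscall".

    handler_invoked is False when the signal was ignored, so no
    userspace handler ran.
    """
    if handler_invoked:
        if code is RestartCode.ERESTARTNOINTR:
            return "restart"
        if code is RestartCode.ERESTARTSYS and sa_restart:
            return "restart"
        # ERESTARTNOHAND, ERESTART_RESTARTBLOCK, or ERESTARTSYS
        # without SA_RESTART: userspace sees EINTR.
        return "EINTR"
    if code is RestartCode.ERESTART_RESTARTBLOCK:
        # Restarted through the restart block rather than re-issued as-is.
        return "restart_syscall"
    return "restart"


# read() interrupted by a handler registered without SA_RESTART
print(resolve_restart(RestartCode.ERESTARTSYS, handler_invoked=True))
# nanosleep() interrupted by an ignored signal
print(resolve_restart(RestartCode.ERESTART_RESTARTBLOCK, handler_invoked=False))
```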
Note: syscalls that are restarted because the interrupting signal was SIG_IGN or its handler was registered with SA_RESTART are restarted with their original arguments, not the fudged arguments that rr sets up (redirection to scratch buffers).
`ptrace` makes things slightly more complicated. In addition to the SIG_IGN and SA_RESTART mechanisms available to tracees, ptrace allows the tracer to decline to deliver signals. When tracers are notified of pending tracee signals, the tracee has already exited any in-progress syscall with one of the codes above. Then the tracer can either deliver the signal or not. If the signal isn't delivered, then the tracee behaves almost as if the signal was SIG_IGN or SA_RESTART, with these exceptions:
- the syscall is entered through the original entry point (as seen by the tracer), SYS_poll etc; not the SYS_restart_syscall entry point
- the restarted syscall enters with the fudged arguments, the ones that may be redirected to scratch
- as for SIG_IGN/SA_RESTART, the exit point is observed to be SYS_restart_syscall by the tracer
Squelching incoming signals is generally a silly thing to do, but rr is forced to do this because it uses some tracee signals for internal implementation details (rdtsc, time-slice interrupts, and desched notifications, to be precise).
Sources
- http://stackoverflow.com/questions/9576604/what-does-erestartsys-used-while-writing-linux-driver
- `man ptrace`
- linux source
Let's say that a `read()` blocks and is interrupted by an SA_RESTART signal, without a signal handler. The read exits with ERESTARTSYS, then rr sees a signal event and attempts to single-step the tracee to determine whether a sighandler is entered. The subsequent waitpid on the thread just hangs.
Repeatedly calling `waitpid(-1)` shows the main thread running a few of its syscalls. Finally the original thread becomes runnable with status trace-trap. However, it shows 0 instructions retired, which is what rr keys off of to know if a signal handler was established.
Also, the `read()` from the original thread returns 1 ... but without entering read() again. So from rr's perspective, the read exited with ERESTARTSYS and might restart, but to the tracee it looks like the read magically returned 1.
Restarting the blocked thread with PTRACE_SYSCALL delivering the blocked signal works more as expected: the read() restarts, the thread blocks, and we switch it out.
This issue: an ABRT raised on a non-main thread appears to block that thread until the main thread exits. The behavior is not observed with TERM, a signal that is fatal but does not dump core.
// Step 0
finish_syscall(task, syscall):
  if ret is ERESTART*:
    task.push_event(restart_syscall(ret, syscall))

// Step 1
signal_pending(task, sig):
  go_to_happy_place(task)
  if task.sig_disposition(sig) == HANDLER:
    task.events.push(sighandler(sig))
  task.resume(CONT or SYSCALL, sig)

enter_syscall(task, syscall):
  if syscall is sigreturn:
    task.events.pop_sighandler()
    // finish, ...
    return
  if task.events.top is restart_syscall:
    restarted_syscall = task.events.pop()
    if syscall ~= restarted_syscall:
      log "restarting %s"
      syscall = restarted_syscall
    else:
      log "not restarting %s"
      restarted_syscall.ret = -EINTR
      // process_syscall(), or otherwise cleanup restarted_syscall state
      record_event(restarted_syscall)
  task.push(syscall)
  // continue with syscall
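To make the event-stack idea above concrete, here is a runnable toy version. It is a sketch, not rr code: `Task`, `Event`, and the `recorded` log are hypothetical, syscalls are matched by name only (the pseudocode's `~=` may be looser), and a `sigreturn` is assumed to arrive while the handler event is on top of the stack.

```python
from dataclasses import dataclass, field

# Restart codes as a tracer sees them in the syscall return value.
ERESTART_CODES = {-512, -513, -514, -516}
EINTR = 4


@dataclass
class Event:
    kind: str  # "syscall", "restart_syscall", or "sighandler"
    name: str
    ret: int = 0


@dataclass
class Task:
    events: list = field(default_factory=list)
    recorded: list = field(default_factory=list)

    def enter_syscall(self, name):
        if name == "sigreturn":
            handler = self.events.pop()
            assert handler.kind == "sighandler"
            return
        if self.events and self.events[-1].kind == "restart_syscall":
            interrupted = self.events.pop()
            if interrupted.name == name:
                self.recorded.append(f"restarting {name}")
            else:
                # The interrupted syscall was abandoned; userspace saw EINTR.
                interrupted.ret = -EINTR
                self.recorded.append(f"not restarting {interrupted.name}")
        self.events.append(Event("syscall", name))

    def finish_syscall(self, ret):
        syscall = self.events.pop()
        assert syscall.kind == "syscall"
        if ret in ERESTART_CODES:
            # Step 0: leave a marker so the next syscall entry can be matched.
            self.events.append(Event("restart_syscall", syscall.name, ret))
        else:
            self.recorded.append(f"{syscall.name} = {ret}")

    def signal_pending(self, sig, has_handler):
        # Step 1: only signals with a userspace handler get an event.
        if has_handler:
            self.events.append(Event("sighandler", sig))


task = Task()
task.enter_syscall("read")
task.finish_syscall(-512)               # interrupted: ERESTARTSYS
task.signal_pending("USR1", has_handler=True)
task.enter_syscall("sigreturn")         # handler done
task.enter_syscall("read")              # same syscall again: a restart
task.finish_syscall(1)
print(task.recorded)  # ['restarting read', 'read = 1']
```

If the next syscall entered after `sigreturn` were something other than `read`, the same code path would log "not restarting read" and mark the interrupted read as having returned -EINTR.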
Sequences of curated events, to get a sense of what's going on.
-->enter read
[block]
[recv USR1/CHLD]
<--exit read: return -512 (ERESTARTSYS)
-->sighandler
-->enter read
[block]
<--exit read: return 1
<--sigreturn
HANG: trying to single-step into handler

-->enter read
[block]
[recv USR1]
<--exit read: return -512 (ERESTARTSYS)
-->sighandler
-->enter read
[block]
<--exit read: return 1
<--sigreturn
(continue normally ...)

-->enter read
[block]
[recv USR1]
<--exit read: return -512 (ERESTARTSYS)
(-->sighandler)
-->enter read
[block]
<--exit read: return 1
<--sigreturn
-->enter read
[block]
(HANG; expected; waiting for progress in scheduler)

-->enter read
[block]
[recv USR1]
<--exit read: return -512 (ERESTARTSYS)
(-->sighandler)
-->enter read
[block]
[recv USR2]
<--exit read: return -512 (ERESTARTSYS)
(-->sighandler)
-->enter read
[block]
<--exit read: return 1
<--sigreturn
<--sigreturn
(continue normally ...)

-->enter read
[block]
[recv USR1]
<--exit read: return -512 (ERESTARTSYS)
(-->sighandler)
-->enter read
[block]
[recv USR2]
<--exit read: return -512 (ERESTARTSYS)
(-->sighandler)
-->enter read
[block]
<--exit read: return 1
<--sigreturn
-->enter read
[block]
(HANG; expected; waiting for progress in scheduler)
All three `read()` syscalls are restarted and successfully read a byte, as the above case would suggest.