A long-running block I/O operation (we saw it with a long, slow TRIM operation, but expect anything longer than ~30s would do) can cause a Linux guest to attempt to stop and reset the AHCI controller.
The issue appears to be that the SIGCONT which is sent by blockif_cancel and expected to be delivered via the mevent thread calling back into blockif_sigcont_handler never arrives and so blockif_cancel blocks forever. The interesting backtrace is:
Can be reproduced by e.g. introducing a sleep(35) into blockif_proc's BOP_DELETE handler and then running fstrim on a filesystem from inside the guest. ijc@d8af9d6 contains some code to do that along with some debugging around the blockif_cancel code paths.
This seems likely to be down to a difference in the kevent/kqueue semantics between FreeBSD (where this code originates via bhyve) and OSX. FreeBSD kqueue(2) and OSX kevent(2) differ in their descriptions of EVFILT_SIGNAL in that the OSX version says explicitly:
Only signals sent to the process, not to a particular thread, will trigger the filter.
While the FreeBSD one does not. The signal is sent with pthread_kill(be->be_tid, SIGCONT); so would be expected to be subject to this caveat.
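One way to check the caveat directly is a small probe that registers an EVFILT_SIGNAL filter and then sends a thread-directed signal. This is only a sketch: probe_thread_signal_kevent is a hypothetical helper name, it uses SIGUSR1 rather than SIGCONT to keep SIGCONT's special job-control behaviour out of the picture, and it sends the signal to the calling thread rather than to a separate blkio thread. On platforms without kqueue it just reports -1.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

#if defined(__APPLE__) || defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>
#define HAVE_KQUEUE 1
#endif

/*
 * Hypothetical probe (not part of the hypervisor source).
 * Returns 1 if a thread-directed signal produced an EVFILT_SIGNAL event,
 * 0 if no event arrived within one second, and -1 on error or when
 * kqueue is not available on this platform.
 */
int probe_thread_signal_kevent(void)
{
#ifdef HAVE_KQUEUE
    struct kevent change, event;
    struct timespec timeout = { 1, 0 };
    int kq, n;

    /* Ignore the signal so it cannot terminate the process; the man pages
     * say the filter records delivery attempts even for SIG_IGN. */
    if (signal(SIGUSR1, SIG_IGN) == SIG_ERR)
        return -1;

    kq = kqueue();
    if (kq < 0)
        return -1;

    /* Register interest in SIGUSR1 on the kqueue. */
    EV_SET(&change, SIGUSR1, EVFILT_SIGNAL, EV_ADD, 0, 0, NULL);
    if (kevent(kq, &change, 1, NULL, 0, NULL) < 0) {
        close(kq);
        return -1;
    }

    /* Thread-directed send, in the style of pthread_kill(be->be_tid, SIGCONT). */
    if (pthread_kill(pthread_self(), SIGUSR1) != 0) {
        close(kq);
        return -1;
    }

    /* Wait up to one second for the filter to fire. */
    n = kevent(kq, NULL, 0, &event, 1, &timeout);
    close(kq);
    if (n < 0)
        return -1;
    assert(n <= 1); /* we asked for at most one event */
    return n;
#else
    return -1;
#endif
}
```

If the man page wording is accurate, the probe would be expected to return 0 on OSX. What it returns on FreeBSD is exactly the open question, so no expected value is given here.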
However there is something I do not understand about the original code on FreeBSD which makes me reluctant to just start coding a fix.
There are 3 threads involved:
The VCPU I/O emulation thread (ioemu), runs the device model e.g. the pci_ahci.c code
The block IO thread (blkio), performs actual IO onto the underlying backend devices
The mevent thread, listens for various events using kqueue/kevent. This is actually the process' original main thread which calls mevent_dispatch after initialisation.
There are 2 sets of queues involved:
The three blockif request queues (bc->bc_freeq, bc->bc_pendq and bc->bc_busyq), which between them contain the BLOCKIF_MAXREQ elements (type struct blockif_elem) of bc->bc_reqs. These are protected by bc->bc_mtx and bc->bc_cond.
The (global) blockif_bse_head which contains a chain of struct blockif_sig_elem *, protected through the use of atomic_cmpset_ptr. Each blockif_sig_elem contains bse_mtx and bse_cond used for completion.
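For readers unfamiliar with the compare-and-set idiom used for the second list, here is a minimal sketch of a lock-free singly linked chain. It uses C11 atomics as a stand-in for atomic_cmpset_ptr, and struct sig_elem, sig_head, sig_elem_push and sig_elem_pop are hypothetical names, not the real blockif definitions. The pop side assumes a single consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct blockif_sig_elem (mutex/cond omitted). */
struct sig_elem {
    struct sig_elem *next;
    int pending;
};

/* Hypothetical stand-in for the global blockif_bse_head. */
static _Atomic(struct sig_elem *) sig_head = NULL;

/* Push: link to the head we observed, then publish only if the head has
 * not changed in the meantime; otherwise retry with the new head. */
void sig_elem_push(struct sig_elem *e)
{
    struct sig_elem *old;

    assert(e != NULL);
    old = atomic_load(&sig_head);
    do {
        e->next = old;
    } while (!atomic_compare_exchange_weak(&sig_head, &old, e));
}

/* Pop: swing the head to its successor. Assumes one consumer, so an
 * element cannot be removed and re-pushed underneath us. */
struct sig_elem *sig_elem_pop(void)
{
    struct sig_elem *old = atomic_load(&sig_head);

    while (old != NULL &&
           !atomic_compare_exchange_weak(&sig_head, &old, old->next))
        ; /* on failure, old now holds the current head; try again */
    return old;
}
```

Elements come back in last-in, first-out order, which is fine when the consumer intends to drain the whole chain anyway.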
In normal processing the ioemu thread handles an MMIO access (e.g. for pci_ahci.c, from ahci_handle_rw, atapi_read, ahci_handle_dsm_trim and others) by calling a helper function which passes a blockif_req to blockif_request. That calls blockif_enqueue, which enqueues a blockif_elem, and then pokes bc->bc_cond.
This wakes the blkio thread (a single thread here, though there can be several in upstream bhyve), which was waiting on bc->bc_cond. It then calls blockif_dequeue, which takes the blockif_elem and sets be->be_status to BUSY and be->be_tid to the thread which is claiming the work (that is, blkio).
The blkio thread then processes the IO via blockif_proc, which issues various blocking system calls. When the I/O completes, a callback provided by ioemu is called (for pci_ahci.c this would be ata_ioreq_cb or atapi_ioreq_cb; these update the emulation state, e.g. marking the command complete). The callback is passed err, which is either 0 (success) or an errno value (failure). Finally the blkio thread calls blockif_complete, which frees the blockif_elem.
This all seems reasonable enough.
However upon cancellation, which happens in blockif_cancel (in the case of pci_ahci.c this is called from ahci_port_stop from the ioemu thread), things are more complex.
If the blockif_request is not active, that is, the corresponding blockif_elem has not been claimed by the blkio thread (it is on bc->bc_pendq), then blockif_cancel simply calls blockif_complete. Also simple enough.
However if the blockif_request is active, that is, the corresponding blockif_elem is on bc->bc_busyq and therefore has a non-zero be_tid and a be_status of BUSY, then blockif_cancel allocates a new struct blockif_sig_elem bse on the stack and adds it to the global blockif_bse_head (using atomic_cmpset_ptr). It then sends a SIGCONT to the be_tid with pthread_kill(be->be_tid, SIGCONT); and blocks waiting for the embedded bse_cond to be signalled and the blockif_sig_elem completed.
So far so good but at this point I lose track of what is going on because the SIGCONT is delivered via kqueue to the mevent thread and the blockif_sigcont_handler callback and not to the blkio thread. The callback handler does nothing other than walk the global list marking each blockif_sig_elem complete and kicking the corresponding bse_cond (which wakes the ioemu thread). It takes no action WRT the blkio thread.
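The handshake between the cancelling thread and the callback can be sketched roughly as follows. All names here (struct waiter, waiter_register, waiter_block, complete_all_waiters, completer_main) are hypothetical, and a plain mutex guards the list instead of the compare-and-set scheme, purely to keep the sketch short. The point it tries to show is that the completing side only wakes the waiters; nothing in it touches whichever thread is doing the I/O.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct blockif_sig_elem. */
struct waiter {
    pthread_mutex_t mtx;
    pthread_cond_t cond;
    int pending;
    struct waiter *next;
};

/* Hypothetical global list; guarded by a mutex in this sketch. */
static struct waiter *waiters;
static pthread_mutex_t waiters_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Canceller side, step 1: publish a waiter (typically on the stack). */
void waiter_register(struct waiter *w)
{
    pthread_mutex_init(&w->mtx, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->pending = 1;
    pthread_mutex_lock(&waiters_mtx);
    w->next = waiters;
    waiters = w;
    pthread_mutex_unlock(&waiters_mtx);
}

/* Canceller side, step 2: sleep until somebody marks the waiter complete. */
void waiter_block(struct waiter *w)
{
    pthread_mutex_lock(&w->mtx);
    while (w->pending)
        pthread_cond_wait(&w->cond, &w->mtx);
    pthread_mutex_unlock(&w->mtx);
    pthread_cond_destroy(&w->cond);
    pthread_mutex_destroy(&w->mtx);
}

/* Callback side: detach the list, then mark every waiter complete and wake
 * its owner. Returns the number of waiters woken. */
int complete_all_waiters(void)
{
    struct waiter *w, *next;
    int woken = 0;

    pthread_mutex_lock(&waiters_mtx);
    w = waiters;
    waiters = NULL;
    pthread_mutex_unlock(&waiters_mtx);

    for (; w != NULL; w = next) {
        next = w->next; /* read first: w may vanish once its owner wakes */
        pthread_mutex_lock(&w->mtx);
        assert(w->pending);
        w->pending = 0;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->mtx);
        woken++;
    }
    return woken;
}

/* Thread entry point standing in for the event-loop callback. */
void *completer_main(void *arg)
{
    (void)arg;
    complete_all_waiters();
    return NULL;
}
```

If the callback never runs, waiter_block never returns, which matches the hang described at the top of this issue.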
The only way I can see this working on FreeBSD is that receiving the SIGCONT causes the system call which the blkio thread is currently in to return with EINTR (or similar) while delivering the actual signal to another thread via the kevent. This has a subtle dependency on the ordering of the events (the system call must return before the signal handler callback is called) and is not something which is made clear in any of the documentation I've been able to find.
I'm also not sure what happens if the blkio thread is merely on the way to calling the system call at the point where the cancellation signal arrives. Seems like it would block when it actually made the call? Might be harmless (since things rely on these things returning via their normal return path to signal the error) or might result in things not being cancelled as expected.
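To make the EINTR theory and the race concrete, here is a small POSIX sketch. It is not what the blockif code does: it assumes a do-nothing handler installed without SA_RESTART, uses SIGUSR1 instead of SIGCONT, and blocks in read() on an empty pipe in place of real disk I/O. The names interrupt_blocked_read, worker_main and noop_handler are hypothetical. The sender keeps re-sending until the worker reports that it has returned, because a single signal that lands before the worker enters read() interrupts nothing and the worker then blocks anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static int pipe_fds[2];
static atomic_int worker_done;
static int worker_errno;

/* Does nothing; its only job is to make the blocked call return early. */
static void noop_handler(int sig)
{
    (void)sig;
}

/* Stand-in for the I/O thread: blocks in read() on a pipe nobody writes to. */
static void *worker_main(void *arg)
{
    char byte;
    ssize_t n;

    (void)arg;
    n = read(pipe_fds[0], &byte, 1);
    assert(n < 0); /* no writer exists, so read() can only fail */
    worker_errno = errno;
    atomic_store(&worker_done, 1);
    return NULL;
}

/* Returns the errno the worker's read() failed with, or -1 on setup error. */
int interrupt_blocked_read(void)
{
    struct sigaction sa;
    struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    pthread_t tid;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = noop_handler; /* sa_flags stays 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (pipe(pipe_fds) != 0)
        return -1;
    atomic_store(&worker_done, 0);
    if (pthread_create(&tid, NULL, worker_main, NULL) != 0)
        return -1;

    /* Re-send until the worker has actually come back from read(). */
    while (!atomic_load(&worker_done)) {
        pthread_kill(tid, SIGUSR1);
        nanosleep(&tick, NULL);
    }

    pthread_join(tid, NULL);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return worker_errno;
}
```

Calling interrupt_blocked_read() should yield EINTR under these assumptions. blockif_cancel, as described above, sends the signal once and then waits, so it has no equivalent of the retry loop.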
This needs more thought and investigation. I shall ask on the bhyve list, hence setting down my understanding here.