Deadlock in AHCI stop blockif_cancel due to lack of SIGCONT signal delivery #94

ijc opened this issue Dec 12, 2016 · 2 comments

ijc commented Dec 12, 2016

A long-running block I/O operation (we saw it with a long, slow TRIM operation, but expect anything longer than ~30s would do) can cause a Linux guest to attempt to stop and reset the AHCI controller.

The issue appears to be that the SIGCONT sent by blockif_cancel, which is expected to be delivered via the mevent thread calling back into blockif_sigcont_handler, never arrives, and so blockif_cancel blocks forever. The interesting backtrace is:

  thread #13: tid = 0x2d000a5, 0x00007fff8c0aac8a libsystem_kernel.dylib`__psynch_cvwait + 10, name = 'vcpu:1'
    frame #0: 0x00007fff8c0aac8a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff8c19496a libsystem_pthread.dylib`_pthread_cond_wait + 712
    frame #2: 0x000000010ea97316 com.docker.hyperkit`pci_ahci_write + 128 at block_if.c:899 [opt]
    frame #3: 0x000000010ea97296 com.docker.hyperkit`pci_ahci_write [inlined] ahci_port_stop(p=<unavailable>) + 180 at pci_ahci.c:445 [opt]
    frame #4: 0x000000010ea971e2 com.docker.hyperkit`pci_ahci_write [inlined] pci_ahci_port_write(sc=<unavailable>, offset=<unavailable>, value=<unavailable>) + 492 at pci_ahci.c:2051 [opt]
    frame #5: 0x000000010ea96ff6 com.docker.hyperkit`pci_ahci_write(vcpu=<unavailable>, pi=<unavailable>, baridx=<unavailable>, offset=<unavailable>, size=<unavailable>, value=<unavailable>) + 498 at pci_ahci.c:2156 [opt]
    frame #6: 0x000000010ea9c974 com.docker.hyperkit`pci_emul_mem_handler(vcpu=1, dir=<unavailable>, addr=<unavailable>, size=<unavailable>, val=<unavailable>, arg1=0x00007fd1ca7000a0, arg2=<unavailable>) + 284 at pci_emul.c:394 [opt]
    frame #7: 0x000000010ea9686d com.docker.hyperkit`mem_write(unused=<unavailable>, vcpu=<unavailable>, gpa=<unavailable>, wval=<unavailable>, size=<unavailable>, arg=<unavailable>) + 47 at mem.c:151 [opt]
    frame #8: 0x000000010ea9292b com.docker.hyperkit`vmm_emulate_instruction [inlined] emulate_mov(vm=<unavailable>, vcpuid=<unavailable>, gpa=<unavailable>, vie=<unavailable>, memread=<unavailable>, memwrite=(com.docker.hyperkit`mem_write at mem.c:147), arg=<unavailable>) + 442 at vmm_instruction_emul.c:497 [opt]
    frame #9: 0x000000010ea92771 com.docker.hyperkit`vmm_emulate_instruction(vm=<unavailable>, vcpuid=<unavailable>, gpa=<unavailable>, vie=<unavailable>, paging=<unavailable>, memread=<unavailable>, memwrite=<unavailable>, memarg=<unavailable>) + 2930 at vmm_instruction_emul.c:1421 [opt]
    frame #10: 0x000000010ea91739 com.docker.hyperkit`xh_vm_emulate_instruction(vcpu=<unavailable>, gpa=<unavailable>, vie=<unavailable>, paging=<unavailable>, memread=<unavailable>, memwrite=<unavailable>, memarg=<unavailable>) + 87 at vmm_api.c:828 [opt]
    frame #11: 0x000000010eaab6d0 com.docker.hyperkit`vmexit_inst_emul [inlined] emulate_mem(paddr=<unavailable>, vie=0x000000010edcd670, paging=0x000000010edcd658) + 31 at mem.c:202 [opt]
    frame #12: 0x000000010eaab6b1 com.docker.hyperkit`vmexit_inst_emul(vme=<unavailable>, pvcpu=<unavailable>) + 180 at hyperkit.c:486 [opt]
    frame #13: 0x000000010eaabf3b com.docker.hyperkit`vcpu_thread [inlined] vcpu_loop(vcpu=1) + 116 at hyperkit.c:630 [opt]
    frame #14: 0x000000010eaabec7 com.docker.hyperkit`vcpu_thread(param=<unavailable>) + 1076 at hyperkit.c:276 [opt]
    frame #15: 0x00007fff8c193aab libsystem_pthread.dylib`_pthread_body + 180
    frame #16: 0x00007fff8c1939f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #17: 0x00007fff8c193221 libsystem_pthread.dylib`thread_start + 13

This can be reproduced by, for example, introducing a sleep(35) into blockif_proc's BOP_DELETE handler and then running fstrim on a filesystem from inside the guest. ijc@d8af9d6 contains some code to do that, along with some debugging around the blockif_cancel code paths.
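
For concreteness, the delay injection looks something like this (a sketch only; the real BOP_DELETE handling in blockif_proc is elided):

    /* Inside blockif_proc()'s switch on the operation type. */
    case BOP_DELETE:
            sleep(35);      /* make the TRIM outlive the guest's ~30s command
                             * timeout so it stops/resets the AHCI port while
                             * this request is still busy */
            /* ... normal BOP_DELETE processing ... */
            break;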

This seems likely to be down to a difference in the kevent/kqueue semantics between FreeBSD (where this code originates via bhyve) and OSX. FreeBSD kqueue(2) and OSX kevent(2) differ in their descriptions of EVFILT_SIGNAL in that the OSX version says explicitly:

Only signals sent to the process, not to a particular thread, will trigger the filter.

The FreeBSD man page makes no such statement. The signal is sent with pthread_kill(be->be_tid, SIGCONT); so it would be expected to be subject to this caveat.
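
For context, the kqueue side of this looks roughly like the following (a minimal standalone sketch of the pattern, not the actual mevent.c code). Per the OSX caveat quoted above, a SIGCONT aimed at a single thread with pthread_kill may never surface in this loop:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <signal.h>
    #include <stdio.h>

    int
    main(void)
    {
            struct kevent kev, res;
            int kq = kqueue();

            /* Consume SIGCONT via kqueue rather than via a signal handler,
             * which is the pattern block_if.c/mevent.c rely on. */
            signal(SIGCONT, SIG_IGN);
            EV_SET(&kev, SIGCONT, EVFILT_SIGNAL, EV_ADD, 0, 0, NULL);
            kevent(kq, &kev, 1, NULL, 0, NULL);

            for (;;) {
                    if (kevent(kq, NULL, 0, &res, 1, NULL) == 1 &&
                        res.filter == EVFILT_SIGNAL)
                            printf("SIGCONT seen via EVFILT_SIGNAL\n");
            }
    }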

However, there is something I do not understand about the original code on FreeBSD which makes me reluctant to just start coding a fix.

There are 3 threads involved:

  • The VCPU I/O emulation thread (ioemu), which runs the device model, e.g. the pci_ahci.c code
  • The block I/O thread (blkio), which performs the actual I/O against the underlying backend devices
  • The mevent thread, which listens for various events using kqueue/kevent. This is actually the process's original main thread, which calls mevent_dispatch after initialisation.

There are 2 sets of queues involved:

  • The three blockif request queues (bc->bc_freeq, bc->bc_pendq and bc->bc_busyq), which between them contain the BLOCKIF_MAXREQ elements (of type struct blockif_elem) in bc->bc_reqs. These are protected by bc->bc_mtx and bc->bc_cond.
  • The (global) blockif_bse_head, which contains a chain of struct blockif_sig_elem *, protected through the use of atomic_cmpset_ptr. Each blockif_sig_elem contains a bse_mtx and bse_cond used for completion. (See the sketch just after this list.)
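
The second of these looks roughly like this (paraphrased from block_if.c; field names as in the source, everything else condensed):

    /* One element is pushed per cancellation of a busy request.  The list is
     * a lock-free stack manipulated with atomic_cmpset_ptr (a FreeBSD
     * atomic(9) primitive; hyperkit presumably provides a compatible shim). */
    struct blockif_sig_elem {
            pthread_mutex_t          bse_mtx;
            pthread_cond_t           bse_cond;
            int                      bse_pending;
            struct blockif_sig_elem *bse_next;
    };

    static struct blockif_sig_elem *blockif_bse_head;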

In normal processing the ioemu thread handles an MMIO (e.g. for pci_ahci.c, from ahci_handle_rw, atapi_read, ahci_handle_dsm_trim and others) and calls a helper function which passes a blockif_req to blockif_request; that in turn calls blockif_enqueue, which enqueues a blockif_elem and then pokes bc->bc_cond.

This wakes the blkio thread (a single thread here, though there can be multiple in upstream bhyve), which was waiting on bc->bc_cond. It then calls blockif_dequeue, which takes the blockif_elem, sets be->be_status to BUSY and sets be->be_tid to the thread which is claiming the work (that is, blkio).

The blkio thread then processes the I/O via blockif_proc, which issues various blocking system calls. When the I/O completes, an ioemu-provided callback is called (for pci_ahci.c this would be ata_ioreq_cb or atapi_ioreq_cb; these update the emulation state etc., i.e. mark the command complete). The callback is passed err, which is either 0 (success) or an errno value (failure). Finally the blkio thread calls blockif_complete, which frees the blockif_elem.

This all seems reasonable enough.
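
In code, the handshake above is roughly the following (a paraphrase with names following block_if.c; the signatures and bodies are simplified, and the free-queue bookkeeping, error handling and buffer management are omitted):

    /* ioemu side: queue a request and poke the blkio thread. */
    static int
    blockif_request(struct blockif_ctxt *bc, struct blockif_req *breq,
        enum blockop op)
    {
            int err = 0;

            pthread_mutex_lock(&bc->bc_mtx);
            if (blockif_enqueue(bc, breq, op))      /* freeq -> pendq */
                    pthread_cond_signal(&bc->bc_cond);
            else
                    err = ENOMEM;                   /* no free blockif_elem */
            pthread_mutex_unlock(&bc->bc_mtx);
            return (err);
    }

    /* blkio side: claim pending work, do the (blocking) I/O, complete it. */
    static void *
    blockif_thr(void *arg)
    {
            struct blockif_ctxt *bc = arg;
            struct blockif_elem *be;
            pthread_t t = pthread_self();

            pthread_mutex_lock(&bc->bc_mtx);
            for (;;) {
                    while (blockif_dequeue(bc, t, &be)) {
                            /* pendq -> busyq, be_status = BUSY, be_tid = t */
                            pthread_mutex_unlock(&bc->bc_mtx);
                            blockif_proc(bc, be);   /* blocking syscalls, then
                                                     * the ioemu callback */
                            pthread_mutex_lock(&bc->bc_mtx);
                            blockif_complete(bc, be);   /* busyq -> freeq */
                    }
                    pthread_cond_wait(&bc->bc_cond, &bc->bc_mtx);
            }
    }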

However upon cancellation, which happens in blockif_cancel (in the case of pci_ahci.c this is called from ahci_port_stop from the ioemu thread), things are more complex.

If the blockif_request is not active, that is, the corresponding blockif_elem has not been claimed by the blkio thread (it is still on bc->bc_pendq), then blockif_cancel simply calls blockif_complete. Also simple enough.

However, if the blockif_request is active, that is, the corresponding blockif_elem is on bc->bc_busyq and therefore has a non-zero be_tid and a be_status of BUSY, then blockif_cancel allocates a new struct blockif_sig_elem bse on the stack and adds it to the global blockif_bse_head (using atomic_cmpset_ptr). It then sends a SIGCONT to be_tid with pthread_kill(be->be_tid, SIGCONT); and blocks waiting for the embedded bse_cond to be signalled and the blockif_sig_elem to be marked complete.
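
The busy path in blockif_cancel is, roughly (again paraphrasing block_if.c, trimmed to the part relevant here):

    /* Cancelling a request the blkio thread has already claimed: park a
     * completion element on the global list, poke the blkio thread with
     * SIGCONT, and wait for someone to mark the element done. */
    struct blockif_sig_elem bse, *old_head;

    pthread_mutex_init(&bse.bse_mtx, NULL);
    pthread_cond_init(&bse.bse_cond, NULL);
    bse.bse_pending = 1;

    do {
            old_head = blockif_bse_head;
            bse.bse_next = old_head;
    } while (!atomic_cmpset_ptr((uintptr_t *)&blockif_bse_head,
                                (uintptr_t)old_head,
                                (uintptr_t)&bse));

    pthread_kill(be->be_tid, SIGCONT);

    pthread_mutex_lock(&bse.bse_mtx);
    while (bse.bse_pending)
            pthread_cond_wait(&bse.bse_cond, &bse.bse_mtx);
    pthread_mutex_unlock(&bse.bse_mtx);

Frame #2 of the backtrace above (block_if.c:899) is presumably this final pthread_cond_wait, which never returns because bse_pending is never cleared.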

So far so good, but at this point I lose track of what is going on, because the SIGCONT is delivered via kqueue to the mevent thread and the blockif_sigcont_handler callback, not to the blkio thread. The callback handler does nothing other than walk the global list, marking each blockif_sig_elem complete and kicking the corresponding bse_cond (which wakes the ioemu thread). It takes no action with respect to the blkio thread.
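
For completeness, the handler is roughly (paraphrased from block_if.c):

    /* Runs on the mevent thread when kqueue reports SIGCONT: pop and complete
     * every parked blockif_sig_elem, waking each waiting ioemu thread.  Note
     * that nothing here touches the blkio thread. */
    static void
    blockif_sigcont_handler(int signal, enum ev_type type, void *arg)
    {
            struct blockif_sig_elem *bse;

            for (;;) {
                    do {
                            bse = blockif_bse_head;
                            if (bse == NULL)
                                    return;
                    } while (!atomic_cmpset_ptr((uintptr_t *)&blockif_bse_head,
                                                (uintptr_t)bse,
                                                (uintptr_t)bse->bse_next));

                    pthread_mutex_lock(&bse->bse_mtx);
                    bse->bse_pending = 0;
                    pthread_cond_signal(&bse->bse_cond);
                    pthread_mutex_unlock(&bse->bse_mtx);
            }
    }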

The only way I can see this working on FreeBSD is that receiving the SIGCONT causes the system call which the blkio thread is currently in to return with EINTR (or similar), while the actual signal event is delivered to the mevent thread via the kevent. This has a subtle dependency on the ordering of the events (the system call must return before the signal handler callback is called) and is not something which is made clear in any of the documentation I've been able to find.
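
One way to probe that hypothesis outside hyperkit is a small standalone experiment along these lines (entirely illustrative; it leaves out the kqueue registration, which may itself affect the outcome, so it only approximates the real setup):

    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int pipefd[2];

    /* Stands in for the blkio thread blocked in a long system call. */
    static void *
    blocker(void *arg)
    {
            char c;
            ssize_t n = read(pipefd[0], &c, 1);
            int e = errno;

            (void)arg;
            printf("read returned %zd (%s)\n", n,
                n < 0 ? strerror(e) : "no error");
            return NULL;
    }

    int
    main(void)
    {
            pthread_t t;

            /* Mirror block_if.c, which ignores SIGCONT and consumes it via
             * kqueue instead. */
            signal(SIGCONT, SIG_IGN);
            pipe(pipefd);
            pthread_create(&t, NULL, blocker, NULL);
            sleep(1);                   /* let blocker enter read(2) */
            pthread_kill(t, SIGCONT);   /* what blockif_cancel does */
            sleep(1);
            printf("done (no output above means the read was not interrupted)\n");
            return 0;
    }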

I'm also not sure what happens if the blkio thread is merely on its way to calling the system call at the point where the cancellation signal arrives. It seems like it would then block when it actually made the call. That might be harmless (since things rely on these calls returning via their normal return path to signal the error) or it might result in things not being cancelled as expected.

This needs more thought and investigation; I shall ask on the bhyve list, hence setting down my understanding here.

ijc commented Dec 12, 2016

Other possibly useful refs:

ijc commented Dec 12, 2016
