
race btw blkid and destroy_vbd_frontend can cause hang #41

Open
zultron opened this issue Oct 29, 2013 · 4 comments

zultron commented Oct 29, 2013

On EL6:

When building a PV VM with pygrub, create_vbd_frontend attaches the VM's boot block device to dom0 for pygrub to operate on. This triggers udev to start blkid.

If blkid does not finish before pygrub, destroy_vbd_frontend will fail to close the device, since blkid is holding it open.

After this, things go badly: the task hangs, the VDI remains attached to dom0, the blkid process cannot be killed, and a reboot is required; but the reboot then hangs while stopping the 'blk-availability' service, so the host must be power cycled.

The following links suggest running something like 'udevadm settle', which will wait for the udev event queue to empty, and then exit:

https://www.redhat.com/archives/libguestfs/2012-February/msg00023.html

https://rwmj.wordpress.com/2012/01/19/udev-unexpectedness/#content

As a cheap hack, I added 'udevadm settle' to the end of the pygrub script, and the problem seems to have disappeared. Of course pygrub isn't the right place for this, but I'm not sure what is. The above links suggest that 'udevadm settle' can be run too early, before the event has been placed in the udev queue, so perhaps the call belongs in destroy_vbd_frontend.
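
Roughly, the hack amounts to nothing more than a settle call tacked onto the end of the pygrub script; a minimal sketch, where the 30-second timeout and the use of Python's subprocess module are illustrative choices rather than the exact lines used:

```python
# Illustrative stopgap appended at the end of the pygrub script: wait for
# udev's event queue to drain, so that any blkid run triggered by the VBD
# appearing in dom0 has finished and released the block device before
# xenopsd tries to detach it.
import subprocess

# --timeout bounds the wait; 30 seconds is an arbitrary illustrative value.
subprocess.call(["udevadm", "settle", "--timeout=30"])
```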

zultron commented Oct 29, 2013

djs55 encountered the same issue, as he describes in this ticket, which no longer exists:

http://webcache.googleusercontent.com/search?q=cache:rSRkkPeQ0FoJ:https://github.com/djs55/xenopsd/issues/30+

Here's a copy in case Google's cache entry expires:

djs55 opened this issue 4 months ago
qdisk: "Device in use; refusing to close" triggers segfault
No milestone
No one is assigned

It looks like qemu isn't resilient to the guest writing error nodes in xenstore:

Jul  1 13:48:10 st30 xenopsd-xenlight: [xenops] xenstore-write /local/domain/0/backend/qdisk/0/51792/online = 0
Jul  1 13:48:10 st30 xenopsd-xenlight: [xenops] Device.del_device setting backend to Closing
Jul  1 13:48:10 st30 xenopsd-xenlight: [xenops] Device.Generic.clean_shutdown_wait frontend (domid=0 | kind=vbd | devid=51792); backend (domid=0 | kind=qdisk | devid=51792)
Jul  1 13:48:10 st30 kernel: vbd vbd-51792: 16 Device in use; refusing to close
Jul  1 13:48:10 st30 kernel: qemu-system-i38[1563]: segfault at 878 ip 00007fd007514edf sp 00007fff97753850 error 6 in qemu-system-i386[7fd00749c000+309000]

zultron commented Oct 30, 2013

Well, this time it occurred when a VM whose install had failed before the disk was partitioned was rebooted. The bootloader had been switched from 'eliloader' to 'pygrub', and pygrub failed because it could not find the partition. The difference this time is that there was no hung 'blkid' process, nor anything identifiable by 'lsof' holding the device open.

djs55 commented Oct 30, 2013

My preferred long-term fix for this is to avoid attaching the device to dom0, and use a userspace app to read it, possibly via the NBD protocol talking to tapdisk or qemu.

In the short term a 'udevadm settle' like you suggest sounds good. I think it should live in xenopsd just before we attempt to unplug the device, probably here:

https://github.com/xapi-project/xenopsd/blob/master/xc/device.ml#L147

and

https://github.com/xapi-project/xenopsd/blob/master/xl/xenops_server_xenlight.ml#L859

What do you think, @robhoes?
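
To make the proposed ordering concrete, the idea is simply: settle the udev queue first, then attempt the unplug. A minimal sketch in Python; the `unplug` callable is a placeholder for whatever xenopsd actually does at the two locations linked above, not its real API, and the timeout is an illustrative choice:

```python
import subprocess


def settle_then_unplug(unplug, timeout_seconds=30):
    """Drain udev's event queue, then run the real unplug step.

    `unplug` stands in for xenopsd's actual device teardown (the code at
    the two locations linked above); it is a placeholder, not a real API.
    """
    # Any blkid run triggered when the VBD appeared in dom0 must finish
    # (and release the device) before the backend is asked to close it.
    subprocess.call(["udevadm", "settle", "--timeout=%d" % timeout_seconds])
    unplug()


if __name__ == "__main__":
    # Trivial demonstration with a stand-in unplug action.
    settle_then_unplug(lambda: print("unplugging device"))
```

Running the settle immediately before the unplug should also sidestep the timing concern from the earlier links: by that point the hotplug event for the device has certainly been placed in the udev queue.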

robhoes commented Nov 4, 2013

@djs55 Yes, that sounds good to me.

psafont pushed a commit to psafont/xenopsd that referenced this issue Jul 5, 2021