Flaky timeout during vagrant up - Timed out while waiting for the machine to boot #13062
Comments
I too have sporadic boot failures.
PS: I have tested this on four systems, all with ample resources, e.g. an AMD 5950X with 64 GB of memory.
PS: The guest OS I am using is Ubuntu 20.04.
BTW: This has plagued me for months, if not years.
Hi @nstng and @appst - sorry to hear about the sporadic failures you're seeing. Sporadic issues are always tough to debug, so more information will be helpful for us to narrow it down. Can either of you share a minimal vagrantfile that reproduces the timeouts for you and/or the debug output from one of the timeouts?
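For reference, one way to capture the requested debug output is to raise Vagrant's log level while reproducing the timeout; the log file name below is arbitrary:

# Capture full debug output from a single reproduction attempt
VAGRANT_LOG=debug vagrant up > vagrant-up-debug.log 2>&1
# Equivalently, the --debug flag can be used:
vagrant up --debug > vagrant-up-debug.log 2>&1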
Hi Paul,
I am in the midst of doing further testing on this and I will get back to you with where it all ends up.
Thanks!
Rolande Kendal
Hi @appst, thank you - I'd appreciate it if you could take over providing a minimal example. As mentioned, our setup is complex and it would take me some time to cut out the unnecessary parts and test them.
Hi Nils,
My setup is rather complex too, but I am working diligently towards narrowing down my problem right now. I am days into trying to simplify and isolate things, and I am still looking for the smoking gun.
My testing is basically running hundreds of iterations of launching a Vagrant VM. What I have noticed so far, in my environment, is that a generic box from HashiCorp or Canonical does not hang, while what I build with Packer eventually hangs 2, or 10, or 50, or so iterations in. I don't know if that relates to the problems you are experiencing.
I will let you know where things lead.
Kendal
I have similar problems, but mostly only when starting multiple VMs at around the same time. I am building base boxes for each minor release we have ever made of our own software, using a GNU make infrastructure. With a non-parallel build this usually succeeds for all 350+ VMs, but obviously takes its time. When utilizing all cores, e.g. with "make -j8" on my 8-core AMD Ryzen desktop machine, I'll only see one or two VM boxes failing to build on a good day, which then succeed in a second run. On a bad day it is more like one in ten or more failing, and it sometimes takes running "make" three or four times until all boxes have been rebuilt.
I've tried adding a random delay of 0 to 20 seconds at the start of the actual build script invoked by "make", so that "vagrant up" of multiple machines is less likely to happen at the very same second (see the sketch below). That seems to have improved the situation a little bit, but not fully.
I can reproduce this on three different machines - one AMD Ryzen 9, one AMD Ryzen 7, and one Intel Core i7 - all running the latest Vagrant and VirtualBox by now, on either Ubuntu 20.04 or 22.04. The guest OSes are mostly different Ubuntu LTS releases, plus a very small number of other Linux distributions (Debian, openSUSE, CentOS, Alma and Rocky Linux).
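A minimal sketch of that staggering trick, assuming the per-box build script invoked by make is a shell script that eventually calls vagrant up (the 0-20 second range mirrors what is described above):

#!/usr/bin/env bash
# Stagger parallel builds so "make -j8" does not run several "vagrant up"
# invocations in the same second.
sleep "$(( RANDOM % 21 ))"   # random delay between 0 and 20 seconds

vagrant up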
Hi Paul,
I have been diligently trying to dig to the bottom of this issue. My recent tests have launched into the thousands of instances. Note that I am talking about serially launched instances, with the previous one fully shut down before the next begins.
I posted my findings thus far here:
https://forums.virtualbox.org/viewtopic.php?f=7&t=108454
I'm not confident it is the fault of VirtualBox, however, but perhaps how Vagrant uses VirtualBox. The reason I say that is that my tests include launching instances with VBoxManage alone, and those don't hang. It is when I launch the box, which incorporates the same vmdk file, through Vagrant that the hanging begins. Sometimes it hangs on the first or second instance; sometimes it won't happen until the fiftieth, or so, instance.
As a pure shot-in-the-dark question... the log of a failed instance contains the following line:
00:00:05.923578 GIM: KVM: VCPU 0: Enabled system-time struct.
I just know that an instance with zero CPUs allotted to it would not boot, and that perhaps another line with "VCPU 1" should appear in the logs, but it does not.
I thank you for your interest in this, and let me know if I can help from my end.
kendal
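For cross-checking those VCPU lines, one option is to grep the VirtualBox log of the machine that Vagrant created; the path below assumes the default machine folder and an illustrative VM name:

# Inspect the VBox.log of a run for VCPU-related lines
VM_NAME="my-vm"   # placeholder: use the name VirtualBox shows for the Vagrant machine
grep -i "vcpu" "$HOME/VirtualBox VMs/$VM_NAME/Logs/VBox.log"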
Hi Kendal,
Thanks for the continued updates on your testing. You're asking the same question that was in my mind about the hangs - is this at the VirtualBox layer or the Vagrant layer? The difference in behavior between vanilla VBoxManage and Vagrant invocations of the same VMDK is interesting.
This is also quite interesting:
> What I have noticed so far, in my environment, is that a generic box from HashiCorp or Canonical does not hang, while what I build with Packer eventually hangs 2, or 10, or 50, or so iterations in.
I'd love it if you could share a minimal Vagrantfile and VMDK (or Packer config) that reproduces the issue so we could try and reproduce the hang on our side.
As for your question about the KVM: VCPU 0 line: I'm pretty sure that VCPUs are zero-indexed, so I would expect that to be a line referencing the first VCPU. It's curious that the line does not show up in your successful runs, though.
Thanks for all your work on this so far!
Paul,
The different log clippings I pointed out are from running "vagrant up". There is nothing to report from using vanilla VBoxManage directly, since that does not hang.
My build process is currently riddled with tests and changes for this issue. I currently have all four of my test systems running without a hang for the last hour. That has never happened before. So, I may have isolated the issue.
The most fundamental change comes from an epiphany I had about the box.ovf file. My box files are built in these stages (a rough sketch of the last two steps follows below):
1) I make a box with Packer and add it to the local Vagrant install
2) I create an instance with the Packer box and further provision it
3) I export the instance with "VBoxManage export <name>"
4) I tar up the exported files into a box file
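A rough sketch of steps 3 and 4, assuming a VirtualBox-provider box; the VM name, file names and compression choice are placeholders rather than a confirmed recipe:

# Step 3: export the provisioned VM to OVF + VMDK
VBoxManage export "my-vm" --output box.ovf

# Step 4: package the exported files as a Vagrant box.
# A VirtualBox-provider box is a tarball; metadata.json tells Vagrant the provider.
# (Check the actual disk file name produced by the export.)
printf '{"provider":"virtualbox"}\n' > metadata.json
tar -czf my.box box.ovf box-disk001.vmdk metadata.json

# Add it to the local Vagrant install
vagrant box add --name mybox ./my.box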
I was under the assumption that the box.ovf file was static - I was not aware that VBoxManage exported a different copy, updated by VirtualBox to reflect the provisioning I did outside of Packer. I was thinking it was just the vmdk file that would change. In my environment, box.ovf was being updated to include KVM paravirtualization, and I was then launching an instance from that box on systems that have no KVM.
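A quick way to check for that, assuming the exported OVF records the paravirtualization provider in its VirtualBox-specific machine section (worth verifying against your own export):

# Look for a paravirtualization setting in the exported OVF
# (the exact element name may differ between VirtualBox versions)
grep -i "paravirt" box.ovf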
If this is the real problem, then hooray! I can clean up and move on. However, if that is the case, it is curious how instantiations on other systems mostly worked rather than consistently failing due to the bad configuration. It is also curious how I was noticing the hangs on the same system that the box was generated on, which has KVM. No cigars yet!
I am now using the original Packer box.ovf in my final box file. I have a lot of clean-up to do before I know whether or not that change alone made the difference.
I greatly appreciate your willingness to help out with this. In a few days, after I clean things up, I will let you know my final outcome. Perhaps my error with box.ovf aggravated things into an issue that you want to look into anyway - it seemed like the KVM handling could have been inconsistent in all iterations of my testing. If you wish to pursue that, let me know and I will send you whatever I can to help.
Cheerz!
Kendal
Hi, the following is not really a minimal example, but some statistics on what we see in our setup.
setup
The following runs are based on a test workflow on my fork, where:
runs
Hi Paul,
The majority of my Vagrant provisioning tests have involved the following...
vagrant init mybox
while true; do
  vagrant up
  sleep 60
  vagrant halt -f
  sleep 30
  vagrant destroy -f
  sleep 30
done
What I have noticed is that boxes built with VirtualBox --firmware=bios occasionally fail to boot (say after 10, 20, 50, or so, tries), while boxes built with VirtualBox --firmware=efi have never failed yet, across the couple of thousand iterations I have done so far. I am going to use the EFI firmware from now on and see where that leads (a sketch of the switch follows below).
Kendal
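For reference, switching to EFI firmware can be done either on an existing VM or from the Vagrantfile; the VM name is a placeholder, and whether EFI actually avoids the hang is only the observation above, not an established fix:

# Switch an existing VirtualBox VM to EFI firmware (VM must be powered off)
VBoxManage modifyvm "my-vm" --firmware efi

# The same setting applied from a Vagrantfile, written here as a heredoc
cat > Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "mybox"
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--firmware", "efi"]
  end
end
EOF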
I have the same problem with the same box: ubuntu/focal64. I think if everyone can specify the box they're working with, we can understand more about this VM behavior.
My impression is that it is not a box issue. It is my current feeling that this is an issue with how Vagrant handles VirtualBox. The reason I say so is that my testing of pure VirtualBox with my.vmdk does not fail; using the same my.vmdk, my testing with Vagrant using VirtualBox on top of Hyper-V DOES NOT fail, but my testing with Vagrant using VirtualBox without Hyper-V DOES eventually fail.
I reserve the right to change my mind later. ;-)
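One knob that may be worth experimenting with here, assuming the difference really is tied to the virtualization/paravirtualization interface: pinning VirtualBox's paravirtualization provider from the Vagrantfile. This is only a sketch of where the setting lives, not a confirmed fix:

# Pin the VirtualBox paravirtualization interface for the guest
# (valid values include none, default, legacy, minimal, hyperv, kvm)
cat > Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--paravirtprovider", "default"]
  end
end
EOF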
I'm still working on it; right now I'm trying to force a manual password prompt directly on the VM, but I'm starting to think it's the VM's problem. Let's give it some time to become clear.
And I reserve the right to change my mind as well. B-)
I have been seeing the problem with all kinds of ubuntu/*64 base boxes, at least back to ubuntu/trusty64, and also with centos/6 and centos/7 base boxes.
I think the box is a red herring. It is my belief that any box will eventually fail with Vagrant using VirtualBox and its native hypervisor.
I have been having the same kind of issue with all ubuntu/*64 boxes. The only ubuntu/*64 box working for me is the ubuntu/jammy64 one. If you found any kind of solution, please let me know.
Unfortunately I have never solved this issue. Currently, in development, I live with restarting the occasional crashed instance. For production, I know I have to migrate away from VirtualBox/Vagrant.
If you have any better luck, I would be happy to hear about it!
cheerz
Hi, I found a lot of issues regarding "Timed out while waiting for the machine to boot" - but none that really fits our current problem. As our setup is maybe a little bit special, I mostly hope for hints on how to debug this.
We are using Vagrant 2.3.4 with VirtualBox 6.1.38r153438 on macOS 12.6.2 in GitHub CI (i.e., the runner specs are from here). In our workflow we bring up multiple VMs. The guests are based on modified Ubuntu focal base images that we load from Vagrant Cloud. In roughly one out of ten runs we get the timeout (i.e., this issue is flaky).
The timeout happens after the default 5 minutes. In runs without this issue we see that the SSH auth (boot) succeeds in under a minute, i.e., just waiting longer might not be the solution here. Example:
For testing/debugging we implemented a retry mechanic: try at most three times to build the VM, and in case of a failure, vagrant destroy the VM and try again (sketched below). When the problem occurs with this in place, we actually get the timeout all three times, which makes me think that something in the environment is causing it.
Any hints for debugging this are appreciated. Let me know what additional information would be helpful.
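A minimal sketch of that retry wrapper, assuming a plain shell step in the CI job; the attempt count and messages are illustrative only:

#!/usr/bin/env bash
# Try "vagrant up" at most three times; destroy the half-built VM between attempts.
set -u

for attempt in 1 2 3; do
  echo "vagrant up, attempt ${attempt}"
  if vagrant up; then
    exit 0                     # boot succeeded
  fi
  vagrant destroy -f           # clean up before retrying
done

echo "vagrant up failed after 3 attempts" >&2
exit 1

Raising config.vm.boot_timeout above its 300-second default would be another lever, but given that successful boots finish in under a minute, the retry approach seems more relevant here.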
Expected behavior
Booting the VMs successfully in all CI runs.
Actual behavior
~ one out of ten runs fails, as described above
Reproduction information
Vagrant version
Vagrant 2.3.4 with VirtualBox 6.1.38r153438
Host operating system
macOS 12.6.2 in GitHub CI
Guest operating system
Ubuntu focal (modified base image)
Steps to reproduce
See above
Vagrantfile
Will provide if the answers suggest that this is relevant.