
Flaky timeout during vagrant up - Timed out while waiting for the machine to boot #13062

Open

nstng opened this issue Jan 16, 2023 · 21 comments


nstng commented Jan 16, 2023

Hi, I found a lot of issues regarding "Timed out while waiting for the machine to boot", but none that really fits our problem. As our setup is perhaps a bit unusual, I am mostly hoping for hints on how to debug this.

We are using Vagrant 2.3.4 with VirtualBox 6.1.38r153438 on macOS 12.6.2 in GitHub CI (i.e., specs are from here).

In our workflow we bring up multiple VMs. The guests are based on modified Ubuntu Focal base images that we load from Vagrant Cloud. In roughly one out of ten runs we get the following (i.e., the issue is flaky):

my_vm: SSH auth method: private key
Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.

The timeout happens after the default 5 minutes.

In runs without this issue, the SSH auth (boot) succeeds in less than a minute, so just waiting longer is probably not the solution here. Example:

Fri, 13 Jan 2023 16:43:36 GMT     my_vm: SSH auth method: private key
Fri, 13 Jan 2023 16:44:08 GMT ==> my_vm: Machine booted and ready!

For testing/debugging we implemented a retry mechanism: try up to three times to build the VM; on failure, vagrant destroy the VM and try again. When the problem occurs, we actually get the timeout three times in a row, which makes me think that something in the environment is causing it.
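
For reference, a minimal sketch of that retry loop (the machine name and attempt count are illustrative placeholders, not our exact CI script):

#!/usr/bin/env bash
# Hypothetical sketch of the retry mechanism described above: try "vagrant up"
# up to three times and destroy the half-built VM between attempts.
# (The 5-minute limit itself is config.vm.boot_timeout, which defaults to 300 seconds.)
set -u
VM_NAME="my_vm"   # illustrative machine name

for attempt in 1 2 3; do
  if vagrant up "${VM_NAME}"; then
    echo "machine booted on attempt ${attempt}"
    exit 0
  fi
  echo "boot timed out on attempt ${attempt}; destroying and retrying" >&2
  vagrant destroy -f "${VM_NAME}"
done

echo "vagrant up failed three times in a row" >&2
exit 1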

Any hints for debugging this are appreciated. Let me know what additional information would be helpful.

Expected behavior

Booting the VMs successfully in all CI runs.

Actual behavior

Roughly one out of ten runs fails as described above.

Reproduction information

Vagrant version

vagrant 2.3.4 with VirtualBox 6.1.38r153438

Host operating system

macOS 12.6.2 in GitHub CI

Guest operating system

Ubuntu focal (modified base image)

Steps to reproduce

See above

Vagrantfile

Will provide it if the answers suggest that this is relevant.


appst commented Jan 19, 2023

I too have sporadic boot failures.
I have written a bash script that loops through a minimal VM spec with "vagrant up", "vagrant halt", and "vagrant destroy -f".
I added time delays between all stages.
Most of the time it works, but it always eventually ends in failure.
A similar bash script using only VirtualBox commands to the same effect does not fail.
The GUI is completely black, which I believe indicates that the VM is not even booting.
Debug traces have not enlightened me.
I believe there may be a bug lurking in there somewhere.
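
For reference, a sketch of the kind of loop I am running, assuming a Vagrantfile with a single minimal VM in the current directory (delays and iteration count are placeholders, not my exact script):

#!/usr/bin/env bash
# Hypothetical stress loop: repeatedly up, halt, and destroy a minimal VM,
# with delays between stages, until a boot timeout eventually occurs.
set -u

for i in $(seq 1 100); do
  echo "=== iteration ${i} ==="
  vagrant up || { echo "boot failed on iteration ${i}" >&2; exit 1; }
  sleep 10
  vagrant halt
  sleep 10
  vagrant destroy -f
  sleep 10
done

The VirtualBox-only variant mentioned above drives the same cycle with VBoxManage (startvm / controlvm poweroff / unregistervm --delete) instead of going through Vagrant.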


appst commented Jan 19, 2023

PS: I have tested this on four systems, all with ample resources (e.g., AMD 5950X with 64 GB memory).
Three of the systems use Windows 10.
One uses Ubuntu 20.04.1 (KDE).
All use VirtualBox 6.1.40r154048 and Vagrant 2.3.4.
All eventually fail when running Vagrant in a loop.


appst commented Jan 19, 2023

PS: The guest OS I am using is Ubuntu 20.04.
That is probably not relevant.
The important thing is that it works most of the time, but eventually fails.


appst commented Jan 19, 2023

BTW: This has plagued me for months, if not years.
I have always "fixed" it by simply re-provisioning a failed attempt.
Lately, I am trying to fully automate Vagrant, and have thus been doing this testing.
This needs to be resolved!

phinze (Contributor) commented Jan 20, 2023

Hi @nstng and @appst - sorry to hear about the sporadic failures you're seeing. Sporadic issues are always tough to debug, so more information will be helpful for us to narrow it down. Can either of you share a minimal vagrantfile that reproduces the timeouts for you and/or the debug output from one of the timeouts?
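
For the debug output, one way to capture it is to run the failing vagrant up with debug logging enabled and redirect everything to a file (the file name is just an example):

VAGRANT_LOG=debug vagrant up > vagrant-up-debug.log 2>&1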


appst commented Jan 22, 2023 via email

nstng (Author) commented Jan 23, 2023

Hi @appst, thank you - I would appreciate it if you could take over providing a minimal example. As said, our setup is complex and it would take me some time to cut out the unnecessary parts and test it.
Let me know if I should take over again.


appst commented Jan 23, 2023 via email

@hholzgra

I have similar problems, but mostly only when starting multiple VMs at around the same time.

I am building base boxes for every minor release of our own software, using a GNU make infrastructure. With a non-parallel build this usually succeeds for all 350+ VMs, but it obviously takes its time.

When utilizing all cores, e.g. with "make -j8" on my 8 core AMD Ryzen desktop machine, I'll only see one or two VM boxes failing to build on a good day, which then succeed in a second run. On a bad day it is more like one in ten or more failing, and it sometimes takes running "make" three or four times until all boxes have been rebuilt.

I've tried to add a random delay of 0 to 20 seconds at the start of the actual build script invoked by "make", so that "vagrant up" of multiple machines is less likely to happen in the very same second. That seems to have improved the situation a little, but not fully ...
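
The staggering is nothing more than a random sleep at the top of the per-box build script, roughly like this (a sketch, not the exact script):

# stagger parallel "vagrant up" calls by 0-20 seconds so that multiple VMs
# are less likely to start booting in the very same second
sleep $(( RANDOM % 21 ))
vagrant up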

I can reproduce this on three different machines, one AMD Ryzen 9, one AMD Ryzen 7, and one Intel Core i7, all running the latest Vagrant and VirtualBox by now, on either Ubuntu 20.04 or 22.04.

The guest OSes are mostly different Ubuntu LTS releases, plus a very small number of other Linux distributions (Debian, openSUSE, CentOS, AlmaLinux and Rocky Linux).


appst commented Jan 26, 2023 via email

phinze (Contributor) commented Jan 27, 2023

Hi Kendal,

Thanks for the continued updates on your testing. You're asking the same question that was on my mind about the hangs: is this at the VirtualBox layer or the Vagrant layer? The difference in behavior between vanilla VBoxManage and Vagrant invocations of the same VMDK is interesting.

This is also quite interesting:

What I have noticed so far, in my environment, is that generic boxes from HashiCorp or Canonical do not hang,
while what I build with Packer eventually hangs 2, or 10, or 50, or so iterations in.

I'd love it if you could share a minimal Vagrantfile and VMDK (or Packer config) that reproduces the issue so we could try and reproduce the hang on our side.

As for your question about the KVM: VCPU 0 line: I'm pretty sure that VCPUs are zero-indexed, so I would expect that to be a line referencing the first VCPU. It's curious that the line does not show up in your successful runs though.

Thanks for all your work on this so far!


appst commented Jan 27, 2023 via email

nstng (Author) commented Feb 2, 2023

Hi, the following is not really a minimal example, but some statistics on what we see in our setup.

setup

The following runs are based on a test workflow on my fork, where

  • the VM defined at https://github.com/magma/magma/blob/bef01eb349e294dee1ebccb6a323d89278158b51/lte/gateway/Vagrantfile#L72 is created via vagrant up
    • the VirtualBox provider is used
    • Ansible provisioning is disabled in my tests
    • this is our "smallest" (least complex) base VM
  • vagrant up/destroy is looped 20 times
    • for each iteration, vagrant up is attempted up to three times
    • (i.e., 20 to 60 vagrant up/destroy cycles in total)
  • For a successful vagrant up, we can see here that booting the VM takes less than a minute.
  • The host is a GitHub runner using Vagrant 2.3.4 with VirtualBox 6.1.38r153438 on macOS 12.6.2 (i.e., specs are from here).

runs


appst commented Feb 7, 2023 via email

@hicham1se

I have the same problem with the same box: ubuntu/focal64.
I think if everyone specifies the box they're working with, we can understand more about this VM behavior.


appst commented Mar 19, 2023 via email


hicham1se commented Mar 19, 2023 via email

@hholzgra

I have been seeing the problem with all kinds of ubuntu/*64 base boxes, at least back to ubuntu/trusty64, and also with centos/6 and centos/7 base boxes, too.


appst commented Mar 20, 2023 via email

@meetAssassin

I have been seeing the problem with all kinds of ubuntu/*64 base boxes, at least back to ubuntu/trusty64, and also with centos/6 and centos/7 base boxes, too.

I have been having the same kind of issue with all ubuntu/64 boxes. The only ubuntu/64 box working for me is the ubuntu/jammy64 one. If you have found any kind of solution, please let me know.


appst commented Nov 2, 2023 via email
