cloud-init 24.4 prevents Ubuntu 20.04 Digital Ocean droplet from booting #5945
@crankycyclops Please include the required logs in the bug report. Also, are you sure that the version 10.4 is correct? That doesn't seem right to me.
Sorry, I meant 24.4. I updated the title of the issue. I also added the logs from that day.
Yup, same here. All of my Ubuntu servers can't boot normally and get frozen at the cloud-init stage. I've seen the same reports on many forums, for example: https://forum.virtualmin.com/t/cloud-init-network-stage-hangs/131292/15 Downgrading to cloud-init=23.1.2-0ubuntu0~20.04.2 solves the issue, and I've frozen that package on all of my servers for now.
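For anyone wanting to do the same, the downgrade and hold looks roughly like this (a sketch, assuming the 20.04 archive still serves that exact version; run as root):

```bash
# Downgrade cloud-init to the known-good version mentioned above
apt-get update
apt-get install --allow-downgrades cloud-init=23.1.2-0ubuntu0~20.04.2
# Freeze the package so unattended upgrades don't pull 24.4 back in
apt-mark hold cloud-init
```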
We had non-working DNS, so I started digging around and realized that we had networkd activated. It seems that cloud-init did this, but it actually failed to bring up the network on the first attempt, leading to resolved not working properly. About the logs: I won't publish them online, as there are some public IPs and other details in there (dating back one year!) that I wouldn't like to share. Maybe I can recreate the issue on a non-public, "younger" machine; I'm not sure when I'll find the time. However, I think I can share an excerpt from the cloud-init logs:
and systemd-networkd-wait-online.service:
and
Feel free to ask about additional specific log entries. Best, Andreas
Thanks for the log snippets. Would it be possible to get the results of
along with the contents of
and then attach to this issue? This shouldn't contain any sensitive data unless it was manually written there.
This is fairly difficult to test, but I /think/ this all comes down to e30549e, since prior to that it was tagged as an
The thing is, it's fairly difficult to trigger, since the requirements to trigger the check itself are fairly stringent. Azure, for example, only uses
Now, what is somewhat interesting is that it looks like the installer .isos (server and desktop) leave a ds-identify.cfg in images installed via that method that will force c-i to run even if no ds is found, which would trigger this issue if there was /also/ a network problem that caused w-o to fail. That isn't terribly hard to do, though: just having a disabled NIC or something without letting netplan know would cause the DHCP waiting period to hit a timeout/exit, killing c-i and hanging the instance.
All this being said... I don't have a great idea for a solution. As a hacky fix, just changing the systemctl start to ignore the exit code so c-i doesn't fail would work, I think? I'd need to do some more testing.
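Roughly, the idea would amount to something like this (shell equivalent only, a sketch of the proposed behaviour; the real change would be in the Python code path where cloud-init invokes systemctl):

```bash
# Proposed hacky fix, sketched as shell: treat a failed wait-online as non-fatal
# so cloud-init carries on instead of failing and hanging the boot.
systemctl start systemd-networkd-wait-online.service || true
```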
A reproducer or full journal logs + cloud-init logs would make this much easier to debug.
Did you observe that? The subiquity autoinstaller disables cloud-init on first boot with
That wouldn't be reliable. The whole point of calling that binary was to enable cloud-init to run earlier in boot, but only when safe to do so. If this command fails, all bets are off on the "safe to do so" part. Cloud-init could always loop checking the systemd service status when the command fails, but that would obviously not be a great solution.
That's essentially what cloud-init already does. We log an error and the process will return 1, but a failure of s-n-w-o doesn't cause a cloud-init crash or early exit. In the DO case, I have a hunch (but no proof yet) that instances that don't come up are experiencing a dependency loop in systemd. DO's vendor-data writes out a network service file and performs a systemd daemon-reload. That could theoretically be problematic, but I can't reproduce the behavior on a fresh instance. It's likely that only older instances are affected because of the image configuration or vendor data that was used at the time (the c-i logs pasted here are using a deprecated datasource).
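If someone on an affected instance (or from the recovery environment) can check, a couple of generic systemd checks should show whether an ordering cycle is actually involved (illustrative commands, not DO-specific):

```bash
# systemd logs "Found ordering cycle on ..." at boot when it has to break a dependency loop
journalctl -b | grep -i "ordering cycle"
# Check units dropped in under /etc/systemd/system (e.g. by vendor-data) for obvious problems
systemd-analyze verify /etc/systemd/system/*.service
```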
I might try to get the logs tomorrow; I'm kinda busy at the moment. They all have a network interface configured with netplan, the IPs are set manually, IPv6/DHCP4/DHCP6 are disabled, and they use public-facing addresses.
Dear all, please find attached the logs. I added the journal log since boot, and redacted some information (boot.log.tar.gz).
If you can suggest another way of sharing logs other than a public GitHub issue, I am happy to do so.
A word on the network manager side: normally we use neither netplan nor any of eni, systemd-networkd, or NetworkManager on our side. I think the network configuration is coming from cloud-init. However, on our side it seems cloud-init tries to start networkd and fails, which leads to resolved failing and DNS not working, even though the network itself seemed to be configured correctly. Best, Andreas
Bug report
On Digital Ocean, after updating cloud-init from 24.3.1 to 24.4 on two Ubuntu 20.04 droplets and rebooting, the boot process got stuck in init, and I couldn't log in even through the recovery console. I had to boot into recovery mode and re-install 24.3.1 over 24.4 to get them working again.
Steps to reproduce the problem
I don't know if this affects all Ubuntu 20.04 droplets or not, but it killed two of mine.
Temporary workaround: launch the recovery CD, chroot into the mounted filesystem, re-install 24.3.1 over 24.4, then boot the droplet as normal. With the old version, the droplet boots up normally again.
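For reference, the recovery steps look roughly like this (a sketch; the device name, mount point, and package revision are illustrative and depend on your droplet's layout and the archive):

```bash
# From the Digital Ocean recovery environment: mount the droplet's root filesystem
mount /dev/vda1 /mnt
for d in dev proc sys; do mount --bind "/$d" "/mnt/$d"; done
# Find the exact 24.3.1 package revision the archive offers, then downgrade to it
chroot /mnt apt-get update
VER="$(chroot /mnt apt-cache madison cloud-init | awk '/24\.3\.1/ {print $3; exit}')"
chroot /mnt apt-get install --allow-downgrades "cloud-init=${VER}"
# Boot the droplet normally again
reboot
```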
Environment details
cloud-init logs