Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud-init 24.4 prevents Ubuntu 20.04 Digital Ocean droplet from booting #5945

Open
crankycyclops opened this issue Jan 3, 2025 · 10 comments
Labels
bug Something isn't working correctly incomplete Action required by submitter invalid This doesn't seem right

Comments

@crankycyclops
Copy link

crankycyclops commented Jan 3, 2025

Bug report

On Digital Ocean, after updating cloud-init from 24.3.1 to 24.4 on two Ubuntu 20.04 droplets and rebooting, the boot process got stuck in init, and I couldn't login in even through the recovery console. I had to boot into recovery mode and re-install 24.3.1 over 24.4 to get it working again.

Steps to reproduce the problem

I don't know if this affects all Ubuntu 20.04 droplets or not, but it killed two of mine.

  1. Use apt-get to update cloud-init to 24.4
  2. Reboot droplet
  3. Observe that the droplet gets stuck in init

Temporary workaround: launch the recovery CD, chroot into the mounted filesystem, re-install 24.3.1 over 24.4, then boot the droplet as normal. With the old version, the droplet boots up normally again.

Environment details

  • Cloud-init version: 24.4
  • Operating System Distribution: Ubuntu 20.04
  • Cloud provider, platform or installer type: Digital Ocean

cloud-init logs

2025-01-03 01:33:53,577 - log_util.py[DEBUG]: Cloud-init v. 24.4-0ubuntu1~20.04.1 running 'init-local' at Fri, 03 Jan 2025 01:33:53 +0000. Up 7.18 seconds.
2025-01-03 01:33:53,578 - main.py[INFO]: PID [1] started cloud-init 'init-local'.
2025-01-03 01:33:53,578 - main.py[DEBUG]: No kernel command line url found.
2025-01-03 01:33:53,578 - main.py[DEBUG]: Closing stdin
2025-01-03 01:33:53,580 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [640] 0 bytes
2025-01-03 01:33:53,580 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 102:4
2025-01-03 01:33:53,580 - util.py[DEBUG]: Reading from /var/lib/cloud/data/python-version (quiet=False)
2025-01-03 01:33:53,581 - util.py[DEBUG]: Reading 3 bytes from /var/lib/cloud/data/python-version
2025-01-03 01:33:53,581 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance/boot-finished
2025-01-03 01:33:53,582 - handlers.py[DEBUG]: start: init-local/check-cache: attempting to read from cache [check]
2025-01-03 01:33:53,582 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2025-01-03 01:33:53,582 - util.py[DEBUG]: Reading 33986 bytes from /var/lib/cloud/instance/obj.pkl
2025-01-03 01:33:53,588 - lifecycle.py[INFO]: DataSourceDigitalOcean is deprecated in 23.2 and scheduled to be removed in 28.2. Deprecated in favour of DataSourceConfigDrive.
2025-01-03 01:33:53,588 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_serial
2025-01-03 01:33:53,588 - stages.py[DEBUG]: restored from checked cache: DataSourceDigitalOcean
2025-01-03 01:33:53,589 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: restored from checked cache: DataSourceDigitalOcean
2025-01-03 01:33:53,589 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg (quiet=False)
2025-01-03 01:33:53,589 - util.py[DEBUG]: Reading 3766 bytes from /etc/cloud/cloud.cfg
2025-01-03 01:33:53,589 - util.py[DEBUG]: Attempting to load yaml from string of length 3766 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,600 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/99-digitalocean.cfg (quiet=False)
2025-01-03 01:33:53,600 - util.py[DEBUG]: Reading 168 bytes from /etc/cloud/cloud.cfg.d/99-digitalocean.cfg
2025-01-03 01:33:53,600 - util.py[DEBUG]: Attempting to load yaml from string of length 168 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,601 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/90_dpkg.cfg (quiet=False)
2025-01-03 01:33:53,601 - util.py[DEBUG]: Reading 257 bytes from /etc/cloud/cloud.cfg.d/90_dpkg.cfg
2025-01-03 01:33:53,601 - util.py[DEBUG]: Attempting to load yaml from string of length 257 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,602 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/90-digitalocean.cfg (quiet=False)
2025-01-03 01:33:53,602 - util.py[DEBUG]: Reading 4276 bytes from /etc/cloud/cloud.cfg.d/90-digitalocean.cfg
2025-01-03 01:33:53,602 - util.py[DEBUG]: Attempting to load yaml from string of length 4276 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,613 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/05_logging.cfg (quiet=False)
2025-01-03 01:33:53,613 - util.py[DEBUG]: Reading 2071 bytes from /etc/cloud/cloud.cfg.d/05_logging.cfg
2025-01-03 01:33:53,613 - util.py[DEBUG]: Attempting to load yaml from string of length 2071 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,617 - util.py[DEBUG]: Reading from /run/cloud-init/cloud.cfg (quiet=False)
2025-01-03 01:33:53,617 - util.py[DEBUG]: Reading 40 bytes from /run/cloud-init/cloud.cfg
2025-01-03 01:33:53,617 - util.py[DEBUG]: Attempting to load yaml from string of length 40 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,617 - util.py[DEBUG]: Attempting to load yaml from string of length 0 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,617 - util.py[DEBUG]: loaded blob returned None, returning default.
2025-01-03 01:33:53,618 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/cloud-config.txt (quiet=False)
2025-01-03 01:33:53,618 - util.py[DEBUG]: Reading 0 bytes from /var/lib/cloud/instance/cloud-config.txt
2025-01-03 01:33:53,618 - util.py[DEBUG]: Attempting to load yaml from string of length 0 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,618 - util.py[DEBUG]: loaded blob returned None, returning default.
2025-01-03 01:33:53,618 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/vendor-cloud-config.txt (quiet=False)
2025-01-03 01:33:53,618 - util.py[DEBUG]: Reading 548 bytes from /var/lib/cloud/instance/vendor-cloud-config.txt
2025-01-03 01:33:53,618 - util.py[DEBUG]: Attempting to load yaml from string of length 548 with allowed root types (<class 'dict'>,)
2025-01-03 01:33:53,622 - util.py[DEBUG]: Writing to /run/cloud-init/cloud-id-digitalocean - wb: [644] 13 bytes
2025-01-03 01:33:53,622 - util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/cloud-id' => '/run/cloud-init/cloud-id-digitalocean'
2025-01-03 01:33:53,623 - atomic_helper.py[DEBUG]: Atomically writing to file /run/cloud-init/instance-data-sensitive.json (via temporary file /run/cloud-init/tmp_kv3vwwl) - w: [600] 10566 bytes/chars
2025-01-03 01:33:53,623 - atomic_helper.py[DEBUG]: Atomically writing to file /run/cloud-init/instance-data.json (via temporary file /run/cloud-init/tmp81ivct2j) - w: [644] 3801 bytes/chars
2025-01-03 01:33:53,624 - stages.py[INFO]: Instance link already exists, not recreating it.
2025-01-03 01:33:53,624 - util.py[DEBUG]: Reading from /var/lib/cloud/instances/113496214/datasource (quiet=False)
2025-01-03 01:33:53,624 - util.py[DEBUG]: Reading 47 bytes from /var/lib/cloud/instances/113496214/datasource
2025-01-03 01:33:53,625 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/113496214/datasource - wb: [644] 47 bytes
2025-01-03 01:33:53,628 - util.py[DEBUG]: Writing to /var/lib/cloud/data/previous-datasource - wb: [644] 47 bytes
2025-01-03 01:33:53,628 - util.py[DEBUG]: Reading from /var/lib/cloud/data/instance-id (quiet=False)
2025-01-03 01:33:53,628 - util.py[DEBUG]: Reading 10 bytes from /var/lib/cloud/data/instance-id
2025-01-03 01:33:53,628 - stages.py[DEBUG]: previous iid found to be 113496214
2025-01-03 01:33:53,629 - util.py[DEBUG]: Writing to /var/lib/cloud/data/instance-id - wb: [644] 10 bytes
2025-01-03 01:33:53,629 - util.py[DEBUG]: Writing to /run/cloud-init/.instance-id - wb: [644] 10 bytes
2025-01-03 01:33:53,629 - util.py[DEBUG]: Writing to /var/lib/cloud/data/previous-instance-id - wb: [644] 10 bytes
2025-01-03 01:33:53,630 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 33986 bytes
2025-01-03 01:33:53,630 - stages.py[INFO]: Not re-loading configuration, instance id and datasource have not changed.
2025-01-03 01:33:53,630 - main.py[DEBUG]: [local] init will now be targeting instance id: 113496214. new=False
2025-01-03 01:33:53,631 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2025-01-03 01:33:53,632 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2025-01-03 01:33:53,632 - util.py[DEBUG]: Reading 65 bytes from /var/lib/cloud/data/set-hostname
2025-01-03 01:33:53,632 - cc_set_hostname.py[DEBUG]: No hostname changes. Skipping set_hostname
2025-01-03 01:33:53,633 - util.py[DEBUG]: Reading from /sys/class/net/lo/address (quiet=False)
2025-01-03 01:33:53,633 - util.py[DEBUG]: Reading 18 bytes from /sys/class/net/lo/address
2025-01-03 01:33:53,633 - util.py[DEBUG]: Reading from /sys/class/net/eth0/address (quiet=False)
2025-01-03 01:33:53,633 - util.py[DEBUG]: Reading 18 bytes from /sys/class/net/eth0/address
2025-01-03 01:33:53,634 - util.py[DEBUG]: Reading from /var/lib/cloud/hotplug.enabled (quiet=False)
2025-01-03 01:33:53,634 - util.py[DEBUG]: File not found: /var/lib/cloud/hotplug.enabled
2025-01-03 01:33:53,634 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}
2025-01-03 01:33:53,634 - stages.py[DEBUG]: Event Denied: scopes=['network'] EventType=boot
2025-01-03 01:33:53,634 - util.py[DEBUG]: Reading from /var/lib/cloud/hotplug.enabled (quiet=False)
2025-01-03 01:33:53,634 - util.py[DEBUG]: File not found: /var/lib/cloud/hotplug.enabled
2025-01-03 01:33:53,634 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}
2025-01-03 01:33:53,634 - stages.py[DEBUG]: Event Denied: scopes=['network'] EventType=boot-legacy
2025-01-03 01:33:53,634 - stages.py[DEBUG]: No network config applied. Neither a new instance nor datasource network update allowed
2025-01-03 01:33:53,634 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'type': 'physical', 'mac_address': '1e:f4:12:49:d9:ed', 'name': 'eth0', 'subnets': [{'type': 'static', 'control': 'auto', 'address': '159.89.146.229', 'gateway': '159.89.144.1', 'netmask': '255.255.240.0'}, {'type': 'static', 'control': 'auto', 'address': '2604:A880:0002:00D0:0000:0000:21AF:8001/64', 'gateway': '2604:a880:0002:00d0:0000:0000:0000:0001'}, {'type': 'static', 'control': 'auto', 'address': '10.46.0.6', 'netmask': '255.255.0.0'}]}, {'type': 'nameserver', 'address': ['67.207.67.3', '67.207.67.2']}]}
2025-01-03 01:33:53,634 - util.py[DEBUG]: Reading from /sys/class/net/eth0/device/device (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 7 bytes from /sys/class/net/eth0/device/device
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/lo/addr_assign_type (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 2 bytes from /sys/class/net/lo/addr_assign_type
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/lo/uevent (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 23 bytes from /sys/class/net/lo/uevent
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/lo/address (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 18 bytes from /sys/class/net/lo/address
2025-01-03 01:33:53,635 - net[DEBUG]: ovs-vsctl not in PATH; not detecting Open vSwitch interfaces
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/lo/device/device (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/eth0/addr_assign_type (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 2 bytes from /sys/class/net/eth0/addr_assign_type
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/eth0/uevent (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 25 bytes from /sys/class/net/eth0/uevent
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading from /sys/class/net/eth0/address (quiet=False)
2025-01-03 01:33:53,635 - util.py[DEBUG]: Reading 18 bytes from /sys/class/net/eth0/address
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading from /sys/class/net/eth0/device/device (quiet=False)
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading 7 bytes from /sys/class/net/eth0/device/device
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading from /sys/class/net/lo/operstate (quiet=False)
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading 8 bytes from /sys/class/net/lo/operstate
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading from /sys/class/net/eth0/operstate (quiet=False)
2025-01-03 01:33:53,636 - util.py[DEBUG]: Reading 5 bytes from /sys/class/net/eth0/operstate
2025-01-03 01:33:53,636 - subp.py[DEBUG]: Running command ['ip', '-6', 'addr', 'show', 'permanent', 'scope', 'global'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 01:33:53,644 - subp.py[DEBUG]: Running command ['ip', '-4', 'addr', 'show'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 01:33:53,648 - net[DEBUG]: Detected interfaces {'lo': {'downable': False, 'device_id': None, 'driver': None, 'mac': '00:00:00:00:00:00', 'name': 'lo', 'up': True}, 'eth0': {'downable': True, 'device_id': '0x0001', 'driver': 'virtio_net', 'mac': '1e:f4:12:49:d9:ed', 'name': 'eth0', 'up': False}}
2025-01-03 01:33:53,648 - net[DEBUG]: no work necessary for renaming of [['1e:f4:12:49:d9:ed', 'eth0', 'virtio_net', '0x0001']]
2025-01-03 01:33:53,648 - main.py[DEBUG]: Network connectivity determined necessary for cloud-init's network stage. Reason: non-cloud-config user data found in vendor data
2025-01-03 01:33:53,648 - main.py[DEBUG]: [local] Exiting. datasource DataSourceDigitalOcean not in local mode.
2025-01-03 01:33:53,649 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2025-01-03 01:33:53,649 - util.py[DEBUG]: Reading 10 bytes from /proc/uptime
2025-01-03 01:33:53,649 - atomic_helper.py[DEBUG]: Atomically writing to file /var/lib/cloud/data/status.json (via temporary file /var/lib/cloud/data/tmpy6d7i2ws) - w: [644] 503 bytes/chars
2025-01-03 01:33:53,649 - performance.py[DEBUG]: cloud-init stage: 'init-local' took 0.194 seconds
2025-01-03 01:33:53,650 - handlers.py[DEBUG]: finish: init-local: SUCCESS: searching for local datasources
2025-01-03 01:33:54,233 - log_util.py[DEBUG]: Cloud-init v. 24.4-0ubuntu1~20.04.1 running 'init' at Fri, 03 Jan 2025 01:33:54 +0000. Up 7.84 seconds.
2025-01-03 01:33:54,233 - main.py[INFO]: PID [1] started cloud-init 'init'.
2025-01-03 01:33:54,233 - main.py[DEBUG]: No kernel command line url found.
2025-01-03 01:33:54,233 - main.py[DEBUG]: Closing stdin
2025-01-03 01:33:54,236 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [640] 0 bytes
2025-01-03 01:33:54,236 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 102:4
2025-01-03 01:33:54,236 - util.py[DEBUG]: Reading from /var/lib/cloud/data/python-version (quiet=False)
2025-01-03 01:33:54,236 - util.py[DEBUG]: Reading 3 bytes from /var/lib/cloud/data/python-version
2025-01-03 01:33:54,237 - main.py[DEBUG]: Will wait for network connectivity before continuing
2025-01-03 01:33:54,238 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2025-01-03 01:33:54,239 - activators.py[DEBUG]: Using selected activator: <class 'cloudinit.net.activators.NetplanActivator'> from priority: ['netplan', 'eni', 'network-manager', 'networkd']
2025-01-03 01:33:54,239 - subp.py[DEBUG]: Running command ['systemctl', 'is-enabled', 'NetworkManager.service'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 01:33:54,251 - performance.py[DEBUG]: Running ['systemctl', 'is-enabled', 'NetworkManager.service'] took 0.012 seconds
2025-01-03 01:33:54,252 - subp.py[DEBUG]: Running command ['systemctl', 'start', 'systemd-networkd-wait-online.service'] with allowed return codes [0] (shell=False, capture=True)
@crankycyclops crankycyclops added bug Something isn't working correctly new An issue that still needs triage labels Jan 3, 2025
@holmanb
Copy link
Member

holmanb commented Jan 6, 2025

@crankycyclops Please include the required logs in the bug report. Also you sure that3 the version 10.4 is correct? That doesn't seem right to me.

@holmanb holmanb added incomplete Action required by submitter invalid This doesn't seem right and removed new An issue that still needs triage labels Jan 6, 2025
@crankycyclops crankycyclops changed the title cloud-init 10.4 prevents Ubuntu 20.04 Digital Ocean droplet from booting cloud-init 24.4 prevents Ubuntu 20.04 Digital Ocean droplet from booting Jan 6, 2025
@crankycyclops
Copy link
Author

@crankycyclops Please include the required logs in the bug report. Also you sure that3 the version 10.4 is correct? That doesn't seem right to me.

Sorry, I meant 24.4. I updated the title of the issue. I also added the logs from that day.

@ltvkit
Copy link

ltvkit commented Jan 8, 2025

yup, same here. All of my Ubuntu servers cant boot normally, and get frozen on cloud-init stage.
If I boot into recovery, the server boots up fine.

And I've seen on many forums the same. Example: https://forum.virtualmin.com/t/cloud-init-network-stage-hangs/131292/15

Downgrading to cloud-init=23.1.2-0ubuntu0~20.04.2 solves the issue. And I've frozen that package on all of my servers for now.
This has been an issue for ~3 weeks now, but I was too lazy to report it.

@AndreasGocht
Copy link

We had a not working DNS, so I started digging around, and realized, that we did hat networkd activated. It seems that cloud-init did this, but actually failed to bring up the network in the first attempt, leading to resolved not working properly.

about the logs: I won't publish them online, as there are some public IPs and stuff in there (dating back one year!), which i won't like to share. Maybe I can recreate the issue on a non public "younger" machine. Not sure, when I'll find the time. However, I think I can share an excerpt from the cloud-init logs:

2025-01-03 04:00:32,101 - handlers.py[DEBUG]: finish: init-local: SUCCESS: searching for local datasources
2025-01-03 04:00:32,562 - log_util.py[DEBUG]: Cloud-init v. 24.4-0ubuntu1~22.04.1 running 'init' at Fri, 03 Jan 2025 04:00:32 +0000. Up 3.71 seconds.
2025-01-03 04:00:32,562 - main.py[INFO]: PID [1] started cloud-init 'init'.
2025-01-03 04:00:32,562 - main.py[DEBUG]: No kernel command line url found.
2025-01-03 04:00:32,562 - main.py[DEBUG]: Closing stdin
2025-01-03 04:00:32,564 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [640] 0 bytes
2025-01-03 04:00:32,564 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 106:4
2025-01-03 04:00:32,565 - util.py[DEBUG]: Reading from /var/lib/cloud/data/python-version (quiet=False)
2025-01-03 04:00:32,565 - util.py[DEBUG]: Reading 4 bytes from /var/lib/cloud/data/python-version
2025-01-03 04:00:32,565 - main.py[DEBUG]: Will wait for network connectivity before continuing
2025-01-03 04:00:32,566 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2025-01-03 04:00:32,567 - activators.py[DEBUG]: Using selected activator: <class 'cloudinit.net.activators.NetplanActivator'> from priority: ['netplan', 'eni', 'network-manager', 'networkd']
2025-01-03 04:00:32,568 - subp.py[DEBUG]: Running command ['systemctl', 'is-enabled', 'NetworkManager.service'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 04:00:32,572 - subp.py[DEBUG]: Running command ['systemctl', 'start', 'systemd-networkd-wait-online.service'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 04:02:32,698 - performance.py[DEBUG]: Running ['systemctl', 'start', 'systemd-networkd-wait-online.service'] took 120.125 seconds
2025-01-03 04:02:32,698 - ubuntu.py[ERROR]: Failed to wait for network: Unexpected error while running command.
Command: ['systemctl', 'start', 'systemd-networkd-wait-online.service']
Exit code: 1
Reason: -
Stdout: 
Stderr: Job for systemd-networkd-wait-online.service failed because the control process exited with error code.
        See "systemctl status systemd-networkd-wait-online.service" and "journalctl -xeu systemd-networkd-wait-online.service" for details.
2025-01-03 04:02:32,699 - subp.py[DEBUG]: Running command ['ip', '--json', 'addr'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 04:02:32,704 - subp.py[DEBUG]: Running command ['ip', '-o', 'route', 'list'] with allowed return codes [0] (shell=False, capture=True)
2025-01-03 04:02:32,706 - subp.py[DEBUG]: Running command ['ip', '--oneline', '-6', 'route', 'list', 'table', 'all'] with allowed return codes [0, 1] (shell=False, capture=True)
2025-01-03 04:02:32,710 - handlers.py[DEBUG]: start: init-network/check-cache: attempting to read from cache [trust]
2025-01-03 04:02:32,710 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2025-01-03 04:02:32,710 - stages.py[DEBUG]: no cache found
2025-01-03 04:02:32,710 - handlers.py[DEBUG]: finish: init-network/check-cache: SUCCESS: no cache found
2025-01-03 04:02:32,711 - sources[DEBUG]: Looking for data source in: ['Ec2', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']
2025-01-03 04:02:32,716 - sources[DEBUG]: Searching for network data source in: ['DataSourceEc2', 'DataSourceNone']
2025-01-03 04:02:32,716 - handlers.py[DEBUG]: start: init-network/search-Ec2: searching for network data from DataSourceEc2
2025-01-03 04:02:32,716 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2'>
2025-01-03 04:02:32,716 - sources[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2025-01-03 04:02:32,717 - sources[DEBUG]: Detected DataSourceEc2

and systemd-networkd-wait-online.service:

Jan 03 04:02:32 codevault systemd-networkd-wait-online[521]: Timeout occurred while waiting for network connectivity.
Jan 03 04:02:32 codevault systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 04:02:32 codevault systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Jan 03 04:02:32 codevault systemd[1]: Failed to start Wait for Network to be Configured.

and

Jan 03 04:00:32 codevault systemd[1]: Starting Cloud-init: Network Stage...
Jan 03 04:00:32 codevault cloud-init[517]: Cloud-init v. 24.4-0ubuntu1~22.04.1 running 'init' at Fri, 03 Jan 2025 04:00:32 +0000. Up 3.71 seconds.
Jan 03 04:02:32 codevault cloud-init[517]: 2025-01-03 04:02:32,698 - ubuntu.py[ERROR]: Failed to wait for network: Unexpected error while running command.
Jan 03 04:02:32 codevault cloud-init[517]: Command: ['systemctl', 'start', 'systemd-networkd-wait-online.service']
Jan 03 04:02:32 codevault cloud-init[517]: Exit code: 1
Jan 03 04:02:32 codevault cloud-init[517]: Reason: -
Jan 03 04:02:32 codevault cloud-init[517]: Stdout:
Jan 03 04:02:32 codevault cloud-init[517]: Stderr: Job for systemd-networkd-wait-online.service failed because the control process exited with error code.
Jan 03 04:02:32 codevault cloud-init[517]:         See "systemctl status systemd-networkd-wait-online.service" and "journalctl -xeu systemd-networkd-wait-online.service" for details.
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: | Device |  Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |  eth0  | True |        192.168.3.140         | 255.255.255.0 | global | fa:16:3e:b7:dd:58 |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |  eth0  | True | fe80::f816:3eff:feb7:dd58/64 |       .       |  link  | fa:16:3e:b7:dd:58 |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   lo   | True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   lo   | True |           ::1/128            |       .       |  host  |         .         |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: | Route |   Destination   |   Gateway   |     Genmask     | Interface | Flags |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   0   |     0.0.0.0     | 192.168.3.1 |     0.0.0.0     |    eth0   |   UG  |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   1   | 169.254.169.254 | 192.168.3.3 | 255.255.255.255 |    eth0   |  UGH  |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   2   |   192.168.3.0   |   0.0.0.0   |  255.255.255.0  |    eth0   |   U   |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-------------+---------+-----------+-------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-------------+---------+-----------+-------+
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   3   |    local    |    ::   |    eth0   |   U   |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: |   4   |  multicast  |    ::   |    eth0   |   U   |
Jan 03 04:02:32 codevault cloud-init[517]: ci-info: +-------+-------------+---------+-----------+-------+
Jan 03 04:02:36 codevault cloud-init[517]: 2025-01-03 04:02:36,892 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.
Jan 03 04:02:37 codevault systemd[1]: Finished Cloud-init: Network Stage.

feel free to ask about additional specific log entries.

Best,

Andreas

@TheRealFalcon
Copy link
Member

Thanks for the log snippets. Would it be possible to get the results of

cloud-init query vendordata

along with the contents of /etc/cloud/. E.g., from / run

tar czf logs.tar.gz etc/cloud/

and then attach to this issue?

This shouldn't contain any sensitive data unless it was manually written there.

@Monochromics
Copy link

This is fairly difficult to test, but I /think/ that this all comes down to e30549e

Since prior it was tagged as an After=s-n-w-o, it didn't really care if it was actually successful or not, just that it happened. Since it was swapped to an explicit start call, I think the boot is just failing whenever there is something wrong that prevents s-n-w-o from returning cleanly. Looks like some people are seeing this outside of just digital ocean as well. Presumably something in digital ocean is more reliably triggering this based on the thread? I don't have an acct to see what the c-i layout looks like in those images unfortunately.

The thing is its fairly difficult to trigger, since meeting the requirements to trigger the check itself are fairly stringent. Azure for ex only uses mounts for first-boot, and doesn't use bootcmd/random_seed/w.e. else.

Now what is somewhat interesting is that it looks like the installer .isos (server and desktop) leave di-intentify.cfg in images installed via that method that will force c-i to run even if no ds is found, which would trigger this issue if there was /also/ a network problem that cause w-o to fail. That isn't terribly hard to do though, as just having a disabled nic or something w/o letting netplan know would cause dhcp waiting period to trigger a timeout/exit, then killing c-i and hanging the instance.

All this being said... I don't have a great idea for a solution. As a hacky fix, just changing the systemctl start to ignore the exit code so c-i doesn't fail would work I think? I'd need to do some more testing.

@holmanb
Copy link
Member

holmanb commented Jan 8, 2025

A reproducer or full journal logs + cloud-init logs would make this much easier to debug.

di-intentify.cfg in images installed via that method that will force c-i to run even if no ds is found

Did you observe that? The subiquity autoinstaller disables cloud-init on first boot with /etc/cloud/cloud-init.disabled.

changing the systemctl start to ignore the exit code so c-i doesn't fail would work I think

That wouldn't be reliable. The whole point of calling that binary was to enable cloud-init to run earlier in boot, but only when safe to do so. If this command fails, all bets are off on the "safe to do so" part. Cloud-init could always loop checking the systemd service status when the command fails, but that would obviously not be a great solution.

@TheRealFalcon
Copy link
Member

changing the systemctl start to ignore the exit code so c-i doesn't fail would work I think?

That's essentially what cloud-init already does. We log an error and the process will return 1,but a failure of s-n-w-o doesn't cause a cloud-init crash or early exit.

In the DO case, I have a hunch (but no proof yet) that instances that don't come up are experiencing a dependency loop in systemd. DO's vendor-data writes out a network service file and performs a systemd daemon-reload. That could theoretically be problematic, but I can't reproduce the behavior on a fresh instance. It's likely that only older instances are affected because of the image configuration or vendor data that was used at the time (the c-i logs pasted here are using a deprecated datasource).

@ltvkit
Copy link

ltvkit commented Jan 9, 2025

I might try to get the logs tomorrow, kinda busy atm.
But the servers that are affected for me, are the ones that were originally Ubuntu 16.04/18.04 based, and upgraded to Ubuntu 20.04 and 22.04.

They all have a network interface configured with netplan, and the IP's are set manually, with IPv6/DHCP4/DHCP6 disabled, and using public facing addresses.

@AndreasGocht
Copy link

AndreasGocht commented Jan 9, 2025

Dear all,

please find attached the logs. I added the journal log since boot, and redacted some information (boot.log.tar.gz).

boot.log.tar.gz
logs.tar.gz

sudo cloud-init query vendordata does not seem to output anything (except an empty line).

If you could provide me another way of sharing logs except a public GitHub I am happy to do so.

changing the systemctl start to ignore the exit code so c-i doesn't fail would work I think?

A word on the network manager side: Normally we use neither netplan, nor any of eni, systemd-netword or NetworkManager on our side. I think the network configuration is coming from cloud-init. However, on our side it seems cloud-init tries to start networkd and fails, which leads to resolved failing, and DNS not working. However, the network seemed to be configured correctly.

Best,

Andreas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly incomplete Action required by submitter invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

6 participants