Nodes stuck on boot with Waiting for service "cri" to be registered #9732

divanikus · 2024-11-15T13:58:35Z

Bug Report

After updating to the 1.8.x series (fresh installs included), some of our servers are unable to boot. The Talos API responds, but kubelet does not start and the server stays stuck in the "Booting" stage forever.
The last message in the logs is Waiting for service "cri" to be registered
A server can boot properly after several reboot attempts.

Description

Nodes won't boot properly.

Logs

logs.zip

Environment

Talos version: 1.8.2
Kubernetes version: 1.30.3
Platform: amd64

smira · 2024-11-18T08:52:06Z

There's something else going on here - the CRI doesn't start yet, as the user disks haven't been mounted.

I don't have the full picture (config + expected layout) to say why, but I can see messages like this in the log:

10.2.15.100: user: warning: [2024-11-15T13:02:29.859796931Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "/dev/nvme1n1-1", "phase": "waiting -> failed", "error": "filesystem type mismatch: vfat != xfs"}
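For anyone sifting through a large log dump for the same symptom, here is a minimal sketch that pulls failed volumes out of lines shaped like the one above. It assumes the status payload is the JSON object at the end of the line, as in this single sample; the function name `failed_volumes` is made up for illustration and is not part of Talos.

```python
import json
import re

# Assumption: the status payload is a JSON object at the end of the log line,
# as in the sample quoted above.
_STATUS_RE = re.compile(r"\[talos\] volume status (\{.*\})\s*$")


def failed_volumes(log_lines):
    """Return (volume, error) pairs for entries whose phase ends in 'failed'."""
    results = []
    for line in log_lines:
        match = _STATUS_RE.search(line)
        if not match:
            continue
        try:
            status = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # not the shape we expect; skip rather than guess
        # The phase is logged as a transition, e.g. "waiting -> failed".
        final_phase = status.get("phase", "").split("->")[-1].strip()
        if final_phase == "failed":
            results.append((status.get("volume"), status.get("error")))
    return results


sample = '10.2.15.100: user: warning: [2024-11-15T13:02:29.859796931Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "/dev/nvme1n1-1", "phase": "waiting -> failed", "error": "filesystem type mismatch: vfat != xfs"}'
print(failed_volumes([sample]))
```

On the sample line this reports the volume /dev/nvme1n1-1 together with the filesystem type mismatch error.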

divanikus · 2024-11-18T13:38:59Z

Nothing special in terms of layout. /dev/nvme0n1 is the system disk with default partitioning, and /dev/nvme1n1 is used as the data disk:

    disks:
        - device: /dev/nvme1n1
          partitions:
            - mountpoint: /var/openebs/local

It turns out that on the problematic servers the kernel seems to swap nvme0n1 and nvme1n1. I checked the stuck servers: the system disk is reported as /dev/nvme1n1 and the data disk as /dev/nvme0n1. It takes a couple of reboots for them to settle back.

I'm not sure why this is happening. It affects roughly 10% of the whole setup, which consists of semi-identical servers.

smira · 2024-11-18T14:11:08Z

You might have better luck with /dev/disk/by-id/... paths. We are looking into better user volume support, but either way there should be a way to match the disk properly.

divanikus · 2024-11-18T14:17:14Z

I don't see how /dev/disk/by-id/ scales to, say, 50 servers. Submitting a per-server configuration sounds like a lot of unnecessary work.

smira · 2024-11-18T14:28:10Z

There's no other way at the moment, unfortunately. Either way, you want to match on a disk which is not the system disk.
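If per-server configuration is the only option for now, the patches could at least be generated rather than written by hand. The following is a rough sketch, not an official workflow: the inventory, the by-id values, and the second node address are hypothetical placeholders, and the `machine.disks` layout simply mirrors the snippet posted earlier in this thread.

```python
def disk_patch(by_id_path, mountpoint="/var/openebs/local"):
    """Render a machine config patch that pins the data disk to a stable path."""
    return (
        "machine:\n"
        "  disks:\n"
        f"    - device: {by_id_path}\n"
        "      partitions:\n"
        f"        - mountpoint: {mountpoint}\n"
    )


# Hypothetical inventory: node address -> by-id path of that node's data disk.
# The by-id names are placeholders and have to be collected from each server.
inventory = {
    "10.2.15.100": "/dev/disk/by-id/nvme-EXAMPLE_SERIAL_A",
    "10.2.15.101": "/dev/disk/by-id/nvme-EXAMPLE_SERIAL_B",
}

patches = {node: disk_patch(path) for node, path in inventory.items()}

for node, patch in patches.items():
    print(f"# patch for {node}")
    print(patch)
```

Each rendered patch would then still need to be applied to its node with whatever config tooling is already in use, so this only removes the hand-editing, not the per-node step.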

divanikus · 2024-11-18T14:29:59Z

I wonder why it swaps them anyway.

smira · 2024-11-18T14:54:15Z

There's a bit of randomness in the way Linux enumerates devices in general, which applies e.g. to network interfaces as well.

This is why those stable by-id symlinks were implemented in the first place. We plan to implement better support with disk selectors for user disks, but this is not available right now.
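To illustrate what the by-id symlinks provide: each one points at whatever kernel name the disk received on the current boot, so the symlink name stays constant even when nvme0n1 and nvme1n1 trade places. Below is a minimal sketch of reading that mapping on a generic Linux host (for example a rescue environment). It is not something Talos ships, and the function name `stable_names` is an assumption for illustration.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each kernel device name (e.g. 'nvme1n1') to its by-id symlink names."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        if not os.path.islink(link):
            continue
        # Resolve the symlink and keep only the kernel name it points at.
        kernel_name = os.path.basename(os.path.realpath(link))
        mapping.setdefault(kernel_name, []).append(entry)
    return mapping


if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    for kernel_name, links in stable_names().items():
        print(kernel_name, "<-", ", ".join(links))
```

Running it across two boots where the disks swapped would show the same symlink names attached to different kernel names, which is why a by-id path keeps pointing at the intended disk.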

divanikus · 2024-11-19T12:35:43Z

BTW, 1.7.x used to report this error:

 user: warning: [2024-11-19T12:32:35.95932376Z]: [talos] task mountUserDisks (1/1): skipping setup of "/dev/nvme1n1", found existing partitions
 kern: warning: [2024-11-19T12:32:35.96166776Z]: XFS (nvme1n1p1): Invalid superblock magic number
 user: warning: [2024-11-19T12:32:36.05590876Z]: [talos] task mountUserDisks (1/1): mountUserDisks failed, rebooting in 35 minutes. You can use talosctl apply-config or talosctl edit mc to fix the issues, error:
 user: warning: [2024-11-19T12:32:36.90599076Z]: error mounting "/dev/nvme1n1p1": error mounting: 1 error(s) occurred:
 user: warning: [2024-11-19T12:32:36.90600176Z]:  error repairing: xfs_repair: exit status 1: Phase 1 - find and verify superblock...
 user: warning: [2024-11-19T12:32:36.90600376Z]: bad primary superblock - bad magic number !!!
 user: warning: [2024-11-19T12:32:36.90600576Z]:
 user: warning: [2024-11-19T12:32:36.90600676Z]: attempting to find secondary superblock...
 user: warning: [2024-11-19T12:32:36.90601676Z]: .........................................................................................Sorry, could not find valid secondary superblock
 user: warning: [2024-11-19T12:32:36.90602876Z]: Exiting now.

That was more verbose and at least triggered a reboot. It seems 1.8.x just stays stuck indefinitely.

divanikus · 2024-11-19T17:52:51Z

I'm wondering why Talos doesn't mount by UUID like other distros do. You can specify /dev/nvme0n1 at install time, but the installer resolves it to a UUID and adjusts fstab accordingly. I do understand that Talos is not a generic Linux distro, but still?

smira · 2024-11-19T18:01:47Z

Please see my response earlier in this issue:

We are looking for better user volume support.

I submitted all possible information here, I'll link this issue to #8367, but otherwise there's nothing we can do at the moment.
