Nodes stuck on boot with Waiting for service "cri" to be registered #9732

Open
divanikus opened this issue Nov 15, 2024 · 10 comments

Comments

@divanikus

divanikus commented Nov 15, 2024

Bug Report

After updating to the 1.8.x series (this also happens on fresh installs), some of our servers are unable to boot. The Talos API responds, but kubelet does not start and the server stays stuck in the "Booting" stage forever.
The last message in the logs is: Waiting for service "cri" to be registered
A server can boot properly after several reboot attempts.

Description

Nodes won't boot properly.

Logs

logs.zip

Environment

  • Talos version: 1.8.2
  • Kubernetes version: 1.30.3
  • Platform: amd64
@smira
Member

smira commented Nov 18, 2024

There's something else going on here: the CRI hasn't started yet because the user disks haven't been mounted.

I don't have the full picture (config + expected layout) to say why, but I can see messages like this in the log:

10.2.15.100: user: warning: [2024-11-15T13:02:29.859796931Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "/dev/nvme1n1-1", "phase": "waiting -> failed", "error": "filesystem type mismatch: vfat != xfs"}
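
For anyone scanning a larger log dump for the same symptom, here is a minimal Python sketch that pulls the JSON payload out of `volume status` lines like the one above and lists the volumes whose phase ended in `failed`. The helper names (`parse_volume_status`, `failed_volumes`) are made up for illustration, and the line format is assumed from the single sample in this thread, not from any documented log schema.

```python
import json

# Marker text taken from the sample log line quoted in this thread.
MARKER = "volume status "


def parse_volume_status(line):
    """Return the JSON payload of a 'volume status' log line, or None."""
    idx = line.find(MARKER)
    if idx == -1:
        return None
    start = line.find("{", idx)
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def failed_volumes(lines):
    """Yield (volume, error) pairs for lines whose phase ends in 'failed'."""
    for line in lines:
        status = parse_volume_status(line)
        if status and status.get("phase", "").endswith("failed"):
            yield status.get("volume", ""), status.get("error", "")


# The sample line is copied verbatim from the comment above.
SAMPLE = (
    '10.2.15.100: user: warning: [2024-11-15T13:02:29.859796931Z]: [talos] '
    'volume status {"component": "controller-runtime", '
    '"controller": "block.VolumeManagerController", '
    '"volume": "/dev/nvme1n1-1", "phase": "waiting -> failed", '
    '"error": "filesystem type mismatch: vfat != xfs"}'
)

if __name__ == "__main__":
    for volume, error in failed_volumes([SAMPLE]):
        print(f"{volume}: {error}")
```

Lines that do not contain the marker or valid JSON are skipped rather than raising, so the sketch can be fed a whole log file line by line.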

@divanikus
Author

Nothing special in terms of layout. /dev/nvme0n1 is the system disk with default partitioning, and /dev/nvme1n1 is used as the data disk:

    disks:
        - device: /dev/nvme1n1
          partitions:
            - mountpoint: /var/openebs/local

It turns out that on the problematic servers the kernel seems to swap nvme0n1 and nvme1n1. I checked the stuck servers: the system disk is reported as /dev/nvme1n1 and the data disk as /dev/nvme0n1. A couple of reboots are needed before they settle back.

I'm not sure why this is happening. It affects about 10% of the whole setup of semi-identical servers.

@smira
Member

smira commented Nov 18, 2024

You might have better luck with /dev/disk/by-id/... paths. We are looking for better user volume support, but in any case there should be a way to match disks properly.

@divanikus
Author

I don't see how /dev/disk/by-id/ scales to, say, 50 servers. Submitting a per-server configuration sounds like a lot of unnecessary work.
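
One way to reduce the per-server work is to generate the disk section from an inventory instead of writing it by hand. The Python sketch below renders one config fragment per node, reusing the layout posted earlier in this thread with a by-id device path. Everything specific is an assumption for illustration: the node names and disk IDs are hypothetical placeholders, `render_disk_patch` and `render_all` are made-up helper names, and the fragment is assumed to sit under the `machine:` key of the Talos machine config.

```python
def render_disk_patch(disk_id, mountpoint="/var/openebs/local"):
    """Render a YAML fragment that pins the data disk by its by-id name.

    Assumes the `disks:` snippet from this thread lives under `machine:`.
    """
    return (
        "machine:\n"
        "  disks:\n"
        f"    - device: /dev/disk/by-id/{disk_id}\n"
        "      partitions:\n"
        f"        - mountpoint: {mountpoint}\n"
    )


# Hypothetical inventory: node name -> by-id name of that node's data disk.
# Real IDs would have to be collected from each server.
INVENTORY = {
    "node-01": "nvme-EXAMPLE_MODEL_SERIAL0001",
    "node-02": "nvme-EXAMPLE_MODEL_SERIAL0002",
}


def render_all(inventory):
    """Return a dict of node name -> rendered YAML fragment."""
    return {node: render_disk_patch(disk_id) for node, disk_id in inventory.items()}


if __name__ == "__main__":
    for node, patch in render_all(INVENTORY).items():
        print(f"# {node}")
        print(patch)
```

Each rendered fragment could then be fed into whatever per-node patching step the deployment already uses. The inventory still has to be collected once per server, so this only automates the templating, not the discovery.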

@smira
Member

smira commented Nov 18, 2024

Unfortunately there's no other way at the moment. Either way, you want to match on a disk which is not the system disk.

@divanikus
Author

I wonder why it swaps them anyway.

@smira
Member

smira commented Nov 18, 2024

There's a bit of randomness in the way Linux enumerates devices in general, which applies e.g. to network interfaces as well.

This is why those stable by-id symlinks were implemented in the first place. We plan to implement better support with disk selectors for user disks, but this is not available right now.
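
To make the symlink mechanism concrete: on a general-purpose Linux host, each entry in /dev/disk/by-id is a symlink whose name stays stable while its target (for example /dev/nvme0n1 or /dev/nvme1n1) can differ between boots. The Python sketch below prints that mapping. It is only an illustration for a regular Linux machine or rescue environment, since Talos nodes do not ship Python, and `map_stable_ids` is a made-up helper name.

```python
import os


def map_stable_ids(by_id_dir="/dev/disk/by-id"):
    """Map each stable symlink name to the device path it currently resolves to."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        # Directory is absent on non-Linux systems or without udev.
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the kernel-assigned device node.
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for name, target in map_stable_ids().items():
        print(f"{name} -> {target}")
```

Running this before and after a reboot on an affected machine would show the same names on the left even when the kernel device names on the right have swapped.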

@divanikus
Author

BTW, 1.7.x used to report this error:

 user: warning: [2024-11-19T12:32:35.95932376Z]: [talos] task mountUserDisks (1/1): skipping setup of "/dev/nvme1n1", found existing partitions
 kern: warning: [2024-11-19T12:32:35.96166776Z]: XFS (nvme1n1p1): Invalid superblock magic number
 user: warning: [2024-11-19T12:32:36.05590876Z]: [talos] task mountUserDisks (1/1): mountUserDisks failed, rebooting in 35 minutes. You can use talosctl apply-config or talosctl edit mc to fix the issues, error:
 user: warning: [2024-11-19T12:32:36.90599076Z]: error mounting "/dev/nvme1n1p1": error mounting: 1 error(s) occurred:
 user: warning: [2024-11-19T12:32:36.90600176Z]:  error repairing: xfs_repair: exit status 1: Phase 1 - find and verify superblock...
 user: warning: [2024-11-19T12:32:36.90600376Z]: bad primary superblock - bad magic number !!!
 user: warning: [2024-11-19T12:32:36.90600576Z]:
 user: warning: [2024-11-19T12:32:36.90600676Z]: attempting to find secondary superblock...
 user: warning: [2024-11-19T12:32:36.90601676Z]: .........................................................................................Sorry, could not find valid secondary superblock
 user: warning: [2024-11-19T12:32:36.90602876Z]: Exiting now.

That was more verbose and at least caused a reboot. It seems that 1.8.x just stays stuck indefinitely.

@divanikus
Author

I'm wondering why Talos doesn't mount by UUID like other distros do. For example, you can specify /dev/nvme0n1 at install time, but the installer resolves it to a UUID and adjusts fstab accordingly. I do understand that Talos is not a generic Linux distro, but still?

@smira
Member

smira commented Nov 19, 2024

Please see my response at the beginning of the issue:

We are looking for better user volume support.

I've provided all the information I can here. I'll link this issue to #8367, but otherwise there's nothing we can do at the moment.
