Worker node is stuck in grub rescue mode after talosctl reboot #9407

Closed
lwbt opened this issue Sep 30, 2024 · 5 comments
Comments

@lwbt

lwbt commented Sep 30, 2024

Bug Report

Description

I ran talosctl reboot to reboot a worker node or an entire cluster. Worker nodes 1 and 2 failed to reboot, while the controller node rebooted just fine. Worker node 3 had a different issue, which I intended to solve with the reboot, but it turned out that it was not connected. I used worker 3 to reproduce the issue; see details below.

Logs

Environment

  • Talos version: v1.8.0
  • Kubernetes version: v1.31.1
  • Platform: SBC (Raspberry Pi 4 Compute Module in a Compute Blade, https://computeblade.com/ )

$  talosctl reboot --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig
◰ watching nodes: [192.168.8.23]
    * 192.168.8.23: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 9 in reboot sequence: task 1/1: failed, error mounting partitions: error mounting /dev/nvme0n1p3: 1 error(s) occurred:
    error repairing: xfs_repair: exit status 1: 5
cleared inode 155
UUID mismatch on inode 156
cleared inode 156
UUID mismatch on inode 157
cleared inode 157
UUID mismatch on inode 158
cleared inode 158
UUID mismatch on inode 159
cleared inode 159
imap claims inode 160 is present, but inode cluster is sparse, correcting imap
imap claims inode 161 is present, but inode cluster is sparse, correcting imap
imap claims inode 162 is present, but inode cluster is sparse, correcting imap
imap claims inode 163 is present, but inode cluster is sparse, correcting imap
imap claims inode 164 is present, but inode cluster is sparse, correcting imap
imap claims inode 165 is present, but inode cluster is sparse, correcting imap
imap claims inode 166 is present, but inode cluster is sparse, correcting imap
imap claims inode 167 is present, but inode cluster is sparse, correcting imap
imap claims inode 168 is present, but inode cluster is sparse, correcting imap
imap claims inode 169 is present, but inode cluster is sparse, correcting imap
imap claims inode 170 is present, but inode cluster is sparse, correcting imap
imap claims inode 171 is present, but inode cluster is sparse, correcting imap
imap claims inode 172 is present, but inode cluster is sparse, correcting imap
imap claims inode 173 is present, but inode cluster is sparse, correcting imap
imap claims inode 174 is present, but inode cluster is sparse, correcting imap
imap claims inode 175 is present, but inode cluster is sparse, correcting imap
imap claims inode 176 is present, but inode cluster is sparse, correcting imap
imap claims inode 177 is present, but inode cluster is sparse, correcting imap
imap claims inode 178 is present, but inode cluster is sparse, correcting imap
imap claims inode 179 is present, but inode cluster is sparse, correcting imap
imap claims inode 180 is present, but inode cluster is sparse, correcting imap
imap claims inode 181 is present, but inode cluster is sparse, correcting imap
imap claims inode 182 is present, but inode cluster is sparse, correcting imap
imap claims inode 183 is present, but inode cluster is sparse, correcting imap
imap claims inode 184 is present, but inode cluster is sparse, correcting imap
imap claims inode 185 is present, but inode cluster is sparse, correcting imap
imap claims inode 186 is present, but inode cluster is sparse, correcting imap
imap claims inode 187 is present, but inode cluster is sparse, correcting imap
imap claims inode 188 is present, but inode cluster is sparse, correcting imap
imap claims inode 189 is present, but inode cluster is sparse, correcting imap
imap claims inode 190 is present, but inode cluster is sparse, correcting imap
imap claims inode 191 is present, but inode cluster is sparse, correcting imap
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
root inode lost
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
reinitializing realtime bitmap inode
reinitializing realtime summary inode
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
SB summary counter sanity check failed
Metadata corruption detected at 0xaaaac17deba0, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
SB summary counter sanity check failed
Metadata corruption detected at 0xaaaac17deba0, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
xfs_repair: Releasing dirty buffer to free list!
xfs_repair: Refusing to write a corrupt buffer to the data device!
xfs_repair: Lost a write to the data device!

fatal error -- File system metadata writeout failed, err=117.  Re-run xfs_repair.

[Image: screenshot "mpv_2024-10-01-010813_video0_00:09:18 516"]

@smira
Member

smira commented Oct 1, 2024

This looks like disk corruption to me, or some other hardware issue.

@lwbt
Author

lwbt commented Oct 1, 2024

It obviously looks like it, but the filesystem where grub and normal.mod are stored is not XFS, if I recall correctly (FAT/EXT4?). This is reproducible on Talos Linux only. I ran Ubuntu on these compute blades before and never encountered such an issue.

Also notice that the U-Boot logo shows up in the wrong colors. I tested v1.7 and v1.8 on both the compute blade and a regular Raspberry Pi 4 B. The wrong colors appear on both devices, but only on v1.8; v1.7 was fine. I was going to open an issue for that.

siderolabs/sbc-raspberrypi#22

@smira
Member

smira commented Oct 1, 2024

The grub filesystem is xfs, and it is not even mounted during normal operations, so the corruption should have happened at the moment it was written.

The SBC support is a community-driven effort, see https://github.com/siderolabs/sbc-raspberrypi/

@lwbt
Author

lwbt commented Oct 2, 2024

I just tested with a regular Raspberry Pi 4 B and the issue was not reproducible there.

@lwbt
Author

lwbt commented Oct 11, 2024

I just upgraded from 1.8.0 to 1.8.1 and the issue is not reproducible any more for me.

Note: I forgot to update the client binary first, which resulted in an upgrade from 1.8.0 to 1.8.0. So maybe an upgrade, even to the same version number, helps when this occurs.
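
For anyone hitting the same problem, here is a rough sketch of the steps described in this comment: check the client version first, then upgrade the affected node. The node, endpoint, and talosconfig values are the ones from the reboot command in the report. The installer image reference is a made-up placeholder, not a real image, and has to be replaced with the installer image that matches your board and target version. By default the script only prints the upgrade command; it runs it only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: values below come from the report, except INSTALLER_IMAGE,
# which is a hypothetical placeholder and must be replaced.
NODE="192.168.8.23"
ENDPOINT="192.168.8.11"
TALOSCONFIG_FILE="./talosconfig"
INSTALLER_IMAGE="installer.example/talos-installer:v1.8.1"  # placeholder, not a real image

# Print the upgrade command that would be run against the node.
upgrade_cmd() {
  printf 'talosctl upgrade --nodes %s --endpoints %s --talosconfig=%s --image %s\n' \
    "$NODE" "$ENDPOINT" "$TALOSCONFIG_FILE" "$INSTALLER_IMAGE"
}

if [ "${RUN:-0}" = "1" ]; then
  # Confirm the client binary was updated before upgrading the node.
  talosctl version --client
  talosctl upgrade --nodes "$NODE" --endpoints "$ENDPOINT" \
    --talosconfig="$TALOSCONFIG_FILE" --image "$INSTALLER_IMAGE"
else
  # Dry run: only show the command.
  upgrade_cmd
fi
```

Whether an upgrade to the same version helps in other setups is untested; it only matches what happened here.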

I'm closing this issue now.

@lwbt lwbt closed this as completed Oct 11, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2024