"Error probing disks" after upgrade from 1.7.6 to 1.8.1 #9530

docbobo · 2024-10-20T17:56:26Z

Bug Report

After upgrading from Talos 1.7.6 to 1.8.1, many of the nodes in my cluster failed to come up properly. Instead, they were all repeatedly showing messages like this:

[  187.903990] [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "META", "phase": "failed -> failed", "error": "error probing disk: open /dev/mmcblk0p4: no such file or directory"}
[  188.030306] [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "error probing disk: open /dev/mmcblk0p5: no such file or directory"}

The node will be stuck in this loop, though I could reboot back into 1.7.6.

Description

To better understand the problem, let me quickly describe the setup of each of those nodes:

SBC (TuringPi RK1)
32GB eMMC Storage
500GB NVME

Originally, I had Talos installed on the eMMC, but later re-installed it onto the NVME, which also required re-writing the boot loader of the eMMC. As mentioned before, this worked flawlessly with 1.7.6 and before.

After some investigation over the weekend, I found out that flashing the boot loader overwrote the master GPT partition table but left the backup table in place. I verified this by running gdisk on those nodes (after booting back into 1.7.6): it pointed out the "corrupted" GPT partition table and allowed me to recover from the backup, which would include partitions like META, STATE, EPHEMERAL, and more. Clearing the partition table allowed me to boot those nodes into 1.8.1 properly.

Continuing the analysis, the question was why Talos was trying to probe for partitions that the kernel apparently does not see. The answer seems to be buried in the new volume management, in particular the following lines in go-blockdevice:

https://github.com/siderolabs/go-blockdevice/blob/f63c85da5406ec4e8b0ad72f138cd663aec989b5/partitioning/gpt/gpt.go#L160-L165

This code segment will fall back to the backup GPT table if the master partition table is returned as empty. As a result, the volume management will assume that it can mount META, STATE, and others from eMMC. However, the kernel seems to ignore the backup table - or maybe it just handles it differently - and as a result, the corresponding devices are not accessible.
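The difference between the two behaviours described above can be reproduced on an in-memory disk image. The following is a minimal sketch in Python, not the go-blockdevice code: it assumes the standard 92-byte GPT header layout from the UEFI specification, uses made-up geometry (a 1 MiB image), and the helper names `make_header`, `header_at`, and `probe` are hypothetical.

```python
import struct
import zlib

SECTOR = 512
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")  # 92-byte header, UEFI layout


def make_header(my_lba, alternate_lba, entries_lba):
    """Pack a header with a correct HeaderCRC32 (illustrative values only)."""
    last_lba = max(my_lba, alternate_lba)
    raw = GPT_HEADER.pack(b"EFI PART", 0x00010000, GPT_HEADER.size, 0, 0,
                          my_lba, alternate_lba, 34, last_lba - 33,
                          bytes(16), entries_lba, 128, 128, 0)
    # The CRC is computed over the header with the CRC field set to zero.
    return raw[:16] + struct.pack("<I", zlib.crc32(raw)) + raw[20:]


def header_at(disk, lba):
    """Return True if a checksum-valid GPT header sits at the given LBA.

    Simplification: assumes the fixed 92-byte header size and does not
    validate the partition entry array.
    """
    raw = bytes(disk[lba * SECTOR: lba * SECTOR + GPT_HEADER.size])
    if len(raw) < GPT_HEADER.size or raw[:8] != b"EFI PART":
        return False
    stored = struct.unpack_from("<I", raw, 16)[0]
    return zlib.crc32(raw[:16] + bytes(4) + raw[20:]) == stored


def probe(disk, use_backup):
    """Report which header, if any, a probe would accept."""
    last_lba = len(disk) // SECTOR - 1
    if header_at(disk, 1):
        return "primary"
    if use_backup and header_at(disk, last_lba):
        return "backup"
    return "no GPT header found"


# Build an image with both headers, then wipe the primary one, similar to
# what flashing the boot loader did in the report above.
disk = bytearray(2048 * SECTOR)
last = 2047
disk[1 * SECTOR: 1 * SECTOR + 92] = make_header(1, last, 2)
disk[last * SECTOR: last * SECTOR + 92] = make_header(last, 1, last - 32)
disk[1 * SECTOR: 2 * SECTOR] = bytes(SECTOR)

print(probe(disk, use_backup=False))  # prints "no GPT header found"
print(probe(disk, use_backup=True))   # prints "backup"
```

With `use_backup=False` the sketch matches what I observe from the kernel (no partitions); with `use_backup=True` it matches the fallback in the linked lines, which is how META and STATE end up being probed on the eMMC.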

I did a quick test building a standalone executable and in the current form, the code will list the following partitions for my eMMC, even though the master partition table is empty:

EFI
BIOS
BOOT
META
STATE
EPHEMERAL

Removing the code, I receive a "no GPT header found" error, which seems to be the more appropriate response in my situation, given that the kernel sees none of the partitions above.

The other thing to keep in mind is that all of these partitions are correctly set up in the master GPT of the NVME. Those are completely ignored during boot, even though they seem to be valid candidates.

Logs

Environment

Talos version: 1.8.1
Kubernetes version: 1.30.5
Platform: TuringPi RK1

@nberlee fyi


docbobo · 2024-10-20T18:10:32Z

I can keep one or two nodes around for a day or two, if you need more information.

smira · 2024-10-21T11:16:21Z

Ok, so we have two issues here:

Talos uses the backup GPT header (which is correct behavior, but the missing piece is probably that it doesn't try to fix the GPT using the backup header)
Talos doesn't support having multiple system partitions

The first thing we should probably fix eventually, and I can see what we can do here. Using backup GPT is correct behavior.

Multiple system partitions (that is two disks with system partition labels) will never be supported properly, as Talos has no way to know which one to use.

So the combined effect of the two made your system unbootable, but the fix is to wipe the disk properly, so now your system is in a better shape.

nberlee · 2024-10-21T12:22:44Z

Is there a point in probing a partition which the kernel did not make a device node for? (e.g. backup gpt which the kernel does not read)

smira · 2024-10-21T12:25:31Z

> Is there a point in probing a partition which the kernel did not make a device node for? (e.g. backup gpt which the kernel does not read)

I guess you would not want your whole disk to be lost if it's recoverable?

docbobo · 2024-10-21T12:27:27Z

> Talos uses the backup GPT header (which is correct behavior, but the missing piece is probably that it doesn't try to fix the GPT using the backup header)

Is that actually the correct behaviour? It seems to be inconsistent with how the kernel is handling things, as far as I can tell from observing the system's behaviour. Also, I am not sure if automatically repairing something as crucial as this is a good idea either, because it might lead to data loss as well.

smira · 2024-10-21T12:29:19Z

> Is that actually the correct behaviour? It seems to be inconsistent with how the kernel is handling things, as far as I can tell from observing the system's behaviour. Also, I am not sure if automatically repairing something as crucial as this is a good idea either, because it might lead to data loss as well.

GPT headers are checksummed, so valid/invalid headers are easy to detect.
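To illustrate the checksum check: below is a minimal Python sketch, not the go-blockdevice implementation. It assumes the standard 92-byte header layout from the UEFI specification, where the header CRC32 is computed with its own field zeroed; `build_header` and `header_is_valid` are hypothetical helper names, and the partition entry array CRC is deliberately not checked here.

```python
import struct
import zlib

# GPT header layout per the UEFI specification (little-endian, 92 bytes).
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")
SIGNATURE = b"EFI PART"
CRC_OFFSET = 16  # byte offset of the HeaderCRC32 field


def build_header(my_lba, alternate_lba, entries_lba):
    """Build a header with a correct HeaderCRC32 (illustrative values only)."""
    last_lba = max(my_lba, alternate_lba)
    unsigned = GPT_HEADER.pack(SIGNATURE, 0x00010000, GPT_HEADER.size, 0, 0,
                               my_lba, alternate_lba, 34, last_lba - 33,
                               bytes(16), entries_lba, 128, 128, 0)
    crc = zlib.crc32(unsigned)  # CRC field is still zero at this point
    return unsigned[:CRC_OFFSET] + struct.pack("<I", crc) + unsigned[CRC_OFFSET + 4:]


def header_is_valid(raw):
    """Check the signature and the header CRC32 of a raw GPT header."""
    if len(raw) < GPT_HEADER.size:
        return False
    fields = GPT_HEADER.unpack_from(raw)
    signature, header_size, stored_crc = fields[0], fields[2], fields[3]
    if signature != SIGNATURE or not GPT_HEADER.size <= header_size <= len(raw):
        return False
    # Recompute the CRC over header_size bytes with the CRC field zeroed.
    zeroed = raw[:CRC_OFFSET] + bytes(4) + raw[CRC_OFFSET + 4:header_size]
    return zlib.crc32(zeroed) == stored_crc


good = build_header(my_lba=1, alternate_lba=1000, entries_lba=2)
damaged = bytearray(good)
damaged[24] ^= 0xFF  # corrupt one byte of the MyLBA field

print(header_is_valid(good))            # prints True
print(header_is_valid(bytes(damaged)))  # prints False
print(header_is_valid(bytes(92)))       # prints False (overwritten with zeros)
```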

docbobo · 2024-10-21T12:34:23Z

> GPT headers are checksummed, so valid/invalid headers are easy to detect.

Fair enough. Yet the kernel does not just fall back to using the backup table. What's the reason for that?

smira · 2024-10-21T13:07:15Z

> > GPT headers are checksummed, so valid/invalid headers are easy to detect.
>
> Fair enough. Yet the kernel does not just fall back to using the backup table. What's the reason for that?

Certainly I'm not in a position to answer that :)

docbobo · 2024-10-21T13:20:29Z

I think my main point is this: even if the current behaviour is correct (and I believe there might be different perspectives on that, given that the kernel apparently does not fall back to the backup GPT table), it results in a non-working installation after the upgrade.

To me, this feels a little bit like a regression. Purely from a "customer perspective", I wouldn't expect the upgrade to fail just because 1.8.x is now behaving correctly 😉 Now, preventing the upgrade due to a corrupt GPT table would be a totally different story as far as I am concerned.

docbobo · 2024-10-21T13:24:25Z

Oh, by the way. I think in my case, repairing the GPT would've broken both the 1.7.6 and the 1.8.1 installations, requiring manual intervention.

smira · 2024-10-21T15:06:18Z

Bug fixes unfortunately reveal old issues which need to be fixed nevertheless. But I think you're unblocked for now, which is good.

docbobo · 2024-10-21T19:07:57Z

I am, but there is a pretty good chance a few others will run into this.

jnohlgard · 2024-10-22T07:12:16Z

I ran into this issue when I flashed the bootable ISO onto a USB stick that previously held a Talos installation. The volume controller probes the backup GPT table found at the end of the block device at a higher priority than the iso9660 header found at the beginning of the disk, where the primary GPT would be located. Since the ISO just contains an initrd, I could boot by yanking the USB drive when I saw the first kernel output, before userspace started.

One possible improvement could be to have a separate probe step for the backup table and have some extra recovery steps around that.
It would have helped me to just have a message saying that it was loading the backup GPT. That would have saved the time it took me to dig through the sources to figure out why the boot process was trying to use nonexistent STATE and META partitions from my USB drive.

The problem was that, because of this leftover backup, Talos thought it was already installed and the system refused to go into maintenance mode.
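As a rough illustration of the separate probe step suggested above, the decision could be split out so that using the backup table always comes with an explicit message. This is only a sketch in Python of the idea, not how Talos or go-blockdevice is structured; `choose_table` and its message wording are hypothetical.

```python
def choose_table(primary_valid, backup_valid, device):
    """Pick which GPT copy to use and return a message for the operator.

    Returns a (source, message) pair; message is None when nothing
    unusual happened.
    """
    if primary_valid:
        return "primary", None
    if backup_valid:
        # The backup copy is usable, but say so loudly instead of silently
        # treating the disk as if the primary table were intact.
        return "backup", f"{device}: primary GPT header invalid, using backup GPT header"
    return None, f"{device}: no GPT header found"


source, message = choose_table(False, True, "/dev/sda")
print(source)   # prints "backup"
print(message)  # prints "/dev/sda: primary GPT header invalid, using backup GPT header"
```

Keeping the source of the table in the result would also let later steps decide whether extra recovery handling is needed before mounting anything.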

smira · 2024-10-22T07:15:04Z

Yes, something needs to be done here, though I'm not sure what exactly yet. I've pinned this issue to the volume management epic to make a decision.

smira mentioned this issue Oct 21, 2024

Volume Management #8367

Open
