"Error probing disks" after upgrade from 1.7.6 to 1.8.1 #9530

Open
Tracked by #8367
docbobo opened this issue Oct 20, 2024 · 14 comments

Comments

@docbobo

docbobo commented Oct 20, 2024

Bug Report

After upgrading from Talos 1.7.6 to 1.8.1, many of the nodes in my cluster failed to come up properly. Instead, they were all repeatedly showing messages like this:

[  187.903990] [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "META", "phase": "failed -> failed", "error": "error probing disk: open /dev/mmcblk0p4: no such file or directory"}
[  188.030306] [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "error probing disk: open /dev/mmcblk0p5: no such file or directory"}

The nodes stay stuck in this loop, though I could reboot back into 1.7.6.

Description

To better understand the problem, let me quickly describe the setup of each of those nodes:

  • SBC (TuringPi RK1)
  • 32GB eMMC Storage
  • 500GB NVME

Originally, I had Talos installed on the eMMC, but later re-installed it onto the NVME, which also required re-writing the boot loader of the eMMC. As mentioned before, this worked flawlessly with 1.7.6 and before.

After some investigation over the weekend, I found out that flashing the boot loader overwrote the master GPT partition table, yet kept the backup table in place. I verified this by running gdisk on those nodes (after booting back into 1.7.6): it pointed out the "corrupted" GPT partition table and offered to recover from the backup, which would include partitions like META, STATE, EPHEMERAL, and more. Clearing the partition table allowed me to properly boot those nodes into 1.8.1.

Continuing the analysis, the question was why Talos was trying to probe for partitions that the kernel apparently does not see. The answer seems to be buried in the new volume management, in particular the following lines in go-blockdevice:

https://github.com/siderolabs/go-blockdevice/blob/f63c85da5406ec4e8b0ad72f138cd663aec989b5/partitioning/gpt/gpt.go#L160-L165

This code segment will fall back to the backup GPT table if the master partition table is returned as empty. As a result, the volume management will assume that it can mount META, STATE, and others from eMMC. However, the kernel seems to ignore the backup table - or maybe it just handles it differently - and as a result, the corresponding devices are not accessible.
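To illustrate the fallback described above, here is a minimal sketch that only checks for the GPT signature at the two standard locations (LBA 1 and the last LBA). It is not the go-blockdevice implementation; the function name `choose_gpt_header` and the fixed 512-byte sector size are assumptions made for the example.

```python
SECTOR_SIZE = 512            # assumed logical sector size
GPT_SIGNATURE = b"EFI PART"  # first 8 bytes of any GPT header


def choose_gpt_header(image: bytes, sector_size: int = SECTOR_SIZE):
    """Return "primary", "backup", or None for a raw disk image.

    Hypothetical helper: it mirrors the "fall back to the backup header
    when the primary is missing" behaviour discussed in this issue.
    """
    total_sectors = len(image) // sector_size
    if total_sectors < 3:
        return None

    # The primary header lives at LBA 1, the backup at the last LBA.
    primary = image[sector_size:sector_size + 8]
    last = (total_sectors - 1) * sector_size
    backup = image[last:last + 8]

    if primary == GPT_SIGNATURE:
        return "primary"
    if backup == GPT_SIGNATURE:
        return "backup"
    return None
```

A disk whose first sectors were overwritten by a boot loader but whose last sector was left alone would come back as "backup" here, which matches the eMMC state described above.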

I did a quick test by building a standalone executable. In its current form, the code lists the following partitions for my eMMC, even though the master partition table is empty:

EFI
BIOS
BOOT
META
STATE
EPHEMERAL

After removing that code, I receive a "no GPT header found" error, which seems to be the more appropriate response in my situation, given that the kernel sees none of the partitions above.

The other thing to keep in mind is that all of these partitions are correctly set up in the master GPT of the NVME. Those are completely ignored during boot, even though they seem to be valid candidates.

Logs

Environment

  • Talos version: 1.8.1
  • Kubernetes version: 1.30.5
  • Platform: TuringPi RK1

@nberlee fyi

@docbobo
Author

docbobo commented Oct 20, 2024

I can keep one or two nodes around for a day or two, if you need more information.

@smira
Member

smira commented Oct 21, 2024

Ok, so we have two issues here:

  • Talos uses backup header GPT (which is correct behavior, but the missing piece is probably that it doesn't try to fix GPT using backup header)
  • Talos doesn't support having multiple system partitions

The first thing we should probably fix eventually, and I can see what we can do here. Using backup GPT is correct behavior.

Multiple system partitions (that is two disks with system partition labels) will never be supported properly, as Talos has no way to know which one to use.
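The ambiguity can be made concrete with a small sketch that reports system labels appearing on more than one disk. This is purely illustrative and not how Talos resolves volumes; the function name `ambiguous_labels` and the shape of its input are assumptions.

```python
from collections import defaultdict


def ambiguous_labels(layout):
    """Map each partition label found on more than one disk to those disks.

    `layout` is a hypothetical {disk: [partition labels]} mapping, e.g.
    collected from whatever tool lists GPT partition names.
    """
    owners = defaultdict(list)
    for disk, labels in layout.items():
        for label in set(labels):
            owners[label].append(disk)
    return {
        label: sorted(disks)
        for label, disks in owners.items()
        if len(disks) > 1
    }
```

With an eMMC and an NVMe that both carry META and STATE, both labels would be reported as ambiguous, and nothing in the labels themselves says which copy is the intended one.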

So the combined effect of the two made your system unbootable, but the fix is to wipe the disk properly, so now your system is in a better shape.

@nberlee
Contributor

nberlee commented Oct 21, 2024

Is there a point in probing a partition which the kernel did not make a device node for? (e.g. backup gpt which the kernel does not read)
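As a rough sketch of the check this question implies, the snippet below derives the partition device node names the kernel would use and reports which of them are absent. The helper names `partition_node` and `missing_nodes` are made up for the example; the only fact relied on is the Linux naming convention that disks whose name ends in a digit get a `p` separator (`mmcblk0p4`, `nvme0n1p5`) while others do not (`sda1`).

```python
import os


def partition_node(disk: str, number: int) -> str:
    """Build the partition device path for a whole-disk path."""
    separator = "p" if disk[-1].isdigit() else ""
    return f"{disk}{separator}{number}"


def missing_nodes(disk: str, numbers, exists=os.path.exists):
    """Return partition nodes that the kernel did not create.

    `exists` is injectable so the check can be exercised without real devices.
    """
    nodes = [partition_node(disk, n) for n in numbers]
    return [node for node in nodes if not exists(node)]
```

On the affected nodes, `missing_nodes("/dev/mmcblk0", [4, 5])` would presumably return both paths from the log messages above, since the kernel never exposed partitions from the backup table.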

@smira
Member

smira commented Oct 21, 2024

> Is there a point in probing a partition which the kernel did not make a device node for? (e.g. backup gpt which the kernel does not read)

I guess you would not want your whole disk to be lost if it's recoverable?

@docbobo
Author

docbobo commented Oct 21, 2024

> • Talos uses backup header GPT (which is correct behavior, but the missing piece is probably that it doesn't try to fix GPT using backup header)

Is that actually the correct behaviour? It seems to be inconsistent with how the kernel is handling things, as far as I can tell from observing the system's behaviour. Also, I am not sure if automatically repairing something as crucial as this is a good idea either, because it might lead to data loss as well.

@smira
Member

smira commented Oct 21, 2024

> Is that actually the correct behaviour? It seems to be inconsistent with how the kernel is handling things, as far as I can tell from observing the system's behaviour. Also, I am not sure if automatically repairing something as crucial as this is a good idea either, because it might lead to data loss as well.

GPT headers are checksummed, so valid/invalid headers are easy to detect.
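For reference, a minimal sketch of that header check, based on the header layout in the UEFI specification: the header size sits at byte offset 12, the header CRC32 at offset 16, and the CRC is computed over the header with the CRC field zeroed. The function name `gpt_header_crc_ok` is an assumption, and this is not the code Talos uses.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_SIZE = 92  # size of the defined GPT header fields


def gpt_header_crc_ok(sector: bytes) -> bool:
    """Check signature and header CRC32 of one sector holding a GPT header."""
    if len(sector) < MIN_HEADER_SIZE or sector[:8] != GPT_SIGNATURE:
        return False

    # Little-endian: header size at offset 12, stored CRC32 at offset 16.
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    if header_size < MIN_HEADER_SIZE or header_size > len(sector):
        return False

    # The CRC is calculated with the CRC field itself set to zero.
    header = bytearray(sector[:header_size])
    header[16:20] = b"\x00\x00\x00\x00"
    return (zlib.crc32(bytes(header)) & 0xFFFFFFFF) == stored_crc
```

Note that this covers only the header. The partition entry array has its own CRC32, stored at offset 88 of the header, which a complete validation would also verify.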

@docbobo
Author

docbobo commented Oct 21, 2024

> GPT headers are checksummed, so valid/invalid headers are easy to detect.

Fair enough. Yet the kernel does not just fall back to using the backup table. What's the reason for that?

@smira
Member

smira commented Oct 21, 2024

> > GPT headers are checksummed, so valid/invalid headers are easy to detect.
>
> Fair enough. Yet the kernel does not just fall back to using the backup table. What's the reason for that?

Certainly I'm not in a position to answer that :)

@docbobo
Author

docbobo commented Oct 21, 2024

I think my main point is this: even if the current behaviour is correct - and I believe there might be different perspectives on that, based on the fact that the kernel apparently does not fall back to the backup GPT table - it results in a non-working installation after the upgrade.

To me, this feels a little bit like a regression. Purely from a "customer perspective", I wouldn't expect the upgrade to fail just because 1.8.x is now behaving correctly 😉 Now, preventing the upgrade due to a corrupt GPT table would be a totally different story as far as I am concerned.

@docbobo
Author

docbobo commented Oct 21, 2024

Oh, by the way. I think in my case, repairing the GPT would've broken both the 1.7.6 and the 1.8.1 installations, requiring manual intervention.

@smira
Member

smira commented Oct 21, 2024

Bug fixes unfortunately reveal old issues which need to be fixed nevertheless. But I think you're unblocked for now, which is good.

@docbobo
Author

docbobo commented Oct 21, 2024

I am, but there is a pretty good chance a few others will run into this.

@jnohlgard
Contributor

jnohlgard commented Oct 22, 2024

I ran into this issue when I flashed the bootable ISO onto a USB stick that previously held a Talos installation. The volume controller probes the backup GPT table found at the end of the block device at a higher priority than the iso9660 header found at the beginning of the disk, where the primary GPT would be located. Since the ISO just contains an initrd, I could boot by yanking the USB drive when I saw the first kernel output, before userspace started.

One possible improvement could be to have a separate probe step for the backup table and have some extra recovery steps around that.
It would have helped me to just have a message saying that it was loading the backup GPT to save the time it took me to dig through the sources to figure out why the boot process was trying to use nonexistent STATE and META partitions from my USB drive.

The problem was that because of this leftover backup it thought that Talos was already installed and the system refused to go into maintenance mode.
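The probing order described in this comment could be sketched as follows: signatures at the start of the device are consulted first, and the backup GPT at the end is only a last resort. This is an illustration of the idea, not the Talos volume controller; `probe_start_first` is a hypothetical name. It relies on the ISO 9660 primary volume descriptor sitting at sector 16 (2048-byte sectors) with the identifier `CD001` one byte in.

```python
GPT_SIGNATURE = b"EFI PART"
ISO9660_MAGIC = b"CD001"
ISO9660_MAGIC_OFFSET = 16 * 2048 + 1  # byte 32769


def probe_start_first(image: bytes, sector_size: int = 512):
    """Classify a raw image, preferring signatures near the start of the disk."""
    magic_end = ISO9660_MAGIC_OFFSET + len(ISO9660_MAGIC)
    if image[ISO9660_MAGIC_OFFSET:magic_end] == ISO9660_MAGIC:
        return "iso9660"

    if image[sector_size:sector_size + 8] == GPT_SIGNATURE:
        return "gpt-primary"

    # Only now look at the last sector, where a stale backup header may linger.
    total_sectors = len(image) // sector_size
    if total_sectors >= 3:
        last = (total_sectors - 1) * sector_size
        if image[last:last + 8] == GPT_SIGNATURE:
            return "gpt-backup"
    return None
```

A caller could treat the "gpt-backup" result as a distinct case and log that the backup table is being used, which is the kind of message asked for above.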

@smira
Member

smira commented Oct 22, 2024

Yes, there is something which needs to be done, though I'm not sure what exactly yet. I've pinned this issue to the volume management epic to make a decision.
