"Error probing disks" after upgrade from 1.7.6 to 1.8.1 #9530
Comments
I can keep one or two nodes around for a day or two, if you need more information.
Ok, so we have two issues here:
The first thing we should probably fix eventually, and I can see what we can do here. Using the backup GPT is correct behavior. Multiple system partitions (that is, two disks with system partition labels) will never be supported properly, as Talos has no way to know which one to use. So the combined effect of the two made your system unbootable, but the fix is to wipe the disk properly, so now your system is in better shape.
Is there a point in probing a partition which the kernel did not make a device node for? (e.g. the backup GPT, which the kernel does not read)
I guess you would not want your whole disk to be lost if it's recoverable?
Is that actually the correct behaviour? It seems to be inconsistent with how the kernel is handling things, as far as I can tell from observing the system's behaviour. Also, I am not sure if automatically repairing something as crucial as this is a good idea either, because it might lead to data loss as well.
GPT headers are checksummed, so valid/invalid headers are easy to detect.
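For reference, here is a minimal Go sketch of what such a header check can look like. It is not the go-blockdevice implementation. It assumes the standard GPT header layout (signature "EFI PART" at offset 0, header size at offset 12, CRC32 at offset 16, computed with the CRC field zeroed), and it only checks the header itself, not the partition entry array. The names `headerValid` and `buildHeader` are hypothetical helpers made up for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const (
	gptSignature  = "EFI PART"
	minHeaderSize = 92 // smallest header size in the standard GPT layout
)

// headerValid reports whether sector starts with a GPT header whose
// signature matches and whose stored CRC32 matches the recomputed one.
// Hypothetical helper: it does not validate the partition entry array.
func headerValid(sector []byte) bool {
	if len(sector) < minHeaderSize || !bytes.Equal(sector[:8], []byte(gptSignature)) {
		return false
	}
	size := binary.LittleEndian.Uint32(sector[12:16])
	if size < minHeaderSize || int(size) > len(sector) {
		return false
	}
	stored := binary.LittleEndian.Uint32(sector[16:20])

	// The CRC is computed over the header with the CRC field set to zero,
	// so work on a copy instead of modifying the caller's buffer.
	hdr := make([]byte, size)
	copy(hdr, sector[:size])
	binary.LittleEndian.PutUint32(hdr[16:20], 0)

	return crc32.ChecksumIEEE(hdr) == stored
}

// buildHeader creates a synthetic 512-byte sector with a minimal,
// correctly checksummed header. Only used to exercise headerValid.
func buildHeader() []byte {
	sector := make([]byte, 512)
	copy(sector, gptSignature)
	binary.LittleEndian.PutUint32(sector[8:12], 0x00010000) // revision 1.0
	binary.LittleEndian.PutUint32(sector[12:16], minHeaderSize)
	// The CRC field is still zero here, as required for the computation.
	crc := crc32.ChecksumIEEE(sector[:minHeaderSize])
	binary.LittleEndian.PutUint32(sector[16:20], crc)
	return sector
}

func main() {
	good := buildHeader()
	fmt.Println("intact header valid:", headerValid(good))

	bad := buildHeader()
	bad[40] ^= 0xff // flip one byte inside the checksummed region
	fmt.Println("corrupted header valid:", headerValid(bad))

	fmt.Println("zeroed sector valid:", headerValid(make([]byte, 512)))
}
```

A check like this tells you whether a header is intact. It does not answer the separate question raised in this thread, namely whether an intact backup header should be used when the primary one is gone.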
Fair enough. Yet the kernel does not just fall back to using the backup table. What's the reason for that?
Certainly I'm not in a position to answer that :)
I think my main point is this: even if the current behaviour is correct - and I believe there might be different perspectives on that, based on the fact that the kernel apparently does not fall back to the backup GPT table - it results in a non-working installation after the upgrade. To me, this feels a little bit like a regression. Purely from a "customer perspective", I wouldn't expect the upgrade to fail just because 1.8.x is now behaving correctly 😉 Now, preventing the upgrade due to a corrupt GPT table would be a totally different story as far as I am concerned.
Oh, by the way. I think in my case, repairing the GPT would've broken both the 1.7.6 and the 1.8.1 installations, requiring manual intervention.
Bug fixes unfortunately reveal old issues which need to be fixed nevertheless. But I think you're unblocked for now, which is good.
I am, but there is a pretty good chance a few others will run into this.
I ran into this issue when I flashed the bootable ISO onto a USB stick that previously held a Talos installation. The volume controller probes the backup GPT table found at the end of the block device at a higher priority than the iso9660 header found at the beginning of the disk, where the primary GPT would be located. The problem was that, because of this leftover backup, it thought that Talos was already installed and the system refused to go into maintenance mode. Since the ISO just contains an initrd, I could boot by yanking the USB drive when I saw the first kernel output, before hitting userspace start. One possible improvement could be to have a separate probe step for the backup table and have some extra recovery steps around that.
Yes, something needs to be done here, though I'm not sure what exactly yet. I've pinned this issue to the volume management epic to make a decision.
Bug Report
After upgrading from Talos 1.7.6 to 1.8.1, many of the nodes in my cluster failed to come up properly. Instead, they were all repeatedly showing messages like this:
The node will be stuck in this loop, though I could reboot back into 1.7.6.
Description
To better understand the problem, let me quickly describe the setup of each of those nodes:
Originally, I had Talos installed on the eMMC, but later reinstalled it onto the NVMe, which also required rewriting the boot loader on the eMMC. As mentioned before, this worked flawlessly with 1.7.6 and earlier.
After some investigation over the weekend, I found out that flashing the boot loader overwrote the master GPT partition table, yet kept the backup table in place. This was verified by running gdisk on those nodes (after booting back into 1.7.6), which pointed out the "corrupted" GPT partition table, allowing me to recover from the backup, which would include partitions like META, STATE, EPHEMERAL, and more. Clearing the partition table would allow me to properly boot those nodes into 1.8.1.
Continuing the analysis, the question was why Talos was trying to probe for partitions that the kernel apparently does not see. The answer seems to be buried in the new volume management, in particular the following lines in go-blockdevice:
https://github.com/siderolabs/go-blockdevice/blob/f63c85da5406ec4e8b0ad72f138cd663aec989b5/partitioning/gpt/gpt.go#L160-L165
This code segment will fall back to the backup GPT table if the master partition table is returned as empty. As a result, the volume management will assume that it can mount META, STATE, and others from eMMC. However, the kernel seems to ignore the backup table - or maybe it just handles it differently - and as a result, the corresponding devices are not accessible.
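To make the two behaviours easier to compare, here is a small Go sketch of the probing policy described above. It is an illustration only and not the go-blockdevice code: the "disk" is an in-memory byte slice, the validity check is deliberately reduced to the "EFI PART" signature (a real probe would also verify checksums), and `probe`, `hasSignature`, `allowBackup`, and `errNoGPT` are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

const sectorSize = 512

// errNoGPT mirrors the message quoted in this report. The variable itself
// is a hypothetical name for this sketch.
var errNoGPT = errors.New("no GPT header found")

// hasSignature is a simplified validity check: it only looks for the
// "EFI PART" signature at the start of the given LBA.
func hasSignature(disk []byte, lba int) bool {
	off := lba * sectorSize
	if lba < 0 || off+8 > len(disk) {
		return false
	}
	return bytes.Equal(disk[off:off+8], []byte("EFI PART"))
}

// probe reports which header would be used: the primary one at LBA 1 or,
// if allowBackup is set, the backup one in the last sector of the device.
func probe(disk []byte, allowBackup bool) (string, error) {
	if hasSignature(disk, 1) {
		return "primary", nil
	}
	lastLBA := len(disk)/sectorSize - 1
	if allowBackup && hasSignature(disk, lastLBA) {
		return "backup", nil
	}
	return "", errNoGPT
}

func main() {
	// A 64-sector device with a primary and a backup header.
	disk := make([]byte, 64*sectorSize)
	copy(disk[1*sectorSize:], "EFI PART")
	copy(disk[63*sectorSize:], "EFI PART")

	// Simulate the situation from this report: the primary header was
	// overwritten, while the backup at the end of the device survived.
	copy(disk[1*sectorSize:], make([]byte, sectorSize))

	for _, allow := range []bool{true, false} {
		src, err := probe(disk, allow)
		fmt.Printf("allowBackup=%v source=%q err=%v\n", allow, src, err)
	}
}
```

With the fallback enabled, the stale backup header is picked up and its partitions would be treated as present. With it disabled, the same device yields the "no GPT header found" error.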
I did a quick test building a standalone executable and in the current form, the code will list the following partitions for my eMMC, even though the master partition table is empty:
Removing the code, I will receive a "no GPT header found", which seems to be the more appropriate response in my situation, given that the kernel sees none of the partitions above.
The other thing to keep in mind is that all of these partitions are correctly set up in the master GPT of the NVMe. Those are completely ignored during boot, even though they seem to be valid candidates.
Logs
Environment
@nberlee fyi