Latest next VMWare OVA Fails To Boot #1802

Closed
fifofonix opened this issue Sep 24, 2024 · 42 comments

@fifofonix

fifofonix commented Sep 24, 2024

Describe the bug

When launching a Fedora 41 next OVA without an Ignition config in VMWare Workstation on Windows, the VM fails to boot with the message "The firmware encountered an unexpected exception. The virtual machine cannot boot." With the testing Fedora 40 OVA, the VM boots to a login prompt without issue.

Separately, CI/CD scripts that deploy the same OVAs with OpenTofu to a VMWare vSphere server infrastructure also fail, although without such a message. In the server deployment case the VMs are listed in vSphere but stay in an 'off' status, and any power-on attempt leaves them in an 'off' status. No console or error messages are produced. Again, the same projects deploy just fine using testing.

Reproduction steps

  1. Download OVA
  2. Attempt to launch via VMWare Workstation

Expected behavior

VM should boot to login as it does for prior FCOS versions

Actual behavior

As described above.

System details

  • VMWare Workstation or VMWare vSphere

Butane or Ignition config

None

Additional information

image

@dustymabe
Member

so there's no messages at all on the console of the VMWare machines? Does the VM even attempt to boot at all or is it something happening at the VMware level that is causing it to not work at all?

What happens if you boot a testing machine, but rebase it to next?

@fifofonix
Author

No console/boot messages at all so it seems like there is something wrong with the OVA.

To all intents and purposes the VM in vSphere looks the same as a testing one, ie. same vmware virtual machine version #.

Rebasing a testing machine to next works fine.

Also, I have re-confirmed today that the OVA deployment issue exists with the very latest next, ie. 41.20240922.1.0.

@dustymabe
Member

Also, I have re-confirmed today that the OVA deployment issue exists with the very latest next, ie. 41.20240922.1.0.

Can you also confirm it DOES NOT exist with the latest testing: 40.20240920.2.0 ?

@fifofonix
Author

Confirmed. Overnight testing CICD deployed canary VM without issues.

@dustymabe
Member

We use the exact same build container to build testing and next so there should be no difference in how the OVA is constructed. That would indicate to me there is a problem inside the OS (i.e. kernel, grub, or something), but rebasing from testing to next would test that theory and you said that rebasing works too.

I'm really not sure. I would expect something to come across the console that we could use to investigate, but you say there is nothing there either :(

@dustymabe
Member

That would indicate to me there is a problem inside the OS (i.e. kernel, grub, or something), but rebasing from testing to next would test that theory and you said that rebasing works too.

ahh. rebasing from testing to next wouldn't update the bootloader that's installed.

Can you run sudo bootupctl update on that rebased system and then reboot to see if it then fails to boot?

@fifofonix
Author

This replicates the issue: the node fails to reboot, and it still fails to boot when a manual power-on signal is given via the vSphere console. For the record, this was the output I got when applying bootupctl update. Hopefully this means you can narrow in on what the issue is?

me@t-canary-vm:~$ sudo bootupctl update
Running as unit: bootupd.service
Previous BIOS: grub2-tools-1:2.06-123.fc40.x86_64
Updated BIOS: grub2-tools-1:2.12-4.fc41.x86_64
Previous EFI: grub2-efi-x64-1:2.06-123.fc40.x86_64,shim-x64-15.8-3.x86_64
Updated EFI: grub2-efi-x64-1:2.12-4.fc41.x86_64,shim-x64-15.8-3.x86_64
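
The output above records a jump from GRUB 2.06 to 2.12 for both the BIOS and EFI components. A minimal sketch, assuming output lines shaped exactly like the ones above, of pulling the GRUB version out of each line so the change is easy to spot in a script. The function name `grub_versions` is hypothetical and not part of bootupd:

```shell
#!/bin/sh
# Hypothetical helper (not part of bootupd): read `bootupctl update`-style
# output on stdin and print "<Previous|Updated> <BIOS|EFI> <grub version>".
# Assumes package strings of the form grub2-<name>-<epoch>:<version>-<release>.
grub_versions() {
    sed -n -E 's/^(Previous|Updated) (BIOS|EFI): grub2-[a-z0-9-]+-[0-9]+:([0-9]+\.[0-9]+)-.*/\1 \2 \3/p'
}

# Sample input copied from the output above.
printf '%s\n' \
    'Previous EFI: grub2-efi-x64-1:2.06-123.fc40.x86_64,shim-x64-15.8-3.x86_64' \
    'Updated EFI: grub2-efi-x64-1:2.12-4.fc41.x86_64,shim-x64-15.8-3.x86_64' |
    grub_versions
# prints:
#   Previous EFI 2.06
#   Updated EFI 2.12
```

Lines that do not match the assumed shape are dropped silently, so an empty result means the format differs rather than that nothing changed.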

@dustymabe
Member

dustymabe commented Sep 30, 2024

Thanks @fifofonix. I've got a few more questions (sorry!).

I've had at least one person report that installing Fedora Server 41 beta seems to work OK so maybe it's not GRUB and it is the way we've created the disk image itself (in the OVA). Is there a way you could try the "bare metal install" workflow using our ISO image (or PXE)? This would isolate the specific package set as the problem (i.e. where we previously suspected GRUB 2.12 as the problem) versus the built disk image as the problem.

@hrismarin

hrismarin commented Oct 1, 2024

Is there a way you could try the "bare metal install" workflow using our ISO image (or PXE)?

At least on my side bare metal install using ISO image works.

$ sudo rpm-ostree status 
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Tue 2024-10-01 07:18:06 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/next
                  Version: 41.20240922.1.0 (2024-09-23T17:19:23Z)
                   Commit: 9193342bf66c4b38fbf49d1d59af8a4e3f0c8ca4cb9d674ad3ba9713eea798c9
             GPGSignature: Valid signature by 466CF2D8B60BC3057AA9453ED0622462E99D6AD1

bootupd also seems to work and the system boots after the following commands.

core@fcos-next:~$ sudo bootupctl -vvvvvvv status
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
Component BIOS
  Installed: grub2-tools-1:2.12-4.fc41.x86_64
  Update: At latest version
Component EFI
  Installed: grub2-efi-x64-1:2.12-4.fc41.x86_64,shim-x64-15.8-3.x86_64
  Update: At latest version
No components are adoptable.
CoreOS aleph version: 41.20240922.1.0
Boot method: BIOS
core@fcos-next:~$ sudo bootupctl -vvvvvvv update
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
No update available for any component.
core@fcos-next:~$ sudo bootupctl -vvvvvvv validate
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
Skipped: BIOS
[DEBUG bootupd::efi] Mounted at "/boot/efi"
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::efi] Unmounted
Validated: EFI

@dustymabe
Member

At least on my side bare metal install using ISO image works.

Are you on VMWare?

@fifofonix
Author

Booting the aarch64 live ISO on VMWare Fusion shows the Grub prompt and goes through to the live bash prompt. Is this sufficient to prove that Grub is not the issue or do I need to install to disk to complete this test?

Note this is slightly different to the original issue which is reported for x86. Do I need to find an old Mac to test the x86 live ISO too?

@dustymabe
Member

Booting the aarch64 live ISO on VMWare Fusion shows the Grub prompt and goes through to the live bash prompt. Is this sufficient to prove that Grub is not the issue or do I need to install to disk to complete this test?

Note this is slightly different to the original issue which is reported for x86. Do I need to find an old Mac to test the x86 live ISO too?

Yeah - not switching out the architecture would be nice. Sorry, I just thought you had a VMWare infra (other than your laptop) where you could run a test. It would be nice if we could try the test on the same architecture and same infra where you hit the original failures. I think that would be on x86_64, and yes, preferably a full install to disk + reboot.

@fifofonix
Author

fifofonix commented Oct 1, 2024

Had a colleague run the x86 ISO and install to disk and reboot on VMWare Workstation and everything goes well. This is an environment that fails when you try to install the OVA.

@hrismarin

hrismarin commented Oct 1, 2024

Are you on VMWare?

Yes, I installed Windows 10 on a bare metal machine, installed VMWare Workstation 17 Player and then installed Fedora CoreOS next from the ISO. I haven't tried installing the OVA yet. Shall I try?

@dustymabe
Member

Are you on VMWare?

Yes, I installed Windows 10 on a bare metal machine, installed VMWare Workstation 17 Player and then installed Fedora CoreOS next from the ISO.

Awesome. Thanks!

I haven't tried installing the OVA yet. Shall I try?

If you have time that would be great! More datapoints certainly help!

@dustymabe
Member

Had a colleague run the x86 ISO and install to disk and reboot on VMWare Workstation and everything goes well. This is an environment that fails when you try to install the OVA.

Thanks! This should help us narrow down the root cause. My guess now is that there is somehow a difference in how the OVA is built for one versus the other, though the same code currently builds both testing and next, so I'm not sure what the difference could be.

@hrismarin

I can confirm that when I try to boot the next OVA image in VMWare Workstation on Windows, the issue is reproduced with the same error message from the bug description.

@dustymabe dustymabe added the meeting topics for meetings label Oct 2, 2024
@gursewak1997
Member

From the community meeting:
@ravanelli will help dig into this and diagnose the issue further to find the root cause.

@gursewak1997 gursewak1997 removed the meeting topics for meetings label Oct 2, 2024
@dustymabe dustymabe added jira for syncing to jira F41 fallout/f41 labels Oct 2, 2024
@ravanelli
Member

I also got the same issue using FCOS next (41) in VMWare Fusion on a Mac (x86); FCOS 40 stable works just fine.
To summarize, are we guessing that the issue lies in how the OVA image is created rather than in GRUB?

@ravanelli
Member

ravanelli commented Oct 2, 2024

Just adding thoughts here: the only thing that changed recently on our side was the osbuild part, so maybe something before the OVA creation could be causing it?

Here is a diff between the two images. The sizes are slightly different; other than that, only GRUB seems to differ.

 diff /f40/ /f41/
Common subdirectories: /f40/boot and /f41/boot
diff /f40/bootupd-state.json /f41/bootupd-state.json
1c1
< {"installed":{"BIOS":{"meta":{"timestamp":"2024-05-29T15:31:22Z","version":"grub2-tools-1:2.06-123.fc40.x86_64"},"filetree":null,"adopted-fr}
\ No newline at end of file
---
> {"installed":{"BIOS":{"meta":{"timestamp":"2024-08-08T12:14:11Z","version":"grub2-tools-1:2.12-4.fc41.x86_64"},"filetree":null,"adopted-from}
\ No newline at end of file
Common subdirectories: /f40/coreos and /f41/coreos
Common subdirectories: /f40/efi and /f41/efi
Common subdirectories: /f40/grub2 and /f41/grub2
Common subdirectories: /f40/loader and /f41/loader
Common subdirectories: /f40/loader.1 and /f41/loader.1
Common subdirectories: /f40/lost+found and /f41/lost+found
Common subdirectories: /f40/ostree and /f41/ostree
## FCOS41:
GPT fdisk (gdisk) version 1.0.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk disk.raw: 20971520 sectors, 10.0 GiB
Sector size (logical): 512 bytes
Disk identifier (GUID): 00000000-0000-4000-A000-000000000001
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 5343198
Partitions will be aligned on 2048-sector boundaries
Total free space is 2015 sectors (1007.5 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  BIOS-BOOT
   2            4096          264191   127.0 MiB   EF00  EFI-SYSTEM
   3          264192         1050623   384.0 MiB   8300  boot
   4         1050624         5341183   2.0 GiB     8300  root
   
## FCOS40:

GPT fdisk (gdisk) version 1.0.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk ../f40/disk.raw: 20971520 sectors, 10.0 GiB
Sector size (logical): 512 bytes
Disk identifier (GUID): 00000000-0000-4000-A000-000000000001
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 5335006
Partitions will be aligned on 2048-sector boundaries
Total free space is 2015 sectors (1007.5 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  BIOS-BOOT
   2            4096          264191   127.0 MiB   EF00  EFI-SYSTEM
   3          264192         1050623   384.0 MiB   8300  boot
   4         1050624         5332991   2.0 GiB     8300  root

I will try to create an image with fcos#41-next downgrading the grub to see what it gives us.
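
As a rough check of the size difference between the two partition tables above, here is a minimal sketch that converts the root partition's start and end sectors into MiB, assuming the 512-byte logical sectors that gdisk reports. The helper name `part_mib` is made up for illustration:

```shell
#!/bin/sh
# Size in MiB of a partition given its first and last sector (inclusive),
# assuming 512-byte logical sectors as reported by gdisk above.
part_mib() {
    echo $(( ($2 - $1 + 1) * 512 / 1048576 ))
}

echo "f41 root: $(part_mib 1050624 5341183) MiB"   # f41 root: 2095 MiB
echo "f40 root: $(part_mib 1050624 5332991) MiB"   # f40 root: 2091 MiB
```

Both tables round the root partition to 2.0 GiB, but the end sectors differ by 8192 sectors, so the F41 root partition is 4 MiB larger. The other three partitions are identical in both images.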

@dustymabe
Member

To summarize, are we guessing that the issue lies in how the OVA image is created rather than in GRUB?

According to the reported test results above it seems like it could be either. There's definitely something nuanced here.

I will try to create an image with fcos#41-next downgrading the grub to see what it gives us.

That will be a good test.

@ravanelli
Member

ravanelli commented Oct 3, 2024

It is indeed a grub issue:

  • Built FCOS41 with latest packages and created VMWare OVA (next today) -> won't boot, same error

  • Built FCOS41 with latest packages, upgrading the grub packages as below and created VMWare OVA -> won't boot, same error:
    grub packages to: 2.12-7.fc42

  • Built FCOS41 with latest packages, downgrade grub and fuse as below, created VMWare OVA -> works, boots ok
    grub packages to 22.06.124.fc41,
    fuse-2.9.9-22.fc41 ,
    fuse-libs-2.9.9-22.fc41

@travier
Member

travier commented Oct 3, 2024

Can you try https://bodhi.fedoraproject.org/updates/FEDORA-2024-a067416d33 ? That should narrow it to the 2.12 rebase.

@ravanelli
Member

Fedora BZ opened: https://bugzilla.redhat.com/show_bug.cgi?id=2317048

@ravanelli
Member

I added the options pager=0 and debug=all in the grub.cfg as suggested by Marta Lewandowska, but still nothing shows up in VMware; it fails even before that point.

@dustymabe
Member

Two more data points that might be helpful:

  • @fifofonix since you have an aarch64 Mac, can you test the OVA (not the ISO)? Does it work? Our aarch64 images are UEFI only, so that may help us narrow down the cause.
  • @ravanelli is there a way to boot a machine on your x86 Mac with UEFI vs BIOS? We know at least one of them fails. Do both of them fail?

@marta-lewandowska

@dustymabe f41 ova boots with BIOS; it is UEFI that is always failing.
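
For anyone collecting further data points, a minimal sketch of how to tell from inside a running Linux guest which firmware path it booted through, relying on the kernel exposing /sys/firmware/efi only on UEFI boots. The function name `boot_method` and its optional directory argument are hypothetical, added only so the check can be exercised against a scratch directory:

```shell
#!/bin/sh
# Print "UEFI" if the kernel exposes an efi directory under the given
# firmware sysfs directory (default: /sys/firmware), otherwise "BIOS".
# The directory argument is a hypothetical convenience for testing.
boot_method() {
    if [ -d "${1:-/sys/firmware}/efi" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

boot_method
```

The bootupctl status output earlier in this thread reports the same information on its "Boot method:" line.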

@ravanelli
Member

ravanelli commented Oct 15, 2024

We found the issue: with GRUB 2.12, the serial configuration needs the port or the unit specified in order to work.
Our VMware configs currently have:
serial --speed=115200
which fails.

Either of the following works:
serial --unit=0 --speed=115200
serial --port=mmio,fefb0000.l --speed=115200 (as an example)

It seems the fix for us is to use --unit=0.
--unit=0 refers to ttyS0, which is the first serial port in VMware if I'm not mistaken.

However, as @dustymabe mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2317048#c10 it may be an issue for users trying to upgrade.

Thanks @marta-lewandowska for all your support and time spent on it!
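
The workaround described in this comment can be sketched as a small text filter, assuming the `serial` command sits on its own line in the GRUB config and uses the `--unit=` or `--port=` spellings shown above. The function name `add_serial_unit` is hypothetical, and this is not the change that was actually shipped (the GRUB bug itself was fixed later in this thread):

```shell
#!/bin/sh
# Hypothetical helper: add --unit=0 to any GRUB `serial` command that names
# neither a unit nor a port. Reads config text on stdin, writes to stdout.
# Other lines (including terminal_input/terminal_output) pass through as-is.
add_serial_unit() {
    sed -E '/^[[:space:]]*serial([[:space:]]|$)/{
        /--(unit|port)=/!s/^([[:space:]]*serial)/\1 --unit=0/
    }'
}

printf '%s\n' 'serial --speed=115200' | add_serial_unit
# prints: serial --unit=0 --speed=115200
```

Because the filter works on stdin and stdout, it can be tried on a copy of a config file before touching the real one.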

@ozbenh

ozbenh commented Oct 15, 2024

Ok, "serial" is rarely used with UEFI (usually I test with UEFI console), I'll look into it

@ozbenh

ozbenh commented Oct 16, 2024

I found the bug in grub ACPI code, fix attached to the above RH bugzilla and will be on its way upstream soon

@dustymabe
Member

Thanks so much @ozbenh!

@marta-lewandowska - could we get https://lists.gnu.org/archive/html/grub-devel/2024-10/msg00216.html backported to rawhide and Fedora 41?

@marta-lewandowska

Thanks so much @ozbenh!

@marta-lewandowska - could we get https://lists.gnu.org/archive/html/grub-devel/2024-10/msg00216.html backported to rawhide and Fedora 41?

we're working on it. looks like upstream reviewed the patch, so we should be able to take it as is.

@dustymabe
Member

and it's landed in https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d58433dd5

Thanks all!

ravanelli added a commit to ravanelli/fedora-coreos-config that referenced this issue Oct 17, 2024
- VMWare OVA Fails To Boot due grub serial bug;
- Fast track packages with the fix.
See: coreos/fedora-coreos-tracker#1802

Signed-off-by: Renata Ravanelli <[email protected]>
@dustymabe
Member

new package fast-track in coreos/fedora-coreos-config#3190

@dustymabe dustymabe added the status/pending-next-release Fixed upstream. Waiting on a next release. label Oct 17, 2024
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Oct 17, 2024
- VMWare OVA Fails To Boot due grub serial bug;
- Fast track packages with the fix.
See: coreos/fedora-coreos-tracker#1802

Signed-off-by: Renata Ravanelli <[email protected]>
@dustymabe
Member

fixup in coreos/fedora-coreos-config#3209

@hrismarin

fedora-coreos-41.20241017.10.0-vmware.x86_64.ova build works on VMware Player 17.6.1 (Windows 10).

@HuijingHei
Member

Also tested fedora-coreos-41.20241017.10.0-vmware.x86_64.ova on ESXi; the VM starts successfully.

@ravanelli
Member

Tested on an x86 Mac with Secure Boot; it also worked fine!

@fifofonix
Author

This morning our daily scheduled pipelines for the deployment of next canary nodes to VMWare vSphere succeeded. Yay!

@dustymabe
Member

The fix for this went into next stream release 41.20241020.1.0. Please try out the new release and report issues.
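
To check whether a given next build already includes the fix, a minimal sketch that compares version strings against 41.20241020.1.0. It assumes a `sort` that supports `-V` (GNU coreutils does); the function name `version_ge` is hypothetical:

```shell
#!/bin/sh
# True when version $1 is at least version $2, using sort -V ordering.
# Assumes a sort implementation with -V support (e.g. GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

FIXED=41.20241020.1.0   # first next release with the fix, per this thread
if version_ge 41.20240922.1.0 "$FIXED"; then
    echo "includes the fix"
else
    echo "predates the fix"
fi
# prints: predates the fix
```

The version to test can be taken from the Version field of rpm-ostree status, as shown earlier in this thread.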

@dustymabe
Member

This issue never affected testing or stable streams.

@dustymabe dustymabe removed the status/pending-next-release Fixed upstream. Waiting on a next release. label Oct 28, 2024