Greenboot boot_counter does not decrement if sudo rpm-ostree reset is run before reboot #107

Closed
dhensel-rh opened this issue Jul 5, 2023 · 5 comments

@dhensel-rh

dhensel-rh commented Jul 5, 2023

Issue:
When a package is removed with rpm-ostree and rpm-ostree reset (remove all mutations) is then run before rebooting, Greenboot's behavior is affected. After the reboot, Greenboot does not perform the check, and the Greenboot boot_counter stays at its default value.

Steps to reproduce:

  1. Deploy a RHEL system with Greenboot and rpm-ostree both installed and active
  2. Remove a package: rpm-ostree override remove hostname
  3. Reset rpm-ostree: rpm-ostree reset
  4. Reboot the system

Expected Result:
Greenboot should notice that a package is missing and attempt to restore the last known good state

Actual Result:
Greenboot does not attempt to recover. The Greenboot boot_counter remains at its default value

Additional notes (if any):
The ostree status:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
Greenboot variables:
boot_counter=2

A system reboot will clear the boot flag and restore the system to a known good state
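
To compare boot_counter before and after the reboot, the GRUB environment can be read with grub2-editenv list. The following is a minimal sketch, assuming the output consists of key=value lines like the "GRUB boot variables" shown in the greenboot logs; the function name grubenv_get is hypothetical and not part of greenboot.

```shell
# Hypothetical helper: print the value of one grubenv variable, given
# `grub2-editenv list`-style key=value lines on stdin.
# Prints nothing and returns 1 when the variable is not set.
grubenv_get() {
    key="$1"
    while IFS='=' read -r name value; do
        if [ "$name" = "$key" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1
}

# Usage on a real system (needs root):
#   grub2-editenv list | grubenv_get boot_counter
```

Running this once after step 3 and once after the reboot shows whether the counter was decremented or left untouched.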

@say-paul
Member

say-paul commented Jul 12, 2023

@dhensel-rh According to the documentation, rpm-ostree reset removes any mutations, so the hostname package that was removed by rpm-ostree override remove hostname is restored when the reset is triggered. You can verify this by checking rpm-ostree status before and after step 2.

I am not sure why boot_counter is still set, though. Please share the post-reboot journald logs of these services: greenboot-grub2-set-counter, greenboot-healthcheck, greenboot-grub2-set-success

@miabbott
Member

This looks like https://bugzilla.redhat.com/show_bug.cgi?id=2185901 ?

@LorbusChris
Member

LorbusChris commented Jul 19, 2023

If I'm not mistaken, rpm-ostree override remove <pkg> will trigger ostree-finalize-staged.service, which will pull in greenboot-grub2-set-counter.service with ExecStart=/usr/libexec/greenboot/greenboot-grub2-set-counter.

It's possible that rpm-ostree reset also triggers ostree-finalize-staged.service again (I don't know whether it does).

Either way, there is nothing telling grub to unset the boot_counter variable again in this case.
If rpm-ostree reset triggers ostree-finalize-staged.service a second time, it might suffice to make the greenboot-grub2-set-counter script smarter here (e.g. by checking rpm-ostree status and somehow determining that the last action was reset, and then unsetting the boot_counter var).
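
One possible heuristic for that check, sketched under assumptions: in the logs posted in this issue, the deployment created after the reset has the same checksum as the previous one, so a set-counter style script could treat a staged deployment whose checksum equals the booted one as a no-op. The function name staged_is_noop is hypothetical, this is not the actual greenboot code, and the JSON field names in the commented usage are assumptions about the output of rpm-ostree status --json.

```shell
# Hypothetical helper: succeed when the staged deployment has the same
# checksum as the booted one, i.e. the pending deployment changes nothing.
# Fails when the staged checksum is empty (nothing staged).
staged_is_noop() {
    staged_checksum="$1"
    booted_checksum="$2"
    [ -n "$staged_checksum" ] && [ "$staged_checksum" = "$booted_checksum" ]
}

# Possible use (assumes jq is installed; the .staged, .booted and .checksum
# field names are assumptions and should be verified on the target system):
#   staged=$(rpm-ostree status --json | jq -r '.deployments[] | select(.staged) | .checksum')
#   booted=$(rpm-ostree status --json | jq -r '.deployments[] | select(.booted) | .checksum')
#   if staged_is_noop "$staged" "$booted"; then
#       grub2-editenv - unset boot_counter
#   fi
```

Whether equal checksums are a reliable signal for "the last action was a reset" is untested here; it only matches what the posted logs show.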

@dhensel-rh
Author

sudo journalctl -o cat -u greenboot-grub2-set-success

Starting Mark boot as successful in grubenv...
Finished Mark boot as successful in grubenv.

sudo journalctl -o cat -u greenboot-grub2-set-counter

Starting Set grub2 boot counter in preparation of upgrade...
GRUB2 environment variables have been set for system upgrade. Max boot attempts is 3
Finished Set grub2 boot counter in preparation of upgrade.
greenboot-grub2-set-counter.service: Deactivated successfully.
Stopped Set grub2 boot counter in preparation of upgrade.

sudo journalctl -o cat -u greenboot-healthcheck

Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
No watchdog on the system, skipping check
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.
greenboot-healthcheck.service: Deactivated successfully.
Stopped greenboot Health Checks Runner.
greenboot-healthcheck.service: Consumed 57.582s CPU time.
Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Running greenboot Wanted Health Check Scripts
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
boot_counter=3
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.1
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
  rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0 (rollback)
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.

@say-paul
Member

say-paul commented Sep 12, 2024

So greenboot needs a failure in a health check script before it takes any action (reboot/rollback). A script needs to be added, possibly in required.d/, that checks for the packages of interest and returns a failure in case of any discrepancies.
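
A minimal sketch of such a check, assuming it is installed as a script under greenboot's required.d/ directory (for example /etc/greenboot/check/required.d/50_required_packages.sh; the file name, the check_packages function, and the PKG_QUERY variable are all hypothetical). It relies on rpm -q exiting non-zero when a package is not installed.

```shell
#!/bin/bash
# Hypothetical required health check: fail when a package of interest is missing.

# Command used to query a package; overridable for testing.
# `rpm -q <pkg>` exits non-zero when <pkg> is not installed.
PKG_QUERY="${PKG_QUERY:-rpm -q}"

# Returns 0 if every package passed as an argument is installed, 1 otherwise.
check_packages() {
    missing=0
    for pkg in "$@"; do
        if ! $PKG_QUERY "$pkg" >/dev/null 2>&1; then
            echo "Required package '$pkg' is not installed" >&2
            missing=1
        fi
    done
    return "$missing"
}

# In the installed script, end with the packages of interest, e.g.:
#   check_packages hostname || exit 1
```

A non-zero exit from a required check is what gives greenboot a failure to act on. In the scenario from this issue the check would pass, since rpm-ostree reset restores the removed package before the reboot.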
