Switch to rhel-coreos-9 #3485

cgwalters · 2023-01-12T20:36:21Z

daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.

vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.

daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) was not quite right because
the previous rpm-ostree rebase command already preserved the previous
kernel. In fact it was pretty expensive to do things this way
because we'd e.g. regenerate the initramfs twice.

To say this another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command which fails before we
even get to switchKernel.

The reason is the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

- Before we do the rebase operation to the new OS, we detect
  any previous overrides of any packages starting with kernel-rt
  and we remove them. Notably this avoids hardcoding any specific
  kernel subpackages; we just remove everything starting with
  kernel-rt, which should be more robust to subpackage changes
  in the future.
- Consequently, the rebase operation will start out by deploying the
  stock image, i.e. with the throughput kernel (though note we are
  carefully preserving other local overrides).
- The switchKernel function no longer needs to take the previous
  machineconfig state into account (except for logging).
  Instead, we just detect if the target is RT, and if so we
  apply the latest packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.
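
The prefix-based detection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `splitKernelRT` and the sample package `usbguard` are hypothetical, and the real daemon would read the layered packages from rpm-ostree rather than from a literal slice.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelRTPrefix is the package-name prefix named in the commit message.
const kernelRTPrefix = "kernel-rt"

// splitKernelRT (hypothetical helper) partitions locally layered package
// names into kernel-rt packages, which would be removed before the rebase,
// and all other packages, which would be preserved.
func splitKernelRT(layered []string) (rt, other []string) {
	for _, pkg := range layered {
		if strings.HasPrefix(pkg, kernelRTPrefix) {
			rt = append(rt, pkg)
		} else {
			other = append(other, pkg)
		}
	}
	return rt, other
}

func main() {
	// Sample input; "usbguard" stands in for an unrelated local override.
	layered := []string{
		"kernel-rt-core", "kernel-rt-kvm",
		"kernel-rt-modules", "kernel-rt-modules-extra",
		"usbguard",
	}
	rt, other := splitKernelRT(layered)
	fmt.Println("remove before rebase:", rt)
	fmt.Println("preserve:", other)
}
```

Matching on the prefix rather than on a fixed list is what keeps this step working when a new subpackage appears.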

Switch to rhel-coreos-9

ref: https://issues.redhat.com/browse/COS-1983

We introduced a new rhel-coreos-9 to aid having a switch be
an atomic operation.

daemon: Also override kernel-modules-core

Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542.
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list. In the future we can
detect when the running host has --remove-installed-kernel and
use it instead.
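
For illustration, a hardcoded list like the one described here might be turned into an `rpm-ostree override remove ... --install ...` invocation as sketched below. The package names are taken from the `rpm-ostree status` output quoted in the discussion; the variable and function names are hypothetical and the real MCD code may assemble the command differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Base kernel packages to override, including the new kernel-modules-core.
// Names follow the RemovedBasePackages line in the status output in this thread.
var defaultKernelPkgs = []string{
	"kernel", "kernel-core", "kernel-modules",
	"kernel-modules-core", "kernel-modules-extra",
}

// Realtime kernel packages to layer, per the LayeredPackages line.
var rtKernelPkgs = []string{
	"kernel-rt-core", "kernel-rt-modules",
	"kernel-rt-modules-extra", "kernel-rt-kvm",
}

// switchToRTArgs (hypothetical helper) builds the rpm-ostree arguments that
// remove the base kernel packages and install the realtime ones in one step.
func switchToRTArgs() []string {
	args := []string{"override", "remove"}
	args = append(args, defaultKernelPkgs...)
	for _, pkg := range rtKernelPkgs {
		args = append(args, "--install", pkg)
	}
	return args
}

func main() {
	fmt.Println("rpm-ostree " + strings.Join(switchToRTArgs(), " "))
}
```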

openshift-azure-routes: Avoid synchronizing too quickly

Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
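
The idea of enforcing a minimum interval between synchronizations can be sketched as below. This only illustrates the throttling logic; it is not the change made to the openshift-azure-routes unit or script, and the names `minInterval` and `remainingDelay` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// minInterval is the 3-second floor suggested in the commit message.
const minInterval = 3 * time.Second

// remainingDelay (hypothetical helper) returns how long to wait before the
// next sync, given how long ago the previous sync finished.
func remainingDelay(sinceLast time.Duration) time.Duration {
	if sinceLast < 0 {
		// Treat a clock that moved backwards as "just synced".
		return minInterval
	}
	if sinceLast >= minInterval {
		return 0
	}
	return minInterval - sinceLast
}

func main() {
	for _, since := range []time.Duration{0, time.Second, 5 * time.Second} {
		fmt.Printf("last sync %v ago -> wait %v\n", since, remainingDelay(since))
	}
}
```

Spacing the runs out this way keeps a burst of file changes from starting the service often enough to reach the systemd start limit.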

ensures that RHCOS 9 SSH keys are in the right place

OKD release controller is out-of-date

ensures SSH keys get moved to the correct location

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.
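
One way to close this gap is to move the keys independently of the config comparison, for example at daemon startup. The sketch below is an assumption-laden illustration rather than the code in this PR: `migrateKeys` and `runDemo` are hypothetical names, the file name `ignition` under `authorized_keys.d` is assumed, and the demo runs inside a temporary directory instead of /home/core.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// migrateKeys (hypothetical helper) moves the key file from oldPath to
// newPath if newPath does not exist yet. It reports whether a move happened.
func migrateKeys(oldPath, newPath string) (bool, error) {
	if _, err := os.Stat(newPath); err == nil {
		return false, nil // keys are already in the new location
	} else if !os.IsNotExist(err) {
		return false, err
	}
	data, err := os.ReadFile(oldPath)
	if os.IsNotExist(err) {
		return false, nil // nothing to migrate
	}
	if err != nil {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(newPath), 0o700); err != nil {
		return false, err
	}
	if err := os.WriteFile(newPath, data, 0o600); err != nil {
		return false, err
	}
	return true, os.Remove(oldPath)
}

// runDemo exercises migrateKeys twice in a scratch directory to show that
// the second call is a no-op.
func runDemo() (bool, bool, error) {
	root, err := os.MkdirTemp("", "sshkeys-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	oldPath := filepath.Join(root, ".ssh", "authorized_keys")
	// Assumed RHCOS 9 layout; the file name "ignition" is an assumption.
	newPath := filepath.Join(root, ".ssh", "authorized_keys.d", "ignition")

	if err := os.MkdirAll(filepath.Dir(oldPath), 0o700); err != nil {
		return false, false, err
	}
	key := []byte("ssh-ed25519 PLACEHOLDER core@example\n")
	if err := os.WriteFile(oldPath, key, 0o600); err != nil {
		return false, false, err
	}
	first, err := migrateKeys(oldPath, newPath)
	if err != nil {
		return false, false, err
	}
	second, err := migrateKeys(oldPath, newPath)
	return first, second, err
}

func main() {
	first, second, err := runDemo()
	fmt.Println("first run moved keys:", first, "second run moved keys:", second, "err:", err)
}
```

Because the check depends only on which files exist, it still runs when the node already reports the latest config.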

teaches TestIgn3Cfg about the new RHCOS 9 key path

openshift-ci · 2023-01-12T20:36:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cgwalters · 2023-01-12T20:36:31Z

/test e2e-aws

openshift-ci · 2023-01-12T20:36:36Z

@cgwalters: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test images
/test okd-scos-images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-alibabacloud-ovn
/test e2e-aws-disruptive
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-upgrade
/test e2e-gcp-op-single-node
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-hypershift
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-gcp-op
/test okd-scos-e2e-gcp-ovn-upgrade
/test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-okd-scos-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-01-12T20:36:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cgwalters · 2023-01-12T20:39:30Z

/test all

cgwalters · 2023-01-12T22:19:06Z

Interesting...in the GCP run, the workers came up fine as rhel9, but the control plane stayed rhel8? https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3485/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1613636820694732800/artifacts/e2e-gcp-op/gather-extra/artifacts/oc_cmds/nodes

cgwalters · 2023-01-12T22:22:55Z

Oh duh I see, need to change the bootstrap path too.

/test all

cgwalters · 2023-01-13T02:54:30Z

/test e2e-gcp-op
⬆️ this one only failed in deprovisioning AFAICS!

NAME                                        STATUS   ROLES                  AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ci-op-qkwynd52-1354f-df8wh-master-0         Ready    control-plane,master   140m   v1.25.2+0003605   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-master-1         Ready    control-plane,master   140m   v1.25.2+0003605   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-master-2         Ready    control-plane,master   138m   v1.25.2+0003605   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-a-w7vxv   Ready    worker                 123m   v1.25.2+0003605   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-b-ps97v   Ready    worker                 123m   v1.25.2+0003605   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-c-62c72   Ready    worker                 123m   v1.25.2+0003605   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9

The e2e-aws-ovn test...
"[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod"
sure looks like an existing flake.
/test e2e-aws-ovn

Hypershift...dunno?

But let's do this a bit more seriously:
/payload 4.13 ci blocking

openshift-ci · 2023-01-13T02:54:37Z

@cgwalters: trigger 4 job(s) of type blocking for the ci release of OCP 4.13

periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3c9740-92ed-11ed-97df-8ca781b2fb9f-0

cgwalters · 2023-01-13T21:17:53Z

Looks like at least one SDN pod is trying to execute the host's copy of oc in the container userspace, which won't work since it now needs a newer glibc:

/host/usr/bin/oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /host/usr/bin/oc)

➡️ https://issues.redhat.com/browse/OCPBUGS-5842

dgoodwin · 2023-01-18T17:59:58Z

@cgwalters: trigger 4 job(s) of type blocking for the ci release of OCP 4.13
* periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade
* periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade
* periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade
* periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3c9740-92ed-11ed-97df-8ca781b2fb9f-0

Just wanted to help with the results here, it's awesome that we can run this now on a new OS image pre-merge. The payload is showing some clear regressions specifically with disruption related to ingress services, oauth and console.

ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)

This is showing up in all 10 runs on
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade/1613731214009569280 (aws ovn micro upgrade)

For those unaware, a good starting point for these disruptions is to expand the intervals chart on the prow page (usually the first one). You will see when the disruption happened and be able to correlate that with whatever else was going on in the cluster.

[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

this failed on all ten runs for:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade/1613731214009569280
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1613731219789320192

Upgrade is not succeeding at all on Azure but I heard that may already be known: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1613731229264252928

And "payload 4.13 nightly blocking" will also give some good coverage, different combos than CI, and a little more of them.

cheesesashimi · 2023-01-18T19:08:03Z

/test e2e-gcp-op

cheesesashimi · 2023-01-18T19:23:59Z

/test e2e-gcp-op-single-node

cgwalters · 2023-01-18T19:24:00Z

Just to get some additional coverage
/test e2e-aws-single-node
/test e2e-gcp-op-single-node
/test e2e-metal-assisted

cheesesashimi · 2023-01-18T19:24:56Z

/test e2e-aws-ovn-fips

openshift-ci · 2023-01-18T19:25:02Z

@cheesesashimi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test images
/test okd-scos-images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-alibabacloud-ovn
/test e2e-aws-disruptive
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-upgrade
/test e2e-gcp-op-single-node
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-hypershift
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-gcp-op
/test okd-scos-e2e-gcp-ovn-upgrade
/test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-okd-scos-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test e2e-aws-ovn-fips

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cgwalters · 2023-01-19T18:39:18Z

> ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)

I don't personally have a lot of expertise on this.

cheesesashimi · 2023-01-19T19:14:34Z

Running the FIPS and RT MCO jobs:

/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-gcp-rt
/test e2e-gcp-rt-op

Note: The e2e-gcp-rt-op job is expected to fail with either (or both) TestKernelArguments and TestKernelType tests failing. I've opened #3504 to green those up.

openshift-ci · 2023-01-19T19:14:55Z

@cheesesashimi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test images
/test okd-scos-images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-alibabacloud-ovn
/test e2e-aws-disruptive
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-upgrade
/test e2e-gcp-op-single-node
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-hypershift
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-gcp-op
/test okd-scos-e2e-gcp-ovn-upgrade
/test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-okd-scos-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

Running the FIPS and RT MCO jobs:

/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-gcp-rt
/test e2e-gcp-rt-op

Note: The e2e-gcp-rt-op job is expected to fail with either (or both) TestKernelArguments and TestKernelType tests failing. I've opened #3504 to green those up.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cheesesashimi · 2023-01-19T19:15:56Z

Interesting. The new jobs aren't showing up here yet.

dgoodwin · 2023-01-20T13:37:36Z

> > ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)
>
> I don't personally have a lot of expertise on this.

Few do; we may need the Ingress or SDN teams' assistance to debug. But following the instructions I gave above, the outage appears to coincide with OVN alerts like "NoOvnRunningMaster", RouteHealth alerts for console and oauth, ovn-raft-quorum-guard PodDisruptionBudget, ovnkube-node target down, and cloud network controller not scheduled on any nodes. It looks more like a network problem than an ingress problem for sure.

Additionally and separately, this test seems to be failing broadly: [sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

cheesesashimi · 2023-01-23T16:15:01Z

/test ?

openshift-ci · 2023-01-23T16:15:05Z

@cheesesashimi: The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test images
/test okd-scos-images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-alibabacloud-ovn
/test e2e-aws-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-upgrade
/test e2e-gcp-op-single-node
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-hypershift
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-gcp-op
/test okd-scos-e2e-gcp-ovn-upgrade
/test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-fips
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-fips-op
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-gcp-rt-op
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-okd-scos-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cheesesashimi · 2023-01-23T16:15:44Z

/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op

cheesesashimi · 2023-01-26T14:42:37Z

/test e2e-gcp-rt
/test e2e-aws-ovn-fips
/test e2e-aws-ovn

cheesesashimi · 2023-01-26T20:09:35Z

My findings and analysis for the additional CI jobs I've added can be found here: https://issues.redhat.com/browse/COS-1990

cgwalters · 2023-03-05T12:53:46Z

Yep, this works here too 🎉

e2e-gcp-op is green, and the payload job is good:

Deployments:
* ostree-unverified-registry:registry.build01.ci.openshift.org/ci-op-jky2jhft/stable@sha256:ab2fa6f321f12af1e45f19d928199aa7fc4a6341aa77d52607bdab7d93ba130b
                   Digest: sha256:ab2fa6f321f12af1e45f19d928199aa7fc4a6341aa77d52607bdab7d93ba130b
                  Version: 413.92.202303011445-0 (2023-03-05T01:05:59Z)
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-core kernel-modules-extra 5.14.0-282.el9
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra

  (error fetching image metadata)
                Timestamp: 2023-03-04T23:38:13Z
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.46.1.el8_6
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra

sdodson · 2023-03-06T02:04:34Z

/payload-job periodic-ci-openshift-release-master-nightly-4.13-e2e-alibaba-ovn

openshift-ci · 2023-03-06T02:04:36Z

@sdodson: trigger 1 job(s) for the /payload-(job|aggregate) command

periodic-ci-openshift-release-master-nightly-4.13-e2e-alibaba-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3db6a430-bbc3-11ed-82a6-dc8a22a52c37-0

cgwalters · 2023-03-08T13:54:25Z

Rebased on the latest #3580
Also lifting draft since we have to merge this soon

cheesesashimi

At first glance, you'll need to re-integrate #3534 because QE found some issues with the directory permissions for /home/core/.ssh/authorized_keys.d, which I've since fixed.

The short version is that since the MCO shells out to mkdir -p instead of calling os.MkdirAll(), only the inner-most directory gets created with the desired permissions (0700 in this case). The rest are created with 0755.
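
To illustrate the difference, here is a small sketch of the `os.MkdirAll()` behavior, which applies the requested mode (before umask) to every directory it creates, not just the innermost one. The helper names and the temporary root are hypothetical, and the reported modes assume a Unix-like system with a umask that does not mask owner bits.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirModes (hypothetical helper) creates root/.ssh/authorized_keys.d with
// os.MkdirAll and returns the mode string of each directory it created.
func dirModes(root string) ([]string, error) {
	parent := filepath.Join(root, ".ssh")
	target := filepath.Join(parent, "authorized_keys.d")
	if err := os.MkdirAll(target, 0o700); err != nil {
		return nil, err
	}
	var modes []string
	for _, dir := range []string{parent, target} {
		info, err := os.Stat(dir)
		if err != nil {
			return nil, err
		}
		modes = append(modes, info.Mode().String())
	}
	return modes, nil
}

// demo runs dirModes in a scratch directory and cleans up afterwards.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "mkdirall-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	return dirModes(root)
}

func main() {
	modes, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Both the parent and the innermost directory carry the same mode.
	fmt.Println(modes)
}
```

With `mkdir -p -m 0700`, by contrast, only the last path component receives the requested mode, which matches the 0755 parents described above.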

cgwalters · 2023-03-08T18:30:23Z

We merged #3596 instead which uses rhel-coreos.

openshift-ci · 2023-03-08T18:33:55Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-azure-ovn-upgrade	d47b5afb199129fa15d2e2891dd80ecc4b449fbe	link	false	`/test e2e-azure-ovn-upgrade`
ci/prow/e2e-gcp-ovn-rt-upgrade	2703cd0a45868f60a8a09fbf09114228c522aefc	link	false	`/test e2e-gcp-ovn-rt-upgrade`
ci/prow/e2e-alibabacloud-ovn	`d47101b`	link	false	`/test e2e-alibabacloud-ovn`
ci/prow/e2e-hypershift	`d47101b`	link	false	`/test e2e-hypershift`
ci/prow/okd-scos-e2e-gcp-ovn-upgrade	`d47101b`	link	false	`/test okd-scos-e2e-gcp-ovn-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2023

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2023

cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 3730efe to 0ca052d Compare January 12, 2023 22:22

cheesesashimi mentioned this pull request Feb 1, 2023

OCPBUGS-6945: Fixes node OS detection #3529

Merged

cgwalters mentioned this pull request Mar 6, 2023

daemon: Wait between high availability control plane node updates #3586

Closed

cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 44955f4 to 7e28171 Compare March 8, 2023 13:52

cgwalters changed the title ~~Add rhel-coreos-9 references~~ Switch to rhel-coreos-9 Mar 8, 2023

cgwalters marked this pull request as ready for review March 8, 2023 13:54

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2023

openshift-ci bot requested review from cheesesashimi and yuqi-zhang March 8, 2023 13:55

cheesesashimi suggested changes Mar 8, 2023

View reviewed changes

cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 7e28171 to df28311 Compare March 8, 2023 15:14

cheesesashimi and others added 8 commits March 8, 2023 10:43

ensures that RHCOS 9 SSH keys are in the right place

4c212ef

OKD release controller is out-of-date

fc98d02

teaches TestIgn3Cfg about the new RHCOS 9 key path

9412f1d

checks perms for SSH key path dirs as well

14d30fe

Switch to rhel-coreos-9

3d7f3c1

ref: https://issues.redhat.com/browse/COS-1983 We introduced a new `rhel-coreos-9` to aid having a switch be an atomic operation.

cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from df28311 to d47101b Compare March 8, 2023 15:49

This was referenced Mar 8, 2023

OCPBUGS-8703: Backport switchkernel 4.13 #3595

Merged

Switch to rhel-coreos (9) #3596

Merged

cgwalters closed this Mar 8, 2023

cgwalters mentioned this pull request Apr 10, 2023

layering: Use rhel-coreos now openshift/openshift-docs#58486

Closed

mburke5678 pushed a commit to mburke5678/openshift-docs that referenced this pull request Apr 11, 2023

layering: Use rhel-coreos now

073d1c0

This changed in openshift/machine-config-operator#3485
