
Switch to rhel-coreos-9 #3485

Closed

Conversation

cgwalters
Member

@cgwalters cgwalters commented Jan 12, 2023

daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.
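
For illustration, a minimal Go sketch of the de-duplication idea follows; the constant and helper names here are assumptions, not the MCO's exact identifiers:

```go
package main

import "fmt"

const (
	kernelTypeDefault  = "default"
	kernelTypeRealtime = "realtime"
)

// canonicalizeKernelType treats an empty kernel type as the default kernel.
func canonicalizeKernelType(kernelType string) string {
	if kernelType == "" {
		return kernelTypeDefault
	}
	return kernelType
}

// switchKernel canonicalizes once up front instead of re-calling the helper
// inside every branch further down.
func switchKernel(oldType, newType string) {
	oldType = canonicalizeKernelType(oldType)
	newType = canonicalizeKernelType(newType)
	fmt.Printf("switching kernel type: %s -> %s\n", oldType, newType)
}

func main() {
	switchKernel("", kernelTypeRealtime)
}
```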


vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.


daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But the last case (rt -> rt) was not quite right, because
the previous rpm-ostree rebase command had already preserved the
existing kernel. Doing things this way was also fairly expensive
because we would e.g. regenerate the initramfs twice.

To put it another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command that fails, before we
even get to switchKernel.

The reason is the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

  • Before we do the rebase operation to the new OS, we detect
    any previous overrides of packages starting with kernel-rt
    and remove them. Notably, this avoids hardcoding any specific
    kernel subpackages; we just remove everything starting with
    kernel-rt, which should be more robust to subpackage changes
    in the future.
  • Consequently, the rebase operation will start out by deploying the
    stock image, i.e. with the throughput (non-RT) kernel (though note
    we are carefully preserving other local overrides).
  • The switchKernel function no longer needs to take the previous
    machineconfig state into account (except for logging).
    Instead, we just detect whether the target is RT, and if so we
    apply the latest RT packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.
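
As a rough sketch of that flow (the helper names, types, and printed commands below are illustrative assumptions, not the actual MCD code):

```go
package main

import (
	"fmt"
	"strings"
)

// bootedDeployment models just enough of the booted rpm-ostree deployment for
// this sketch: the locally layered packages (e.g. kernel-rt-* when RT is on).
type bootedDeployment struct {
	LayeredPackages []string
}

const kernelTypeRealtime = "realtime"

// kernelRTPackages returns every local modification whose package name starts
// with "kernel-rt", so nothing subpackage-specific needs to be hardcoded.
func kernelRTPackages(d bootedDeployment) []string {
	var pkgs []string
	for _, p := range d.LayeredPackages {
		if strings.HasPrefix(p, "kernel-rt") {
			pkgs = append(pkgs, p)
		}
	}
	return pkgs
}

// updateOS sketches the ordering described above: reset kernel-rt changes,
// rebase to the new OS image (stock/throughput kernel), then switch to RT only
// if the target config asks for it; the previous config is no longer consulted.
func updateOS(booted bootedDeployment, targetImage, targetKernelType string) {
	if rt := kernelRTPackages(booted); len(rt) > 0 {
		fmt.Println("would reset local kernel-rt changes:", strings.Join(rt, " "))
	}
	fmt.Println("would run: rpm-ostree rebase", targetImage)
	if targetKernelType == kernelTypeRealtime {
		fmt.Println("would re-apply the realtime kernel packages on the new OS")
	}
}

func main() {
	booted := bootedDeployment{LayeredPackages: []string{"kernel-rt-core", "kernel-rt-kvm"}}
	updateOS(booted, "rhel-coreos-9", kernelTypeRealtime)
}
```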


Switch to rhel-coreos-9

ref: https://issues.redhat.com/browse/COS-1983

We introduced a new rhel-coreos-9 image to help make the switch
an atomic operation.


daemon: Also override kernel-modules-core

Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list. In the future we can
detect when the running host has --remove-installed-kernel and
use it instead.
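
As a minimal sketch of what that hardcoded list looks like (the variable name is hypothetical; the package names match the rpm-ostree status output quoted later in this conversation):

```go
package main

import "fmt"

// Packages that must be override-removed when layering the realtime kernel.
var kernelOverridePackages = []string{
	"kernel",
	"kernel-core",
	"kernel-modules",
	// New subpackage split out in RHEL 9; it has to be listed explicitly until
	// the running host's rpm-ostree supports --remove-installed-kernel
	// (coreos/rpm-ostree#2542).
	"kernel-modules-core",
	"kernel-modules-extra",
}

func main() {
	fmt.Println(kernelOverridePackages)
}
```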


openshift-azure-routes: Avoid synchronizing too quickly

Rapid file changes triggering the path unit can start the
service here frequently, which can cause the start limit to be
hit; systemd will then refuse further activations (unless we
bump the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
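
For context, a minimal sketch of the mechanism involved; this is not the exact unit shipped by the MCO, and the ExecStart placeholder is illustrative, but the [Unit] start-limit directives are the standard systemd knobs being described (roughly 5 starts per 10 seconds by default):

```ini
# openshift-azure-routes.service (illustrative sketch, not the shipped unit)
[Unit]
Description=Sync Azure load balancer routes (illustrative)
# If the service is started more than StartLimitBurst times within
# StartLimitIntervalSec, systemd refuses further activations; rapid triggers
# from the companion .path unit can exhaust this budget.
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
Type=oneshot
# The real unit runs the azure-routes sync script; rate-limiting the sync to
# once every few seconds keeps a trigger storm from hitting the start limit.
ExecStart=/bin/true
```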


ensures that RHCOS 9 SSH keys are in the right place


OKD release controller is out-of-date


ensures SSH keys get moved to the correct location

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

  1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
  2. The node reboots into RHCOS 9 to complete the upgrade.
  3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.

teaches TestIgn3Cfg about the new RHCOS 9 key path
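
As an illustrative sketch of the path change being described (the helper name and the authorized_keys.d/ignition fragment filename are assumptions based on this PR's discussion, not lifted from the MCO source):

```go
package main

import (
	"fmt"
	"path/filepath"
)

const coreUserHome = "/home/core"

// authorizedKeyPath returns where SSH keys for the core user get written.
func authorizedKeyPath(onRHCOS9 bool) string {
	if onRHCOS9 {
		// RHCOS 9: keys live in a fragment directory; see the later review
		// comment about /home/core/.ssh/authorized_keys.d permissions.
		return filepath.Join(coreUserHome, ".ssh", "authorized_keys.d", "ignition")
	}
	// RHCOS 8: a single authorized_keys file.
	return filepath.Join(coreUserHome, ".ssh", "authorized_keys")
}

func main() {
	fmt.Println(authorizedKeyPath(false))
	fmt.Println(authorizedKeyPath(true))
}
```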


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jan 12, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cgwalters
Member Author

/test e2e-aws

@openshift-ci
Contributor

openshift-ci bot commented Jan 12, 2023

@cgwalters: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Jan 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2023
@cgwalters
Member Author

/test all

@cgwalters cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 3730efe to 0ca052d Compare January 12, 2023 22:22
@cgwalters
Member Author

Oh duh I see, need to change the bootstrap path too.

/test all

@cgwalters
Member Author

/test e2e-gcp-op
⬆️ this one only failed in deprovisioning AFAICS!

NAME                                        STATUS   ROLES                  AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ci-op-qkwynd52-1354f-df8wh-master-0         Ready    control-plane,master   140m   v1.25.2+0003605   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-master-1         Ready    control-plane,master   140m   v1.25.2+0003605   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-master-2         Ready    control-plane,master   138m   v1.25.2+0003605   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-a-w7vxv   Ready    worker                 123m   v1.25.2+0003605   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-b-ps97v   Ready    worker                 123m   v1.25.2+0003605   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9
ci-op-qkwynd52-1354f-df8wh-worker-c-62c72   Ready    worker                 123m   v1.25.2+0003605   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)   5.14.0-70.36.1.el9_0.x86_64   cri-o://1.25.0-53.rhaos4.12.git2002c49.el9

The e2e-aws-ovn test...
"[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod"
sure looks like an existing flake.
/test e2e-aws-ovn

Hypershift...dunno?

But let's do this a bit more seriously:
/payload 4.13 ci blocking

@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2023

@cgwalters: trigger 4 job(s) of type blocking for the ci release of OCP 4.13

  • periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3c9740-92ed-11ed-97df-8ca781b2fb9f-0

@cgwalters
Member Author

cgwalters commented Jan 13, 2023

Looks like at least one SDN pod is trying to execute the host's copy of oc in the container userspace, which won't work since it now needs a newer glibc:

 /host/usr/bin/oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /host/usr/bin/oc)

➡️ https://issues.redhat.com/browse/OCPBUGS-5842

@dgoodwin
Contributor

@cgwalters: trigger 4 job(s) of type blocking for the ci release of OCP 4.13

* periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade

* periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade

* periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade

* periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3c9740-92ed-11ed-97df-8ca781b2fb9f-0

Just wanted to help with the results here; it's awesome that we can run this now on a new OS image pre-merge. The payload is showing some clear regressions, specifically disruption related to ingress services, oauth, and console.

ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)

This is showing up in all 10 runs on
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade/1613731214009569280 (aws ovn micro upgrade)

For those unaware: for these disruptions, a good starting point is to expand the intervals chart on the prow page (usually the first one); you will see when the disruption happened in time and be able to correlate that with whatever else was going on in the cluster.

[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Upgrade is not succeeding at all on Azure but I heard that may already be known: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1613731229264252928

And "payload 4.13 nightly blocking" will also give some good coverage, different combos than CI, and a little more of them.

@cheesesashimi
Member

/test e2e-gcp-op

@cheesesashimi
Member

/test e2e-gcp-op-single-node

@cgwalters
Member Author

Just to get some additional coverage
/test e2e-aws-single-node
/test e2e-gcp-op-single-node
/test e2e-metal-assisted

@cheesesashimi
Member

/test e2e-aws-ovn-fips

@openshift-ci
Contributor

openshift-ci bot commented Jan 18, 2023

@cheesesashimi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test e2e-aws-ovn-fips

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)

I don't personally have a lot of expertise on this.

@cheesesashimi
Member

Running the FIPS and RT MCO jobs:

/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-gcp-rt
/test e2e-gcp-rt-op

Note: The e2e-gcp-rt-op job is expected to fail with either (or both) TestKernelArguments and TestKernelType tests failing. I've opened #3504 to green those up.

@openshift-ci
Contributor

openshift-ci bot commented Jan 19, 2023

@cheesesashimi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

Running the FIPS and RT MCO jobs:

/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-gcp-rt
/test e2e-gcp-rt-op

Note: The e2e-gcp-rt-op job is expected to fail with either (or both) TestKernelArguments and TestKernelType tests failing. I've opened #3504 to green those up.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cheesesashimi
Member

Interesting. The new jobs aren't showing up here yet.

@dgoodwin
Contributor

ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)

I don't personally have a lot of expertise on this.

Few do; we may need the Ingress or SDN teams' assistance to debug. But following the instructions I gave above, the outage appears to coincide with OVN alerts like "NoOvnRunningMaster", RouteHealth alerts for console and oauth, the ovn-raft-quorum-guard PodDisruptionBudget, the ovnkube-node target being down, and the cloud network controller not being scheduled on any nodes. It looks more like a network problem than an ingress problem for sure.

Additionally and separately, this test seems to be failing broadly: [sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

@cheesesashimi
Member

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Jan 23, 2023

@cheesesashimi: The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-fips
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-fips-op
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-rt-op
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cheesesashimi
Member

/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op

@cheesesashimi
Member

/test e2e-gcp-rt
/test e2e-aws-ovn-fips
/test e2e-aws-ovn

@cheesesashimi
Member

My findings and analysis for the additional CI jobs I've added can be found here: https://issues.redhat.com/browse/COS-1990

@cgwalters
Member Author

Yep, this works here too 🎉

e2e-gcp-op is green, and the payload job is good:

Deployments:
* ostree-unverified-registry:registry.build01.ci.openshift.org/ci-op-jky2jhft/stable@sha256:ab2fa6f321f12af1e45f19d928199aa7fc4a6341aa77d52607bdab7d93ba130b
                   Digest: sha256:ab2fa6f321f12af1e45f19d928199aa7fc4a6341aa77d52607bdab7d93ba130b
                  Version: 413.92.202303011445-0 (2023-03-05T01:05:59Z)
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-core kernel-modules-extra 5.14.0-282.el9
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra

  (error fetching image metadata)
                Timestamp: 2023-03-04T23:38:13Z
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.46.1.el8_6
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra

@sdodson
Member

sdodson commented Mar 6, 2023

/payload-job periodic-ci-openshift-release-master-nightly-4.13-e2e-alibaba-ovn

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2023

@sdodson: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-nightly-4.13-e2e-alibaba-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3db6a430-bbc3-11ed-82a6-dc8a22a52c37-0

@cgwalters cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 44955f4 to 7e28171 Compare March 8, 2023 13:52
@cgwalters cgwalters changed the title Add rhel-coreos-9 references Switch to rhel-coreos-9 Mar 8, 2023
@cgwalters cgwalters marked this pull request as ready for review March 8, 2023 13:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2023
@cgwalters
Member Author

Rebased on top of the latest #3580.
Also lifting draft status since we have to merge this soon.

Member

@cheesesashimi cheesesashimi left a comment


At first glance, you'll need to re-integrate #3534 because QE found some issues with the directory permissions for /home/core/.ssh/authorized_keys.d, which I've since fixed.

The short version is that since the MCO shells out to mkdir -p instead of calling os.MkdirAll(), only the innermost directory gets created with the desired permissions (0700 in this case); the rest are created with 0755.
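
A minimal, runnable Go sketch of that difference (the paths are just for illustration; both approaches remain subject to the process umask, and the shell variant needs a mkdir binary on PATH):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	base, err := os.MkdirTemp("", "ssh-perms-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)

	// Variant 1: shell out, as described above. Only authorized_keys.d gets
	// 0700; the intermediate .ssh directory gets the default mode (typically 0755).
	shellPath := filepath.Join(base, "shell", ".ssh", "authorized_keys.d")
	if out, err := exec.Command("mkdir", "-p", "-m", "0700", shellPath).CombinedOutput(); err != nil {
		panic(fmt.Sprintf("mkdir -p failed: %v: %s", err, out))
	}

	// Variant 2: os.MkdirAll applies the requested mode to every directory it creates.
	goPath := filepath.Join(base, "golang", ".ssh", "authorized_keys.d")
	if err := os.MkdirAll(goPath, 0o700); err != nil {
		panic(err)
	}

	for _, p := range []string{
		filepath.Join(base, "shell", ".ssh"),
		shellPath,
		filepath.Join(base, "golang", ".ssh"),
		goPath,
	} {
		info, err := os.Stat(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-60s %v\n", p, info.Mode().Perm())
	}
}
```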

@cgwalters cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from 7e28171 to df28311 Compare March 8, 2023 15:14
cheesesashimi and others added 8 commits March 8, 2023 10:43
@cgwalters cgwalters force-pushed the attempt-to-add-rhel-coreos-9 branch from df28311 to d47101b Compare March 8, 2023 15:49
@cgwalters
Member Author

We merged #3596 instead, which uses rhel-coreos.

@cgwalters cgwalters closed this Mar 8, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 8, 2023

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-azure-ovn-upgrade (commit d47b5afb199129fa15d2e2891dd80ecc4b449fbe, not required): /test e2e-azure-ovn-upgrade
  • ci/prow/e2e-gcp-ovn-rt-upgrade (commit 2703cd0a45868f60a8a09fbf09114228c522aefc, not required): /test e2e-gcp-ovn-rt-upgrade
  • ci/prow/e2e-alibabacloud-ovn (commit d47101b, not required): /test e2e-alibabacloud-ovn
  • ci/prow/e2e-hypershift (commit d47101b, not required): /test e2e-hypershift
  • ci/prow/okd-scos-e2e-gcp-ovn-upgrade (commit d47101b, not required): /test okd-scos-e2e-gcp-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

mburke5678 pushed a commit to mburke5678/openshift-docs that referenced this pull request Apr 11, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
6 participants