Switch to rhel-coreos-9 #3485
Skipping CI for Draft Pull Request.
/test e2e-aws
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters
/test all
Interesting...in the GCP run, the workers came up fine as rhel9, but the control plane stayed rhel8? https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3485/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1613636820694732800/artifacts/e2e-gcp-op/gather-extra/artifacts/oc_cmds/nodes
Force-pushed from 3730efe to 0ca052d.
Oh duh I see, need to change the bootstrap path too.
/test all
/test e2e-gcp-op
The e2e-aws-ovn test... Hypershift...dunno? But let's do this a bit more seriously:
@cgwalters: trigger 4 job(s) of type blocking for the ci release of OCP 4.13
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a3c9740-92ed-11ed-97df-8ca781b2fb9f-0
Looks like at least one SDN pod is trying to execute the host's copy of
Just wanted to help with the results here, it's awesome that we can run this now on a new OS image pre-merge.
The payload is showing some clear regressions specifically with disruption related to ingress services, oauth and console.
ingress-to-oauth-server-new-connections was unreachable during disruption testing for at least 8m36s of 1h28m43s (maxAllowed=10s)
This is showing up in all 10 runs on
For those unaware, for these disruptions a good starting point is to expand the intervals chart on the prow page (usually the first one); you will see when the disruption happened in time and be able to correlate that with whatever else was going on in the cluster.
[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
Upgrade is not succeeding at all on Azure but I heard that may already be known: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1613731229264252928
And "payload 4.13 nightly blocking" will also give some good coverage, different combos than CI, and a little more of them.
/test e2e-gcp-op
/test e2e-gcp-op-single-node
Just to get some additional coverage
/test e2e-aws-ovn-fips
I don't personally have a lot of expertise on this.
Running the FIPS and RT MCO jobs:
/test e2e-aws-ovn-fips
Note: The
Interesting. The new jobs aren't showing up here yet.
Few do, we may need the Ingress or SDN teams' assistance to debug. But following the instructions I gave above, the outage appears to coincide with OVN alerts like "NoOvnRunningMaster", RouteHealth alerts for console and oauth, the ovn-raft-quorum-guard PodDisruptionBudget, ovnkube-node target down, and the cloud network controller not being scheduled on any nodes. It looks more like a network problem than an ingress problem for sure.
Additionally and separately, this test seems to be failing broadly:
[sig-node] Ephemeral Containers [NodeConformance] will start an ephemeral container in an existing pod [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
/test ?
/test e2e-gcp-rt
/test e2e-gcp-rt
My findings and analysis for the additional CI jobs I've added can be found here: https://issues.redhat.com/browse/COS-1990
Yep, this works here too 🎉 e2e-gcp-op is green, and the payload job is good:
/payload-job periodic-ci-openshift-release-master-nightly-4.13-e2e-alibaba-ovn
@sdodson: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3db6a430-bbc3-11ed-82a6-dc8a22a52c37-0
Force-pushed from 44955f4 to 7e28171.
Rebased on the latest #3580
At first glance, you'll need to re-integrate #3534 because QE found some issues with the directory permissions for `/home/core/.ssh/authorized_keys.d`, which I've since fixed.
The short version is that since the MCO shells out to `mkdir -p` instead of calling `os.MkdirAll()`, only the innermost directory gets created with the desired permissions (`0700` in this case). The rest are created with `0755`.
Force-pushed from 7e28171 to df28311.
When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written to the new location because:
1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.
Force-pushed from df28311 to d47101b.
We merged #3596 instead which uses
daemon: Clean up `switchKernel` a bit
De-duplicate calls to `canonicalizeKernelType` to make the logic easier to read. Also add a few comments.
vendor: Bump coreos/rpm-ostree-client-go
In prep for usage in MCD.
daemon: Make switchKernel less stateful
This is prep for fixing RHEL9 upgrades while maintaining `kernel-rt`.
Previously the `switchKernel` logic tried to carefully handle all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt). But, the last one (rt -> rt) was not quite right because the previous `rpm-ostree rebase` command already preserved the previous kernel. In fact it was pretty expensive to do things this way because we'd e.g. regenerate the initramfs twice.
To say this another way: when doing a RHEL9 update, it's actually the first `rpm-ostree rebase` command which fails before we even get to `switchKernel`. And the reason is due to the introduction of a new `-core` subpackage; xref https://issues.redhat.com/browse/OCPBUGS-8113
So here's the new logic to handle this:
- Before the `rebase` operation to the new OS, we detect any previous overrides of any packages starting with `kernel-rt` and we remove them. Notably this avoids hardcoding any specific kernel subpackages; we just remove everything starting with `kernel-rt`, which should be more robust to subpackage changes in the future.
- The `rebase` operation will hence start out by deploying the stock image i.e. with throughput kernel (though note we are carefully preserving other local overrides).
- The `switchKernel` function no longer needs to take the previous machineconfig state into account (except for logging). Instead, we just detect if the target is RT, and if so we then apply the latest packages.
This significantly simplifies the logic in `switchKernel`, and will help fix RHEL9 upgrades.
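The prefix-based detection described in this commit message can be sketched roughly as follows. This is an illustration only: the function name `kernelRTOverrides` and the idea of receiving the overrides as a plain list of package names are assumptions, and the sketch does not call rpm-ostree or reflect how the MCD actually queries it.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelRTOverrides returns the entries of overrides whose package name
// starts with "kernel-rt", preserving their order. Matching on the prefix
// avoids hardcoding individual kernel subpackages.
// Hypothetical helper; the real daemon code may be structured differently.
func kernelRTOverrides(overrides []string) []string {
	var matched []string
	for _, pkg := range overrides {
		if strings.HasPrefix(pkg, "kernel-rt") {
			matched = append(matched, pkg)
		}
	}
	return matched
}

func main() {
	// Example package names chosen for illustration.
	local := []string{"kernel-rt-core", "usbguard", "kernel-rt-modules", "kernel"}
	for _, pkg := range kernelRTOverrides(local) {
		fmt.Println("would remove override:", pkg)
	}
}
```

Unrelated local overrides (here `usbguard`) and the stock `kernel` package fall through untouched, which matches the note about preserving other local overrides.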
Switch to rhel-coreos-9
ref: https://issues.redhat.com/browse/COS-1983
We introduced a new `rhel-coreos-9` to aid having a switch be an atomic operation.
daemon: Also override `kernel-modules-core`
Unfortunately rpm-ostree requires this right now; we have an issue and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z or at least OCP 4.12.z, which is problematic.
Because we know the new MCD will always be upgrading to RHEL9, for now let's update this hardcoded list. In the future we can detect when the running host has `--remove-installed-kernel` and use it instead.
openshift-azure-routes: Avoid synchronizing too quickly
Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).
I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
ensures that RHCOS 9 SSH keys are in the right place
OKD release controller is out-of-date
ensures SSH keys get moved to the correct location
teaches TestIgn3Cfg about the new RHCOS 9 key path