Bump to Fedora 40 #3785

jlebon · 2024-04-25T18:14:48Z

Some of our upstream CIs (ostree, rpm-ostree) require cosa and FCOS to be on the same release. Ideally we'd fix that, but there are details there and we want to move cosa anyway.

jlebon · 2024-04-25T18:15:08Z

Didn't test this at all. Let's see what CI says.

jlebon · 2024-04-25T18:15:13Z

openshift/release PR: openshift/release#51370

jlebon · 2024-04-25T18:19:48Z

(Testing locally as well in parallel now.)

Let's also push a release and add a Quay.io tag before merging this.

dustymabe · 2024-04-25T18:25:54Z

> Let's also push a release and add a Quay.io tag before merging this.

Agreed. Ideally we'd build the next stable with at least a similar base to what testing was done with.

jlebon · 2024-04-25T18:32:16Z

Prow needs openshift/release#51370.

jlebon · 2024-04-25T19:16:07Z

/retest

jmarrero

lgtm

travier · 2024-04-26T08:15:25Z

/test ci/prow/images
/test ci/prow/rhcos

openshift-ci · 2024-04-26T08:15:29Z

@travier: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test images
/test rhcos

Use /test all to run all jobs.

In response to this:

/test ci/prow/images
/test ci/prow/rhcos

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

travier · 2024-04-26T08:15:44Z

/test images
/test rhcos

jlebon · 2024-04-26T14:17:26Z

CoreOS CI is hanging at the `cosa fetch --strict` step. Possibly something is going wrong with supermin. Prow is timing out, likely because of the same issue, but for some reason we're not getting any logs there.

jlebon · 2024-04-29T19:09:38Z

This seems related to virtio-serial writes from the guest side sometimes hanging for some reason (i.e. writes to `/dev/virtio-ports/cosa-cmdout`).

jlebon · 2024-04-29T21:09:09Z

> CoreOS CI hanging at the `cosa fetch --strict` step.

OK, the latest commit seems to have fixed it! I looked a bit through `git log v8.1.3..v8.2.2` in QEMU to see if anything obvious popped out, but didn't see anything.

dustymabe · 2024-04-30T13:32:21Z

Since we have to run CI again, maybe let's update `tests/containers/tang/Containerfile` too.

jlebon · 2024-04-30T21:13:34Z

OK, weird. Debugging in the pod, it looks like Prow is still hitting the same hang that I thought 7857488 (#3785) fixed. And even more fun, I can't get the hang to reproduce when running manually in the pod. So I think there's a race somewhere and the commit just made it less likely.

Anyway, this now sounds like it could be a bug that shows up when combining virtio-serial and stdio. I think I'll just rework this to use a regular serial device instead of virtio-serial, since that's obviously way more battle-tested.

jlebon · 2024-05-01T15:55:20Z

OK, I ran out of cycles trying to debug this. I've ended up having to essentially revert 4eb19f4, which is unfortunate. But at least it passes CI in both Prow and CoreOS CI.

> I think I'll just rework this to use a regular serial device instead of virtio-serial

The problem with this is that it doesn't work on all arches. E.g. on aarch64, adding another `--serial` doesn't create a `/dev/ttyAMA1` device.

jlebon · 2024-05-01T15:57:45Z

I've done some work trying to create a minimal, self-contained reproducer to file a bug, but it's proving trickier than expected.


This is more or less a revert of 4eb19f4. It seems like QEMU v8.2.2 (in Fedora 40) is hitting issues when combining virtio-serial ports and the stdio character device. When the guest writes to the virtio-serial port, it sometimes hangs. We can look at reverting this patch if it works again in a future version.
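To make the two wirings concrete, here is a minimal sketch of the QEMU arguments involved. It assumes the guest-visible port is named `cosa-cmdout` (as in the `/dev/virtio-ports/cosa-cmdout` path mentioned earlier) and uses a hypothetical chardev id `cmdout`; the real flags in `src/cmdlib.sh` may differ. The "stdio" variant is the combination reported as hanging; the "file" variant assumes the output instead goes to a file that the host follows with `tail -F`, as in the reverted-to approach.

```shell
#!/usr/bin/env bash
# Sketch only: emit (one per line) the QEMU arguments that expose a
# virtio-serial port named "cosa-cmdout" to the guest, backed either by
# QEMU's stdio or by a file. The chardev id "cmdout" is hypothetical.
set -euo pipefail

cmdout_qemu_args() {
    local backend=$1      # "stdio" or "file"
    local path=${2:-}     # output file; only used by the "file" backend
    local chardev
    case "${backend}" in
        # Guest writes appear directly on QEMU's stdout (4eb19f4 style).
        stdio) chardev="stdio,id=cmdout" ;;
        # Guest writes land in a file that the host follows with tail -F.
        file) chardev="file,id=cmdout,path=${path}" ;;
        *) echo "unknown backend: ${backend}" >&2; return 1 ;;
    esac
    printf '%s\n' \
        -chardev "${chardev}" \
        -device virtio-serial \
        -device virtserialport,chardev=cmdout,name=cosa-cmdout
}
```

A caller could collect these with `mapfile -t cmdout_args < <(cmdout_qemu_args file "${log}")` and pass `"${cmdout_args[@]}"` along with the rest of the QEMU command line.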

jlebon · 2024-05-01T16:07:02Z

Since CI already passed on this, let's just merge it in to unbreak CI and get to any other fallout faster.

jlebon · 2024-05-06T18:16:22Z

src/cmdlib.sh

```diff
@@ -842,6 +845,9 @@ EOF
     fi
     rc="$(cat "${rc_file}")"

+    # cleanup tail before nuking dir containing file it's following
+    kill "$tail_pid"
```

I think there's a potential race here where `tail` could be killed before it has finished printing all the output, even though QEMU has already exited. A simple fix is to just, e.g., `sleep 1` or whatever, but ughhh. I really wish we could go back to the virtio-serial approach.
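One way to sidestep this race without an arbitrary `sleep` is GNU `tail`'s `--pid` option: `tail` then exits on its own shortly after the given process terminates, having first read what is left in the file. This is a swapped-in technique, not what this PR does, and it assumes GNU coreutils `tail` on Linux. In the sketch, a backgrounded stand-in command plays the role of QEMU writing the output file, and the helper name `follow_until_exit` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: follow a log file until its writer exits, without racing
# `kill "$tail_pid"` against tail's final read. Assumes GNU coreutils tail.
set -euo pipefail

# Hypothetical helper: run "$@" (stand-in for QEMU) with its output going
# to $log, mirror $log to our stdout via tail, and return the writer's rc.
follow_until_exit() {
    local log=$1; shift
    : > "${log}"                      # make sure the file exists for tail
    "$@" >> "${log}" &                # the writer appends to the log
    local writer_pid=$!
    # -n +1: start from the first line, not the last 10.
    # --pid: exit shortly after the writer is gone, after draining the file.
    tail -n +1 -F --pid="${writer_pid}" "${log}" &
    local tail_pid=$!
    local rc=0
    wait "${writer_pid}" || rc=$?     # reap the writer so tail sees it gone
    wait "${tail_pid}"                # tail exits by itself; no kill needed
    return "${rc}"
}
```

The cost is up to about a second of extra wall time while `tail` notices the writer is gone, but it waits only as long as needed rather than a fixed `sleep`.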

And indeed: openshift/os#1498 (comment)

I can't reproduce this locally, but I have a suspicion that `tail` can exit too quickly in some circumstances, causing truncated output (see openshift/os#1498 (comment) and coreos#3785 (comment)). Rather than having an unconditional `sleep`, let's make it easier to test that theory by adding an env var we can use to make it optional. Then we'll test that in CI. Mid-term, I'd like to revert 79b15c8 soon so we can go back to virtio-serial, which is just so much cleaner.
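A minimal sketch of an env-var-gated sleep before killing `tail`. The variable name `COSA_TAIL_SLEEP` and the helper `stop_tail` are hypothetical (the names actually used in #3792 may differ); the point is only that CI can opt into a delay while local runs stay fast.

```shell
#!/usr/bin/env bash
# Sketch only: optionally give tail a moment to flush before killing it.
set -euo pipefail

# Hypothetical helper. COSA_TAIL_SLEEP is a hypothetical env var holding
# the number of seconds to wait; unset or empty means no delay.
stop_tail() {
    local tail_pid=$1
    if [ -n "${COSA_TAIL_SLEEP:-}" ]; then
        sleep "${COSA_TAIL_SLEEP}"
    fi
    # cleanup tail before nuking dir containing file it's following
    kill "${tail_pid}" 2>/dev/null || true
    # reap it so it is really gone before the caller removes the directory
    wait "${tail_pid}" 2>/dev/null || true
}
```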


This is a follow-up to 79b15c8 ("cmdlib.sh: go back to using `tail -F` for command output"), which was subsequently reverted. To summarize, it seems like with QEMU v8.2 (in f40), the guest would sometimes hang when writing over virtio-serial if the device is hooked up to QEMU's stdio. In testing, removing the `<&-` hack to close QEMU's stdin fixed it for CoreOS CI but not Prow: coreos#3785 (comment). I think I've narrowed it down to CoreOS CI (i.e. Jenkins) allocating a tty and Prow not. When stdin is not a tty, QEMU immediately gets EOF if it tries to read anything. I'm not sure exactly what happens, but I think the virtio-serial hang is linked to this (even though there's no userspace code in the guest trying to read from the virtio-serial port). Work around this by explicitly feeding `/dev/zero` to QEMU's stdin.
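The difference between these stdin choices can be seen with any program that reads stdin. The sketch below uses `head` as a stand-in for QEMU, so it illustrates only the EOF behavior the commit message describes, not the virtio-serial hang itself. The helper name `bytes_read_from_stdin` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: what a program reading stdin sees under different
# redirections. `head` stands in for QEMU; this does not reproduce the
# virtio-serial hang itself.
set -euo pipefail

# Hypothetical helper: try to read N bytes from stdin, print how many arrived.
bytes_read_from_stdin() {
    head -c "$1" | wc -c
}

# Non-tty stdin such as /dev/null: the very first read returns EOF.
bytes_read_from_stdin 16 < /dev/null   # reports 0 bytes
# The workaround's input: reads always succeed and never reach EOF.
bytes_read_from_stdin 16 < /dev/zero   # reports 16 bytes

# Applied to the QEMU invocation (illustrative; real flags elided):
#   qemu ... <&-          # old hack: stdin closed
#   qemu ... </dev/zero   # workaround: stdin never hits EOF
```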


This is a follow-up to 79b15c8 ("cmdlib.sh: go back to using `tail -F` for command output"), which was subsequently reverted. To summarize, it seems like with QEMU v8.2 (in f40), the guest would sometimes hang when writing over virtio-serial if the device is hooked up to QEMU's stdio. In testing, removing the `<&-` hack to close QEMU's stdin fixed it for CoreOS CI but not Prow: coreos#3785 (comment). I think I've narrowed it down to CoreOS CI (i.e. Jenkins) allocating a tty and Prow not. When stdin is not a tty, QEMU immediately gets EOF if it tries to read anything. I'm not sure exactly what happens, but I think the virtio-serial hang is linked to this (even though there's no userspace code in the guest trying to read from the virtio-serial port). Work around this by explicitly feeding `/dev/zero` to QEMU's stdin. (cherry picked from commit bb60451)


jlebon mentioned this pull request Apr 25, 2024

Fix Fedora 40 Ci Bugs coreos/rpm-ostree#4937

Merged

jlebon force-pushed the pr/f40-rebase branch from 9596178 to 5189a90 April 25, 2024 18:24

jlebon mentioned this pull request Apr 25, 2024

coreos/coreos-assembler: bump to f40 openshift/release#51370

Merged

jmarrero previously approved these changes Apr 25, 2024

jlebon dismissed jmarrero’s stale review via 7857488 April 29, 2024 20:49

jlebon added the do-not-merge/hold label Apr 29, 2024

dustymabe previously approved these changes Apr 30, 2024

jlebon dismissed dustymabe’s stale review via d20b066 May 1, 2024 04:06

jlebon force-pushed the pr/f40-rebase branch 2 times, most recently from d20b066 to f124fa9 May 1, 2024 15:53

jlebon added 2 commits May 1, 2024 11:59

Bump to Fedora 40

a3f9ab6

jlebon force-pushed the pr/f40-rebase branch from f124fa9 to 1969997 May 1, 2024 16:00

dustymabe approved these changes May 1, 2024

jlebon merged commit 79b15c8 into coreos:main May 1, 2024
2 of 5 checks passed

jlebon deleted the pr/f40-rebase branch May 1, 2024 16:07

jlebon removed the do-not-merge/hold label May 1, 2024

jlebon mentioned this pull request May 6, 2024

Add initial C10S variant openshift/os#1498

Open

jlebon commented May 6, 2024

jlebon mentioned this pull request May 6, 2024

cmdlib.sh: add env var to sleep before killing tail #3792

Merged

jlebon mentioned this pull request May 7, 2024

cmdlib.sh: feed /dev/zero as qemu stdin #3793

Merged
