Bootstrap node is not serving ignition files #2041

Open
alrf opened this issue Oct 8, 2024 · 10 comments
Labels
OKD SCOS 4.16 pre-release-testing Items related to testing nightlies before a release.

Comments

@alrf

alrf commented Oct 8, 2024

OKD Version: 4.15.0-0.okd-2024-03-10-010116
FCOS Version: 39.20240210.3.0 (CoreOS)

https://api-int.mydomain.com:22623/config/master shows HTTP ERROR 500

In the logs of the machine-config-server container I see:

E1008 11:17:17.901441       1 api.go:183] couldn't convert config for req: {master 0xc0003a2380}, error: failed to convert config from spec v3.2 to v2.2: unable to convert Ignition spec v3 config to v2: SizeMiB and StartMiB in Storage.Disks.Partitions is not supported on 2.2
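For anyone hitting the same message, a rough way to see whether a config contains the fields it names is to count the `sizeMiB` and `startMiB` keys in the JSON. This is only a sketch: the function name `count_mib_partition_keys` is made up for illustration, and it does a plain text match instead of parsing the Ignition config, so it cannot tell where in the document the keys appear.

```shell
# Hypothetical helper: reads Ignition JSON on stdin and prints how many
# "sizeMiB" / "startMiB" keys it contains (text match only, no JSON parsing).
count_mib_partition_keys() {
  awk '{ n += gsub(/"(sizeMiB|startMiB)"[ \t]*:/, "&") } END { print n + 0 }'
}
```

It can be fed any saved config, for example `count_mib_partition_keys < my-config.ign` (the file name is a placeholder). A non-zero count only says the keys are present somewhere in the file.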
@JaimeMagiera
Contributor

Hi,

Not sure if you saw either of these...

https://okd.io/blog/2024/06/01/okd-future-statement/
https://okd.io/blog/2024/07/30/okd-pre-release-testing/

So, you'll want to try an install of 4.16. Are you writing the ignition yourself? As the error notes, there is a mismatch of versions.

@JaimeMagiera JaimeMagiera self-assigned this Oct 8, 2024
@alrf
Author

alrf commented Oct 8, 2024

Hi @JaimeMagiera, I had not seen those. The ignition files are written by the openshift-install tool.
I saw this discussion: #2029
Is release 4.16 already published somewhere?
Actually, what is the official resource/url now to check for OKD releases?
Even the documentation for 4.17 mentions only FCOS: https://docs.okd.io/4.17/installing/overview/index.html#about-rhcos

@alrf
Author

alrf commented Oct 9, 2024

I found releases here: https://github.com/okd-project/okd-scos/releases
But openshift-install v4.16 shows Fedora images only:

# openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.qemu.formats."qcow2.xz".disk.location'
https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/39.20231101.3.0/x86_64/fedora-coreos-39.20231101.3.0-qemu.x86_64.qcow2.xz

# openshift-install coreos print-stream-json | grep -c fedora
113
# openshift-install coreos print-stream-json | grep -c centos
0
# openshift-install coreos print-stream-json | grep -c coreos
113

# openshift-install version
openshift-install 4.16.0-0.okd-scos-2024-09-27-110344
built from commit 17f65f8808858b0111fd1624e7ee45f96efdc5dc
release image quay.io/okd/scos-release@sha256:eb66b3806689ad6fd068c965e0800813a8193ca1b2be748ec481521bb98a9962

How do I then download the v4.16 SCOS disk image for a bare-metal installation?

@JaimeMagiera
Contributor

Sorry for the confusion. That repository for OKD-SCOS is not relevant here. The current state of OKD is that the nodes start as FCOS, and boot into SCOS using rpm-ostree after the installer runs. You can use fedora-coreos-39.20231101.3.0 for your nodes.

Currently, there are only nightly builds of OKD SCOS. We haven’t signed off on a GM release yet. We’re getting close and actually could use your help testing. Our ability to test bare-metal has been limited. A nightly that has passed E2E testing is here…

https://amd64.origin.releases.ci.openshift.org/releasestream/4-scos-stable/release/4.16.0-0.okd-scos-2024-09-24-151747

Let us know how it goes. Thanks.

@alrf
Author

alrf commented Oct 9, 2024

Thank you for the link. However, getting the installer and client is awkward: there are no direct links to archives, only this instruction:

Download installer and client with:

oc adm release extract --tools quay.io/okd/scos-release:4.16.0-0.okd-scos-2024-09-24-151747

Imagine I have a new host with no oc installed: it is a chicken-and-egg problem.

Back to the actual problems, I managed to install a bootstrap node and 1 master node on bare-metal.
However, on the bootstrap node:

[core@bootstrap01 ~]$ sudo -s
[systemd]
Failed Units: 1
  systemd-sysusers.service
[root@bootstrap01 core]# systemctl status systemd-sysusers.service
× systemd-sysusers.service - Create System Users
     Loaded: loaded (/usr/lib/systemd/system/systemd-sysusers.service; static)
     Active: failed (Result: exit-code) since Wed 2024-10-09 12:38:57 UTC; 55min ago
   Duration: 4.328s
       Docs: man:sysusers.d(5)
             man:systemd-sysusers.service(8)
    Process: 916 ExecStart=systemd-sysusers (code=exited, status=1/FAILURE)
   Main PID: 916 (code=exited, status=1/FAILURE)
        CPU: 55ms

Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd[1]: Starting Create System Users...
Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd-sysusers[916]: Creating group 'sgx' with GID 991.
Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd-sysusers[916]: /etc/gshadow: Group "sgx" already exists.
Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
Oct 09 12:38:57 bootstrap01.test-env.mydomain.com systemd[1]: Failed to start Create System Users.
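The messages suggest that `sgx` is already listed in /etc/gshadow while systemd-sysusers still tries to create it, which would happen if the entry is missing from /etc/group. That is an inference from the log, not something confirmed in this thread. A small sketch to list such mismatches (the function name `gshadow_only_groups` is hypothetical):

```shell
# Hypothetical helper: prints group names that appear in the gshadow-style
# file ($2) but not in the group-style file ($1). Both files use ':' as the
# field separator with the group name in the first field.
gshadow_only_groups() {
  awk -F: 'NR == FNR { seen[$1] = 1; next }
           !($1 in seen) { print $1 }' "$1" "$2"
}
```

On a node this would be run as root, for example `gshadow_only_groups /etc/group /etc/gshadow`, since /etc/gshadow is not world-readable.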

On the master node rpm-ostree fails:

[root@serverXXX core]# rpm-ostree status
A dependency job for rpm-ostreed.service failed. See 'journalctl -xe' for details.
○ rpm-ostreed.service - rpm-ostree System Management Daemon
     Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static)
    Drop-In: /etc/systemd/system/rpm-ostreed.service.d
             └─10-mco-default-env.conf
             /run/systemd/system/rpm-ostreed.service.d
             └─bug2111817.conf
             /etc/systemd/system/rpm-ostreed.service.d
             └─mco-controlplane-nice.conf
     Active: inactive (dead)
       Docs: man:rpm-ostree(1)

Oct 09 15:19:01 serverXXX.mydomain.com systemd[1]: Dependency failed for rpm-ostree System Management Daemon.
Oct 09 15:19:01 serverXXX.mydomain.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
error: Loading sysroot: exit status: 1

Additionally, I was not able to install the second master node because all required containers (like kube-*) stopped and disappeared on the bootstrap node:

# oc get node
The connection to the server api.test-env.mydomain.com:6443 was refused - did you specify the right host or port?


[root@bootstrap01 core]# crictl ps -a
CONTAINER           IMAGE                                                                                              CREATED             STATE               NAME                ATTEMPT             POD ID              POD
2e193f2a99c22       625b24f036a1dcd1480cceb191dee59b2945e8a73bd750f977e29bb177122092                                   About an hour ago   Running             etcd                0                   126366c482e0d       etcd-bootstrap-member-bootstrap01.test-env.mydomain.com
7dd584a6695b0       quay.io/okd/scos-content@sha256:0637f82bd0b20204b87f50c55a8fd627701a8f09f78fd2fa7c5a6a4ac8054a87   About an hour ago   Running             etcdctl             0                   126366c482e0d       etcd-bootstrap-member-bootstrap01.test-env.mydomain.com
[root@bootstrap01 core]#
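To make the "containers disappeared" check repeatable, the `crictl ps -a` listing can be compared against a list of expected container names. This is a sketch only: the function name `missing_containers` is hypothetical, and which names to expect (for example `kube-apiserver`) is an assumption that has to be adapted to the release being tested.

```shell
# Hypothetical helper: reads `crictl ps -a` output on stdin and prints every
# name given as an argument that has no row in state "Running".
# It relies on the NAME column directly following the STATE column.
missing_containers() {
  local listing name
  listing="$(cat)"
  for name in "$@"; do
    if ! printf '%s\n' "$listing" | awk -v n="$name" '
        { for (i = 1; i < NF; i++) if ($i == "Running" && $(i + 1) == n) found = 1 }
        END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}
```

Usage would look like `crictl ps -a | missing_containers etcd kube-apiserver`; with the listing above it would report `kube-apiserver` as missing.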

I came across the same behaviour on v4.15.

@JaimeMagiera
Contributor

In terms of the chicken/egg situation, we have a new Community Testing page with a link to the oc binaries.

https://okd.io/docs/community/community-testing/#getting-started

@JaimeMagiera JaimeMagiera added OKD SCOS 4.16 pre-release-testing Items related to testing nightlies before a release. labels Oct 9, 2024
@JaimeMagiera
Contributor

Can you walk me through the process you're following to install OKD? I feel like there's something missing. Also, what is your bare-metal configuration?

@alrf
Author

alrf commented Oct 9, 2024

Seems I found something:

Oct 09 16:47:39 serverXXX.mydomain.com systemd-fsck[111287]: /dev/md126 has unsupported feature(s): FEATURE_C12
Oct 09 16:47:39 serverXXX.mydomain.com systemd-fsck[111287]: e2fsck: Get a newer version of e2fsck!
Oct 09 16:47:39 serverXXX.mydomain.com systemd-fsck[111287]: boot: ********** WARNING: Filesystem still has errors **********
Oct 09 16:47:39 serverXXX.mydomain.com systemd-fsck[111285]: fsck failed with exit status 12.
Oct 09 16:47:39 serverXXX.mydomain.com systemd[1]: systemd-fsck@dev-disk-by\x2duuid-0464017b\x2d51dc\x2d45bb\x2da6a6\x2db96ba296763d.service: Main process exited, code=exited, status=1/FAILURE

Oct 09 16:47:39 serverXXX.mydomain.com systemd[1]: rpm-ostreed.service: Job rpm-ostreed.service/start failed with result 'dependency'.
Oct 09 16:47:39 serverXXX.mydomain.com systemd[1]: boot.mount: Job boot.mount/start failed with result 'dependency'.
Oct 09 16:47:39 serverXXX.mydomain.com systemd[1]: Software RAID monitoring and management was skipped because of an unmet condition check (ConditionPathExists=/etc/mdadm.conf).

In other words, rpm-ostreed fails because of the systemd-fsck failure, and systemd-fsck fails because e2fsck does not recognise a filesystem feature on /dev/md126 ("Get a newer version of e2fsck!").
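The `fsck failed with exit status 12` line can be decoded: according to fsck(8) the exit status is a sum of flag values, so 12 is 4 + 8. A minimal sketch of that decoding (the function name `decode_fsck_status` is made up, and the descriptions are shortened from the man page):

```shell
# Hypothetical helper: prints one line per flag set in an fsck exit status.
# Flag values follow fsck(8); the wording of each description is abbreviated.
decode_fsck_status() {
  if [ "$1" -eq 0 ]; then
    echo "no errors"
    return 0
  fi
  while IFS=: read -r bit meaning; do
    if [ $(( $1 & bit )) -ne 0 ]; then
      printf '%s\n' "$meaning"
    fi
  done <<'EOF'
1:errors corrected
2:system should be rebooted
4:errors left uncorrected
8:operational error
16:usage or syntax error
32:checking canceled by user request
128:shared-library error
EOF
}
```

For the status in the log, `decode_fsck_status 12` prints "errors left uncorrected" and "operational error", which matches the "Filesystem still has errors" warning.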

Bare metal config:
8 CPU, 64GB RAM, 477GB 2 disks in software RAID1.

@alrf
Author

alrf commented Oct 9, 2024

FEATURE_C12 is the orphan_file feature: https://askubuntu.com/a/1514580
According to that answer, you need a Linux kernel newer than v5.15, because the orphan_file feature requires it.
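To check whether a given filesystem carries this feature, the "Filesystem features:" line printed by `tune2fs -l` can be inspected. The sketch below assumes that line format and accepts both spellings, since tools that do not know the feature show it as FEATURE_C12, as in the log above. The function name `has_orphan_file` is hypothetical.

```shell
# Hypothetical helper: reads `tune2fs -l <device>` output on stdin and
# succeeds if the feature list names orphan_file (or FEATURE_C12, the label
# used for it in the fsck log when the tool does not know the feature).
has_orphan_file() {
  local line
  while IFS= read -r line; do
    case "$line" in
      "Filesystem features:"*)
        case " ${line#Filesystem features:} " in
          *" orphan_file "*|*" FEATURE_C12 "*) return 0 ;;
        esac
        ;;
    esac
  done
  return 1
}
```

A possible invocation, run as root, is `tune2fs -l /dev/md126 | has_orphan_file && echo "orphan_file is set"`; the device name is taken from the log and may differ on other hosts.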

Checking my servers, SCOS has:

[root@serverSCOS core]# uname -a
Linux serverSCOS.mydomain.com 5.14.0-511.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 19 06:52:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

But FCOS (I have another bare-metal server installed with v4.15 and Fedora CoreOS) has:

[root@serverFCOS core]# uname -a
Linux serverFCOS.mydomain.com 6.7.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb  5 22:21:14 UTC 2024 x86_64 GNU/Linux

That's the reason why I didn't have any issues with rpm-ostree on FCOS - the kernel there is newer.
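The comparison can be scripted as a rough check. The sketch below compares only the major.minor part of a `uname -r` string against a floor version; the function name `kernel_newer_than` is hypothetical, the "newer than 5.15" threshold is the one quoted from the linked answer, and a version string says nothing about features a distribution may have backported.

```shell
# Hypothetical helper: succeeds if the kernel release in $1 (as printed by
# `uname -r`) has a major.minor strictly greater than the floor in $2.
kernel_newer_than() {
  local rel_mm rel_major rel_minor fl_major fl_minor
  rel_mm="$(printf '%s' "$1" | cut -d. -f1,2)"   # "5.14.0-511.el9" -> "5.14"
  rel_major="${rel_mm%%.*}"
  rel_minor="${rel_mm#*.}"
  rel_minor="${rel_minor%%[!0-9]*}"              # drop suffixes such as "-rc1"
  fl_major="${2%%.*}"
  fl_minor="${2#*.}"
  if [ "$rel_major" -ne "$fl_major" ]; then
    [ "$rel_major" -gt "$fl_major" ]
  else
    [ "$rel_minor" -gt "$fl_minor" ]
  fi
}
```

With the two releases shown above, `kernel_newer_than 6.7.4-200.fc39.x86_64 5.15` succeeds and `kernel_newer_than 5.14.0-511.el9.x86_64 5.15` fails. On a live host it could be called as `kernel_newer_than "$(uname -r)" 5.15`.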

@alrf
Author

alrf commented Oct 10, 2024

I get the same results on v4.17.
I also found this discussion: #1997
IMO, neither 4.16 nor 4.17 can be released until the orphan_file issue is fixed.


2 participants