MGMT-17771: Adds enhancement for FIPS with multiple RHEL installer versions #6290

244 changes: 244 additions & 0 deletions docs/enhancements/multi-rhel-fips.md

---
title: fips-with-multiple-rhel-versions
authors:
- "@carbonin"
creation-date: 2024-05-07
last-updated: 2024-05-20
---

# Support FIPS for installers built for different RHEL releases

## Summary

In order for an OpenShift cluster to be considered FIPS compliant, the installer
must be run on a system with FIPS mode enabled and with FIPS compliant openssl
libraries installed. This means running a dynamically linked `openshift-install`
binary against the openssl libraries present on our container image. Today this
is not a problem because all `openshift-install` binaries in use have expected
to link against RHEL 8 based openssl libraries, but OpenShift 4.16 will ship an
installer that requires RHEL 9 libraries.

This will require assisted-service to maintain a way to run the
`openshift-install` binary in a compatible environment for multiple openssl
versions. Specifically, FIPS-enabled installs for pre-4.16 releases will need to
run on an el8 image, and 4.16 and later releases will need to run on an el9
image (regardless of FIPS).

## Motivation

FIPS compliance is important for our customers and assisted-service should be
able to install FIPS compliant clusters.

### Goals

- Allow for a single installation of assisted-service to install FIPS-compliant
clusters using installer binaries built against RHEL 8 or RHEL 9

- Allow for FIPS compliant clusters to be installed from the SaaS offering or
Contributor:

Does the SaaS also require app-SRE to chime in to install FIPS compliant clusters?

Member:

Yes.

Contributor:

we have an open task for app-sre https://issues.redhat.com/browse/SDE-3692

Member Author:

Right, also this comment was more about the solution allowing for this possibility, not necessarily that, once implemented, we will be able to install FIPS from the SaaS. The main goal is to make FIPS work for both rhel8 and rhel9 installer releases.

the on-prem offering

### Non-Goals

- Changing cluster install interfaces to accommodate new FIPS requirements
should be avoided

- Dynamically determining a given release's RHEL version. Assisted service will
track the minimum version for using el9, and if a version can't be determined
for some reason (FCOS may not use the same versioning scheme) el9 will be
the default (a sketch of this selection logic is shown below).
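
The selection described above can be a simple version comparison; a minimal
sketch, assuming a hypothetical helper in assisted-service (the package,
function, and constant names are illustrative, not existing code):

```go
package installerimage

import (
	"strconv"
	"strings"
)

// minimumEL9Minor is the first 4.x minor release whose installer is built
// against RHEL 9 (4.16, per this enhancement).
const minimumEL9Minor = 16

// RunnerFor returns which runner ("el8" or "el9") should handle the installer
// for the given OpenShift version. If the version can't be parsed (for
// example, FCOS-style version strings), el9 is the default.
func RunnerFor(openshiftVersion string) string {
	parts := strings.Split(openshiftVersion, ".")
	if len(parts) < 2 || parts[0] != "4" {
		return "el9"
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return "el9"
	}
	if minor < minimumEL9Minor {
		return "el8"
	}
	return "el9"
}
```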

## Proposal

Two additional containers will run alongside assisted-service in the same pod.
These "installer-runner" containers will expose an HTTP API local to the pod
using a unix socket. assisted-service can then choose which API to contact to
run an installer binary for a specified release to generate the manifests
required for a particular install. These manifests will then be uploaded to
whatever storage is in use for this deployment (local for on-prem, or s3 for
SaaS), and assisted-service will take over as usual from there.
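
A minimal sketch of what the serving side of an installer-runner could look
like, assuming an illustrative socket path, endpoint, and payload (the real API
shape is still an open question below):

```go
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
	"os"
)

// generateRequest is a placeholder for whatever payload assisted-service ends
// up sending (cluster record, install config, release image, ...).
type generateRequest struct {
	ClusterID    string `json:"cluster_id"`
	ReleaseImage string `json:"release_image"`
}

func main() {
	const socketPath = "/shared/installer-runner-el9.sock" // illustrative path

	_ = os.Remove(socketPath) // clean up a stale socket from a previous run
	listener, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatalf("failed to listen on %s: %v", socketPath, err)
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/manifests", func(w http.ResponseWriter, r *http.Request) {
		var req generateRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Here the runner would invoke openshift-install for req.ReleaseImage,
		// upload the generated manifests, and report any failure in the response.
		w.WriteHeader(http.StatusCreated)
	})

	log.Fatal(http.Serve(listener, mux))
}
```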

### User Stories

#### Story 1

As a Cluster Creator, I want to install FIPS compliant clusters for any supported OpenShift version.

### Implementation Details/Notes/Constraints

#### New Images

Two new container images will need to be built and published for every release
footprint we support. These images will be created based on existing
assisted-service code, but could be split into their own independent projects
later.

### Risks and Mitigations

Shipping a new image is a non-trivial process and may take more time to set up
than we have. We could likely get away with using the existing assisted-service
image with a different entrypoint for one of the runner images, but that still
requires us to publish a new image for the RHEL base that assisted-service
itself will not be using.

## Design Details [optional]
Member:

This service closely matches what we need to implement. It could almost be borrowed as-is, mostly modifying what action it takes when receiving a request. Its purpose is to receive events from ansible-runner on a local unix socket shared between the containers.

https://github.com/operator-framework/ansible-operator-plugins/blob/main/internal/ansible/runner/eventapi/eventapi.go

Member Author:

Yup, looks about right to me, thanks for the reference


- A new `installer-runner` service will be created, written in Go.
- The installer-runner will be compiled twice: once in a RHEL 8 builder image,
and once in a RHEL 9 builder image, with each resulting binary being placed
into a RHEL base image of corresponding version.


The new runner containers will expose an HTTP server using a unix socket.
assisted-service will POST to one of these servers when it needs manifests generated.
The runner container will respond with any error that occurred while generating
the manifests, or with success, in which case assisted-service will assume the
manifests were created and uploaded successfully.
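
On the assisted-service side, talking to a server on a unix socket only needs a
custom dialer on the standard HTTP client; a sketch, where the socket path and
`/manifests` endpoint are assumptions rather than a settled API:

```go
package runnerclient

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
)

// New returns an *http.Client whose requests are all sent over the given unix
// socket; the host in request URLs is ignored by the custom dialer.
func New(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// GenerateManifests POSTs the (already serialized) request payload to the
// runner and treats any non-2xx status as a failure.
func GenerateManifests(client *http.Client, payload []byte) error {
	// "unix" is a placeholder host; routing happens in the dialer above.
	resp, err := client.Post("http://unix/manifests", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("manifest generation failed: %s", resp.Status)
	}
	return nil
}
```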

### API

The new services would effectively wrap the existing `InstallConfigGenerator`
interface; a sketch of the request and response types follows the lists below.

API call input:
- `common.Cluster` JSON
- install config
- release image

API call output:
- Appropriate HTTP response
- Error message if the call was not successful
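
Concretely, the bodies could look like this; the package, type, and field names
are assumptions, not an agreed contract:

```go
// Package runnerapi sketches the request and response bodies for the
// installer-runner HTTP API; every name here is illustrative.
package runnerapi

import "encoding/json"

// GenerateManifestsRequest mirrors the inputs listed above.
type GenerateManifestsRequest struct {
	Cluster       json.RawMessage `json:"cluster"`        // serialized common.Cluster
	InstallConfig json.RawMessage `json:"install_config"` // install-config content/overrides
	ReleaseImage  string          `json:"release_image"`
}

// GenerateManifestsResponse carries only the failure detail; success is
// signaled by the HTTP status code.
type GenerateManifestsResponse struct {
	Error string `json:"error,omitempty"`
}
```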

### Installer Cache

The installer cache directory will be shared (as it's currently on the PV), but
the installers used by the two runners will never overlap.

### Packages

The installer runners will be built with the packages required to run the
installer in FIPS mode.

### Open Questions
Member:

Do we want the new microservice to have its own repo? That may be a good end-state for the sake of isolated testing and such.

Member Author:

I briefly called this out in New Images under Implementation Details, but if you think it would be better placed somewhere else or discussed more prominently then we can do that.

I don't think it's an open question though. For this pass I'm going to create the service in assisted-service using the existing code to run the installer. Extracting and splitting everything up would definitely come later.


1. What does the API look like for the runner containers? What data should be
Member:

A simple start would be to have a shared volume among all three containers.

  1. assisted-service would invoke an installer within the correct "runner container" and wait
  2. the installer would generate content and write it to the shared volume
  3. assisted-service would get the signal that the installer run was complete and get a reference to the location for its output

The shared storage could be ephemeral; it just needs to be persistent for the life of the pod.

Member:

How long do we expect the openshift-install binary to take? It should be quick enough that we could keep an HTTP request open and write the response to the client when it's done, right? No need for an asynchronous task API?

Maybe the exception would be if the installer binary isn't already in the local cache. Should assisted-service continue managing one cache that gets mounted into each installer-runner container?

Member Author:

I think the cache will be managed by the installer runner container, and this container will also upload the resulting artifacts to the required location (either the shared data volume or s3).

I don't expect the operation to take too long. I think we could do a single HTTP request for this. Even if we need to tweak the timeout parameters, I'd rather do that than build an async system.

Member Author:

Also regarding the storage side of this.

The cache and the place where the generated files are stored after the installer runs are already on a volume (specifically a PV), and this is what I intended to use for the shared storage, so I think that's a non-issue.

My question here was what exactly we will need to pass over the API call.

passed in an API call and what should be configured in the container
environment?
2. What specific packages are required for these new images?

### UI Impact

No impact

### Test Plan

FIPS and regular installs should be tested for all supported OpenShift versions.
Since this should be mostly transparent to the user, regression testing in
addition to testing 4.16 with and without FIPS should be sufficient.

## Drawbacks

- This is a complicated change in architecture; something simpler might be more
desirable.

- Creating two additional containers in a pod makes the assisted service more
expensive to scale.

- Creating, maintaining, and releasing additional images is a non-trivial amount
of additional work.

## Alternatives
Contributor:

If we're saying that <4.16 is not FIPS compliant anyhow, can we say that for those releases we're going to use the statically linked installer, and for >=4.16 the dynamically linked one with rhel9? Could we then use the single existing container?

Member:

I don't think we're saying that. OpenShift can be installed in FIPS mode today, and that's been the case for a while.

Contributor:

In the ACM/SaaS world, or in general?

Member Author:

In general and in ACM IIUC

Member:

Sources for "OpenShift can be installed in FIPS mode today" include the 4.12 docs for FIPS installation and the ROSA FedRAMP docs:

ROSA only uses FIPS-validated modules to process cryptographic libraries.

Member:

Have you considered installing the RHEL 8 OpenSSL 1.1 into the RHEL 9 container alongside OpenSSL 3.0?
(Note: I haven't tried this to see what conflicts.)

Member Author:

I considered this, but I'm a bit worried about landing this in time and this approach feels like it has more unknown potential pitfalls than the approach described here.

Member:

I did bring up the alternative, which should probably be added to this doc, of extracting an entire rhel 8 userspace within a rhel 9 userspace, and using chroot to run an installer that needs rhel 8. But that brings other complexities around image build and management that probably aren't worthwhile compared to just running two copies of a small process in two different container images.

Contributor:

There's no need for chroot. We can use the `LD_PRELOAD` env var to point to the relevant path when installing <4.16.

Member Author:

Is it possible that this could be as simple as installing a separate ssl version somewhere else and setting some env vars when running the installer?

I can give that a try if there's a chance it might work.

Will ssl be the only library I need to override in this way? Everything else will be backward compatible? Will this actually be FIPS compliant? Are there any other requirements we would also need to satisfy?

Member Author:

Running a rhel8 container (our current released image), the files provided by the older openssl rpm are these:

```
bash-4.4$ rpm -ql openssl-libs-1.1.1k-12.el8_9.x86_64
/etc/pki/tls
/etc/pki/tls/certs
/etc/pki/tls/ct_log_list.cnf
/etc/pki/tls/misc
/etc/pki/tls/openssl.cnf
/etc/pki/tls/private
/usr/lib/.build-id
/usr/lib/.build-id/14
/usr/lib/.build-id/14/0abbb0c09726652dd61128b74b8bb5010f5542
/usr/lib/.build-id/42
/usr/lib/.build-id/42/9301d8c47d78f29d7a39fe01d8076e7b656e4e
/usr/lib/.build-id/6f
/usr/lib/.build-id/6f/619a1566052cd522eb945d91197ad60852a8f8
/usr/lib/.build-id/bb
/usr/lib/.build-id/bb/0f1c2857f6f9e4f68c7c5dc36a41c441318eca
/usr/lib/.build-id/e1
/usr/lib/.build-id/e1/1d63787a0230c919507d7ebc73e8af7e8e8a2b
/usr/lib64/.libcrypto.so.1.1.1k.hmac
/usr/lib64/.libcrypto.so.1.1.hmac
/usr/lib64/.libssl.so.1.1.1k.hmac
/usr/lib64/.libssl.so.1.1.hmac
/usr/lib64/engines-1.1
/usr/lib64/engines-1.1/afalg.so
/usr/lib64/engines-1.1/capi.so
/usr/lib64/engines-1.1/padlock.so
/usr/lib64/libcrypto.so.1.1
/usr/lib64/libcrypto.so.1.1.1k
/usr/lib64/libssl.so.1.1
/usr/lib64/libssl.so.1.1.1k
/usr/share/licenses/openssl-libs
/usr/share/licenses/openssl-libs/LICENSE
```

I suppose I could use a multi-stage build to copy those files into a special directory on the final (rhel9) image, then try to use LD_PRELOAD to point to those library versions instead of the ones installed on the system.

I'd need a fully FIPS-enabled environment to test whether it's working though.

Member:

LD_PRELOAD (or rather LD_LIBRARY_PATH, which I think is what you're going for here) is not really the issue - libssl.so.1.1 and libssl.so.3.0 are different major versions of the shared library, so AIUI the binary will know to load the right one. The issue is what other files the RPM installs that might conflict. Mostly just those .cnf files by the looks of it?

Member Author (@carbonin, May 15, 2024):

I'm not sure you can install different versions of the same RPM on the same system.
Also, I don't know if the older ssl package is even available in the el9 repos.

My idea was to use a multi-stage container build: install the ssl rpm on an el8 image, then copy the files directly into a directory in a later stage. Then, when we know we need to run the el8 binary, we'd set whatever envs we need to allow it to find the old .so.

Member Author:

Hopefully I'll have an environment to test this in this morning and then if it works I'll run it by the FIPS experts to figure out if this process would be considered compliant.


### Use Jobs

Hive is investigating using the container image from the release payload to run
the installer as a Job.

- This wouldn't work for the podman deployment, which isn't directly productized
or supported but is still a way we encourage people to try out the service.
This could be overcome by retaining a way to run the installer on the
service container, but then both methods would need to be tested and maintained.

- This wouldn't work for the Agent Based Installer (ABI), which runs the services
using podman. This could also be overcome by retaining a way to run the installer
local to the service, since the image version run by ABI will always match the
target cluster, but again both methods of running the installer would need to
be maintained indefinitely.

- It's unclear how many jobs we would end up running concurrently. It would be
difficult to find out from the SaaS how many installer processes are being run
concurrently (maybe we should create a metric for this), but the telco scale
team regularly runs several hundred concurrently, maxing out at over three
thousand in a few hours. Unless we're cleaning up the jobs rather aggressively,
I don't think it would be good to create this many.

- Multiple jobs would need to be run on a single assets directory. This seems
prohibitively complex compared to the proposed solution. During a single
install the following installer commands are used:
- `openshift-baremetal-install create manifests`
- `openshift-baremetal-install create single-node-ignition-config` or
`openshift-baremetal-install create ignition-configs` (depending on HA mode)

### Run the matching installer on the assisted-service container

Clusters whose installers already match the assisted-service container's base
OS could be handled by the assisted-service container as we do today. This
would require one less image and container per pod, but having the same process
for every cluster install would be easier to understand and maintain.

### Use RPC over HTTP

[Go's RPC](https://pkg.go.dev/net/rpc) could be used instead of a direct
HTTP server (RPC can be hosted over HTTP, but that's not what is being addressed
here). RPC would make this a simpler change, as the code for generating the
manifests is already contained in a single package, but RPC would be a strange
choice if we were to move the handling into a truly separate service in the
future.
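
For comparison, a minimal sketch of what the `net/rpc` variant could look like
over a unix socket; all type, method, and path names are illustrative:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
)

// ManifestArgs stands in for the same inputs as the HTTP variant above.
type ManifestArgs struct {
	ClusterJSON   []byte
	InstallConfig []byte
	ReleaseImage  string
}

// ManifestReply reports where the generated files ended up.
type ManifestReply struct {
	ManifestsDir string
}

// InstallerRunner would wrap the existing InstallConfigGenerator logic.
type InstallerRunner struct{}

// GenerateManifests would run the installer for args.ReleaseImage; returning a
// non-nil error reports the failure back to the caller.
func (r *InstallerRunner) GenerateManifests(args ManifestArgs, reply *ManifestReply) error {
	reply.ManifestsDir = "/shared/manifests" // illustrative
	return nil
}

func main() {
	if err := rpc.Register(&InstallerRunner{}); err != nil {
		log.Fatal(err)
	}
	listener, err := net.Listen("unix", "/shared/installer-runner-rpc.sock") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	rpc.Accept(listener) // serve connections until the listener is closed
}
```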

### Install multiple libraries on the same image

It may be possible to install both versions of the shared libraries required by
the installers (libcrypto and libssl?) for FIPS compliance on a single image.
This would require much less change and should be significantly quicker to
implement, but it's not clear whether this would be possible or supportable.
This could be achieved by any of the following methods:

1. Create a separate userspace for el8 libraries and chroot when those libraries
are required.
- This seems a bit complicated and it will likely make our image quite a bit
larger than it already is (~1.3G).
Member Author:

I tried this out and was able to get both installers to run in a container on a FIPS-enabled RHEL9 host using chroot for the rhel8 one. I only needed to copy /usr from the el8 image and create the top-level symlinks in the chroot target dir. Based on my tests with the centos images this adds about 200M to the image size.

I still need to get assurance that this would actually result in a FIPS-compliant cluster, but if it does it seems like the most promising option to me.

I think this is worth a POC. @mhrivnak @romfreiman What do you think?

Member Author:

Would this approach cause a problem for image scanning and tracking of packages for CVEs and such?

Member Author:

Asked about this in the FIPS-related internal channels, and it turns out nothing like this would be considered FIPS-compliant, so this disqualifies any approach that doesn't run the installer on the base container image it expects to run on.

I'll add this to the enhancement text.

Contributor:

chroot :)

2. Install both versions of the required packages simultaneously.
- Not sure if this is possible given that the packages share a name and are
only different in version.
3. Use multi-stage container builds to copy the libraries from an el8 image to a
directory on the el9 image and use `LD_PRELOAD` or manipulate `LD_LIBRARY_PATH`
to point the el8 installer binaries to the correct libraries.

The approach using chroot worked, but FIPS SMEs said that the container base
image *must* match the installer for the resulting cluster to be considered
FIPS-compliant, so none of these multi-library options are valid.

### Publish multiple assisted-service images
Member Author:

@romfreiman does this look like what you were talking about?
Tried to capture it here.


It's likely that a user installing in FIPS mode will only be interested in
Member:

Why do we think this is true?

Everyone starts with a single version. But existing clusters and their use cases lag behind new clusters and new use cases. I would expect that most users end up with a mix of openshift versions over time.

Keep in mind that users need the ability to re-provision their clusters at any time, for recovery from all kinds of unexpected events. And a growing number of customers depend on regular "cluster as a service" provisioning. Telling such a user that they can only (re-)provision a subset of openshift versions would generally be a big disadvantage.

Even a large-scale user with relatively uniform versions is not going to upgrade all of their clusters at once. It would not be ideal to put them in a position where they can only provision clusters for the newest openshift versions in their fleet.

Is there reason to think that a FIPS user would have different patterns and expectations?

Member Author:

You're right that this is an assumption on my part, but my experience with even moderately conservative customers is that they have a single version of OCP that is vetted and allowed in the org (at least in production).

My guess was that any user interested in full FIPS compliance would be even more strict than that, but I have no actual users to back any of this up.

Member Author:

As for upgrades and reprovisioning: I consider this option a temporary solution that buys us time to properly think through and vet something that would work more generically. Ideally, in the version following the one this alternative is implemented in, we'd implement the main body of this enhancement (or something similar).

Member:

They start with a single vetted version, then that progresses over time, but upgrades lag.

I think it's still better to implement this enhancement as described, which would fully preserve the existing use cases and user experience, and then optimize it later as needed.

Contributor:

@mhrivnak all good. Can we find such a customer now? We'll have it as an RFE and improve it in the next release. I don't think we should restructure the whole service two weeks before the release.
For the next release, let's reconsider.

installing a single OCP version at a time. This means that a given version of
assisted will still need to support both el8 and el9 FIPS installs, but a single
deployment of assisted would not.

To handle this, the assisted-service image would be built twice: once based on
el8 and again based on el9. Both images would be released, and the operator would
choose which to deploy based on configuration (likely an annotation, since a more
robust solution would be preferred in the future).
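
A sketch of how the operator might make that choice, assuming a hypothetical
annotation key and image references:

```go
package operator

// chooseServiceImage picks which assisted-service image the operator should
// deploy; the annotation key and image references are hypothetical.
func chooseServiceImage(annotations map[string]string) string {
	const (
		baseOSAnnotation = "agent-install.openshift.io/assisted-service-base-os" // hypothetical key
		el8Image         = "quay.io/example/assisted-service:el8"                // hypothetical reference
		el9Image         = "quay.io/example/assisted-service:el9"                // hypothetical reference
	)
	if annotations[baseOSAnnotation] == "el8" {
		return el8Image
	}
	// el9 is the default, matching the 4.16+ installer requirement.
	return el9Image
}
```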

For example, in the case that a user knew they wanted to deploy OCP 4.14 from a
Member:

This assumes that the user who owns the configuration of the management cluster itself is the same user who is provisioning clusters. But often that role is separate. The "Cluster Creator" would need to work with the cluster admin to re-configure assisted-installer to switch it from el8 mode to el9 mode. Then what if they need to switch back? This could become a real hassle.

It would help to have automation that recognizes when the config doesn't work for a desired install operation, but resolving that could be a pain. Even for cases where it is the same person doing both personas, it's inconvenient.

Member Author:

This is already something that the application admin is choosing to a degree when they set the available OS images, so I don't think it's too much of a stretch for them to also communicate which OCP versions will be in use.

Member:

Fair point. So then the time will come when the cluster creator asks the admin to add a new version that requires removal of all other versions. Then they'll have to negotiate an upgrade plan with all stakeholders to get the whole fleet safely over the hump, etc etc. Doesn't that seem like a pain? If we can avoid creating that burden for our customers, we'll all be better-off.

FIPS-enabled hub cluster, they would need to indicate to the operator that the
el8-based assisted-installer should be deployed. Assisted service could also
check that the OCP version, current base OS, and FIPS mode were all aligned
before attempting to run the installer.

To avoid issues when installing in a non-FIPS environment, assisted-service
could also default to the statically linked installer binary for OCP 4.16 and
above, but this doesn't change anything for earlier releases.

This would be something that could be implemented more quickly with less risk
while also leaving open the possibility of a more complex solution to the general
problem in a future release.