From 957070f5ddb07a5e6b649610edc918e8c7abf0b3 Mon Sep 17 00:00:00 2001
From: Kevin Hannon
Date: Tue, 30 Jul 2024 17:53:21 -0400
Subject: [PATCH] OCPNODE-2461: enhancement for split filesystem

---
 enhancements/kubelet/split-filesystem.md | 282 +++++++++++++++++++++++
 1 file changed, 282 insertions(+)
 create mode 100644 enhancements/kubelet/split-filesystem.md

diff --git a/enhancements/kubelet/split-filesystem.md b/enhancements/kubelet/split-filesystem.md
new file mode 100644
index 0000000000..683baccbd6
--- /dev/null
+++ b/enhancements/kubelet/split-filesystem.md
@@ -0,0 +1,282 @@
+---
+title: split-filesystem
+authors:
+  - kannon92
+reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect"
+  - TBD
+approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval.
+  - TBD
+api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
+  - TBD
+creation-date: 2024-07-30
+last-updated: 2024-07-30
+tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
+  - https://issues.redhat.com/browse/OCPNODE-2461
+see-also:
+  - "https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md"
+---
+
+# Split Filesystem
+
+## Open Questions
+
+- The installer does not support creating OpenShift clusters with multiple disks.
+  Does this feature have value without users being able to configure their cluster to have a separate filesystem?
+
+- Do we need a drop-in configuration for container storage?
+  - https://github.com/containers/storage/pull/1885
+
+- How does one delete all images and containers once the container runtime config is changed?
+  - Use crictl to remove all images and containers on each node?
+
+- What is the best form of telemetry to show that a customer is using this feature?
+
+- How would this feature work with layering?
+
+- Day 2 operations for adding disks to OpenShift.
+
+## Summary
+
+Upstream Kubernetes has released [KEP-4191](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md).
+This feature aims to allow one to separate the read-only layers (images) from the writeable layer of a container.
+Upstream Kubernetes focused on allowing kubelet garbage collection and eviction to work if the filesystem is split.
+
+This enhancement focuses on enablement in OpenShift.
+
+## Motivation
+
+See [KEP Motivation](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md#motivation)
+
+### Definitions
+
+### User Stories
+
+As an OpenShift admin, I want to store images in a separate filesystem from ephemeral storage and the writeable layer.
+The images can be on a read-only filesystem while ephemeral storage and the writeable layers can live on a writeable filesystem.
+
+### Goals
+
+- Enable the ability to split the filesystem in OpenShift.
+- Automate the setup of this feature to avoid user errors.
+
+### Non-Goals
+
+- We will not support a separate disk for the entire container runtime filesystem due
+to a lack of customer interest.
+
+- Customers will not have the ability to change the location of the image cache.
+
+### Sketch of what happens when this feature is enabled
+
+- A user specifies in the container runtime config that they want to split the filesystem (use this feature)
+  - The feature gate is set for the kubelet.
+  - Container storage uses the image store (hard coded to /var/lib/images).
+  - /var/lib/images is relabeled to match the SELinux labels of /var/lib/containers/storage.
+
+### Manual Enablement of Feature
+
+In the developer preview, a user can run the following steps to enable this feature.
+
+We will automate these steps for tech preview.
+
+#### Feature gate
+
+The user needs to set the `KubeletSeparateDiskGC` feature gate in the kubelet config.
+
+#### Storage Configuration
+
+One could use a butane config as follows:
+
+```storage.bu
+variant: openshift
+version: 4.14.0
+metadata:
+  name: 40-storage-override
+  labels:
+    machineconfiguration.openshift.io/role: worker
+storage:
+  files:
+    - path: /etc/containers/storage.conf
+      mode: 0644
+      overwrite: true
+      contents:
+        inline: |
+          [storage]
+
+          # Default Storage Driver
+          driver = "overlay"
+
+          runroot = "/var/run/containers/storage"
+          graphroot = "/var/lib/containers/storage"
+          imagestore = "/var/lib/images"
+```
+
+And then run `butane storage.bu -o storage.yaml`.
+
+Applying storage.yaml will apply this machine config to your workers.
+
+#### Labeling Filesystem
+
+One could use the following systemd unit to relabel the imagestore location.
+
+```
+    [Unit]
+    Description=Label ImageStore
+    After=crio-install.service
+
+    [Service]
+    Type=oneshot
+    ExecStart=rpm-ostree install \
+        -y \
+        --apply-live \
+        --allow-inactive \
+        policycoreutils-python-utils
+    ExecStart=semanage fcontext -a -e /var/lib/containers/storage /var/lib/images
+    ExecStart=restorecon -R -v /var/lib/images
+
+    [Install]
+    WantedBy=multi-user.target
+```
+
+#### Remove all old images
+
+Since the image cache has changed locations, all the old images left over should be removed.
+
+The simplest option is to remove the images on each node where this feature was enabled.
+
+#### Checking if the feature is enabled on a node
+
+One can use `crictl imagefsinfo` to see if the filesystem is split. This will show imageFilesystems and containerFilesystems.
+
+If the filesystem is split, containerFilesystems reports a different mount than imageFilesystems.
+
+## Proposal
+
+### Background
+
+The main way to enable a split filesystem is to add `imagestore` to the container storage configuration.
+
+See [container storage](https://github.com/containers/storage/blob/main/docs/containers-storage.conf.5.md).
+
+```
+imagestore="" The image storage path (the default is assumed to be the same as graphroot). Path of the imagestore, which is different from graphroot. By default, images in the storage library are stored in the graphroot. If imagestore is provided, newly pulled images will be stored in the imagestore location. All other storage continues to be stored in the graphroot. When using the overlay driver, images previously stored in the graphroot remain accessible. Internally, the storage library mounts graphroot as an additionalImageStore to allow this behavior.
+```
+
+Container storage is configured in OpenShift by writing this [file](https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/container-storage.yaml) to /etc/containers/storage.conf.
+
+Container Runtime Config allows one to change the overlay size of storage. Other fields of this file are kept the same as the template.
+
+
+### Feature Gate
+
+We will add a feature gate to openshift/api.
+
+```golang
+    FeatureGateKubeletSeparateDiskGC = newFeatureGate("KubeletSeparateDiskGC").
+        reportProblemsToJiraComponent("node").
+        contactPerson("kannon92").
+        productScope(kubernetes).
+        enableIn(configv1.DevPreviewNoUpgrade).
+        mustRegister()
+```
+
+### Configuration of container storage
+
+### API Changes
+
+```golang
+type ContainerRuntimeConfiguration struct {
+  ...
+  // SplitFilesystem, when true, stores images in a separate imagestore (/var/lib/images).
+  // +optional
+  SplitFilesystem bool `json:"splitFilesystem,omitempty"`
+}
+```
+
+API is defined [here](https://github.com/openshift/api/blob/0d46442e8df17a87cfcf6666ab19b28e88620b59/machineconfiguration/v1/types.go#L812).
+
+This feature will write the updated container storage file.
+
+This will also trigger labeling of /var/lib/images.
+
+### Risks and Mitigations
+
+Kubernetes and OpenShift do not prominently advertise support for separate filesystems.
+We also do not allow most of the container runtime to be configured. Changing configuration
+in this area can break your system.
+
+To reduce this risk, in tech preview, we will propose an API to configure imagestore.
+
+### Drawbacks
+
+## Test Plan
+
+TBD
+
+## Graduation Criteria
+
+### Dev Preview
+
+- Ability to view if a user has configured this feature.
+- Feature gates to enable this feature.
+- API change to streamline configuration of this feature.
+
+### Dev Preview -> Tech Preview
+
+Will update once we are ready to promote.
+
+### Tech Preview -> GA
+
+Will update once we are ready to promote.
+
+### Removing a deprecated feature
+
+- Announce deprecation and support policy of the existing feature
+- Deprecate the feature
+
+## Upgrade / Downgrade Strategy
+
+TBD
+
+We have a few major items to call out.
+
+### Upgrade from feature off to feature on
+
+Let's say scenario A does not have this feature enabled and CRI-O is not configured.
+
+Let's say scenario B has this feature enabled.
+
+If one wants to upgrade from scenario A to scenario B, CRI-O should delete all images and repull.
+This is because the image cache will no longer be in the same location, which could cause problems.
+
+On a reboot of the node, all existing services will repull their images.
+
+It will be important to remove all the images and containers before using this feature.
+
+### Upgrade from feature on to feature on
+
+Upgrades where the feature enablement stays the same should have no impact.
+
+### Downgrading from feature on to feature off
+
+## Version Skew Strategy
+
+The support for this feature was merged into CRI in 4.15. However, this feature is only supported for 4.18 and above.
+
+This is due to an issue in containers/storage around the imagestore implementation. This feature was not backported to 4.17.
+
+## Support Procedures
+
+Document failure modes as this will be interesting to explain.
+
+## Alternatives
+
+Similar to the `Drawbacks` section, the `Alternatives` section is used
+to highlight and record other possible approaches to delivering the
+value proposed by an enhancement, including especially information
+about why the alternative was not selected.
+
+## Infrastructure Needed [optional]
+
+Use this section if you need things from the project. Examples include a new
+subproject, repos requested, github details, and/or testing infrastructure.
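
For the `crictl imagefsinfo` check described in this enhancement, a minimal Go sketch of the comparison might look like the following. It assumes the command emits JSON with `status.imageFilesystems` and `status.containerFilesystems` arrays whose entries carry `fsId.mountpoint`; that shape, and the names `fsUsage`, `imageFsInfo`, and `isSplit`, are illustrative assumptions rather than part of the proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fsUsage keeps only the mount point of one reported filesystem.
// The JSON field names are an assumption about crictl's output shape.
type fsUsage struct {
	FsID struct {
		Mountpoint string `json:"mountpoint"`
	} `json:"fsId"`
}

// imageFsInfo mirrors the assumed top-level layout of `crictl imagefsinfo`.
type imageFsInfo struct {
	Status struct {
		ImageFilesystems     []fsUsage `json:"imageFilesystems"`
		ContainerFilesystems []fsUsage `json:"containerFilesystems"`
	} `json:"status"`
}

// isSplit reports whether any container filesystem is mounted somewhere
// other than the image filesystems, i.e. whether the filesystem is split.
func isSplit(raw []byte) (bool, error) {
	var info imageFsInfo
	if err := json.Unmarshal(raw, &info); err != nil {
		return false, fmt.Errorf("parsing imagefsinfo output: %w", err)
	}

	// Collect every mount point reported for images.
	imageMounts := make(map[string]bool)
	for _, fs := range info.Status.ImageFilesystems {
		imageMounts[fs.FsID.Mountpoint] = true
	}

	// A container mount point that images do not share means a split.
	for _, fs := range info.Status.ContainerFilesystems {
		if !imageMounts[fs.FsID.Mountpoint] {
			return true, nil
		}
	}
	return false, nil
}
```

Passing the JSON output captured from `crictl imagefsinfo` on a node to `isSplit` would return `true` only when the two lists disagree on mount points, which matches the manual check described in the proposal. Comparing mount points rather than usage numbers keeps the check independent of how full each filesystem is.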