OCPNODE-2461: enhancement for split filesystem
kannon92 committed Jul 31, 2024
1 parent e97e677 commit eb11659
Showing 1 changed file with 273 additions and 0 deletions.
273 changes: 273 additions & 0 deletions enhancements/kubelet/split-filesystem.md
@@ -0,0 +1,273 @@
---
title: split-filesystem
authors:
- kannon92
reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect"
- TBD
approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval.
- TBD
api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
- TBD
creation-date: 2024-07-30
last-updated: 2024-07-30
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- https://issues.redhat.com/browse/OCPNODE-2461
see-also:
- "https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md"
---

# Split Filesystem

## Open Questions

- The installer does not support creating OpenShift clusters with multiple disks.
  Does this feature have value if users cannot configure their cluster with a separate filesystem?

- Do we need a drop-in configuration for container storage?
  - https://github.com/containers/storage/pull/1885

- How does one delete all images and containers once the container runtime config is changed?
  - Run crictl against all images and containers on each node?


## Summary

Upstream Kubernetes has released [KEP-4191](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md).
This feature aims to allow one to separate the read-only layers (images) from the writeable layer of a container.
Upstream Kubernetes focused on allowing Kubelet garbage collection and eviction to work if the filesystem is split.

This enhancement focuses on enabling the feature in OpenShift.

## Motivation

See [KEP Motivation](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4191-split-image-filesystem/README.md#motivation)

### Definitions

### User Stories

As an OpenShift admin, I want to store images on a filesystem separate from ephemeral storage and the writeable layer.
The images can be on a read-only filesystem while ephemeral storage and the writeable layers can live on a writeable filesystem.

### Goals

- Enable the ability to split the filesystem in OpenShift.
- Automate the setup of this feature to avoid user errors.

### Non-Goals

- We will not support a separate disk for the entire container runtime filesystem, due
to a lack of customer interest.

- Customers will not be able to change the location of the image cache.

### Sketch of what happens when this feature is enabled

- A user specifies in the container runtime config that they want to split the filesystem (that is, use this feature).
- The feature gate is set for the kubelet.
- Container storage uses the image store (hard-coded to /var/lib/images).
- /var/lib/images is relabeled with the same SELinux labels as /var/lib/containers/storage.

### Manual Enablement of Feature

In the developer preview, a user can run the following steps to enable this feature.

We will automate these steps for tech preview.

#### Feature gate

The user needs to set the `KubeletSeparateDiskGC` feature gate in the kubelet config.

#### Storage Configuration

One could use a butane config as follows:

```storage.bu
variant: openshift
version: 4.14.0
metadata:
  name: 40-storage-override
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    - path: /etc/containers/storage.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          [storage]
          # Default Storage Driver
          driver = "overlay"
          runroot = "/var/containers/storage"
          graphroot = "/var/lib/containers/storage"
          # Separate location for newly pulled images (key name per containers-storage.conf)
          imagestore = "/var/lib/images"
```

Then run `butane storage.bu -o storage.yaml` to generate the machine config.

Applying storage.yaml (for example with `oc apply -f storage.yaml`) applies this machine config to your workers.

#### Labeling Filesystem

One could use the following systemd unit to relabel the image store location.

```
[Unit]
Description=Label ImageStore
After=crio-install.service
[Service]
Type=oneshot
ExecStart=rpm-ostree install \
    -y \
    --apply-live \
    --allow-inactive \
    policycoreutils-python-utils
ExecStart=semanage fcontext -a -e /var/lib/containers/storage /var/lib/images
ExecStart=restorecon -R -v /var/lib/images
[Install]
WantedBy=multi-user.target
```

#### Remove all old images

Since the image cache has changed locations, any old images left over in the previous location should be removed.

The simplest option is to remove the images on each node where this feature was enabled.
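
As a rough illustration of what that cleanup might look like, the sketch below only prints the `crictl` invocations an admin could run on a node; it does not execute them. The helper name `cleanupCommands` is hypothetical, and the exact procedure is still an open question in this proposal, so treat the command list as an assumption rather than the supported workflow.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanupCommands returns the crictl invocations that could be used to clear
// a node before switching image store locations. This is a dry-run helper:
// it only builds the argument lists and never runs them.
func cleanupCommands() [][]string {
	return [][]string{
		// Force-remove all pod sandboxes (and with them their containers).
		{"crictl", "rmp", "--all", "--force"},
		// Remove all images from the current image store.
		{"crictl", "rmi", "--all"},
	}
}

func main() {
	for _, argv := range cleanupCommands() {
		fmt.Println(strings.Join(argv, " "))
	}
}
```

Pods are removed before images so that no running container still references an image when it is deleted.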

## Proposal

### Background

The main way to enable a split filesystem is to add `imagestore` to the container storage configuration.

See [container storage](https://github.com/containers/storage/blob/main/docs/containers-storage.conf.5.md).

```
imagestore="" The image storage path (the default is assumed to be the same as graphroot). Path of the imagestore, which is different from graphroot. By default, images in the storage library are stored in the graphroot. If imagestore is provided, newly pulled images will be stored in the imagestore location. All other storage continues to be stored in the graphroot. When using the overlay driver, images previously stored in the graphroot remain accessible. Internally, the storage library mounts graphroot as an additionalImageStore to allow this behavior.
```
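
The defaulting rule quoted above can be sketched in a few lines. This is a simplified model, not the containers/storage implementation: the `storageConfig` type and `newImageLocation` function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// storageConfig models the two containers-storage.conf keys discussed above.
// It is a hypothetical type, not one taken from the storage library.
type storageConfig struct {
	GraphRoot  string // graphroot: writeable layers and, by default, images
	ImageStore string // imagestore: optional separate location for newly pulled images
}

// newImageLocation returns the directory where newly pulled images would be
// stored: the imagestore when it is set, otherwise the graphroot.
func newImageLocation(c storageConfig) string {
	if c.ImageStore == "" {
		return c.GraphRoot
	}
	return c.ImageStore
}

func main() {
	combined := storageConfig{GraphRoot: "/var/lib/containers/storage"}
	split := storageConfig{
		GraphRoot:  "/var/lib/containers/storage",
		ImageStore: "/var/lib/images",
	}
	fmt.Println("default:", newImageLocation(combined))
	fmt.Println("split:  ", newImageLocation(split))
}
```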

Container storage is configured in OpenShift by writing this [file](https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/container-storage.yaml) to /etc/containers/storage.conf.

Container Runtime Config allows one to change the overlay size of storage. Other fields of this file are kept the same as the template.


### Feature Gate

We will add a feature gate to openshift/api.

```golang
FeatureGateKubeletSeparateDiskGC = newFeatureGate("KubeletSeparateDiskGC").
	reportProblemsToJiraComponent("node").
	contactPerson("kannon92").
	productScope(kubernetes).
	enableIn(configv1.DevPreviewNoUpgrade).
	mustRegister()
```

### Configuration of container storage

### API Changes

```golang
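type ContainerRuntimeConfiguration struct {
	// ... existing fields elided ...

	// SplitFilesystem requests that images be stored separately from the writeable layers.
	// +optional
	SplitFilesystem bool `json:"splitFilesystem,omitempty"`
}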
```

API is defined [here](https://github.com/openshift/api/blob/0d46442e8df17a87cfcf6666ab19b28e88620b59/machineconfiguration/v1/types.go#L812).

This feature will write the updated container storage file.

This will also trigger labeling of /var/lib/images.
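
To make the effect of the new field concrete, here is a minimal sketch of how the storage file contents might vary with `SplitFilesystem`. It is not the machine-config-operator's rendering code: `renderStorageConf` and the constants are hypothetical, and the output omits every other field of the real template.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	graphRoot      = "/var/lib/containers/storage"
	imageStorePath = "/var/lib/images" // hard-coded location named in this proposal
)

// renderStorageConf returns a minimal [storage] section. The imagestore key
// is emitted only when the split filesystem option is requested.
func renderStorageConf(splitFilesystem bool) string {
	var b strings.Builder
	b.WriteString("[storage]\n")
	b.WriteString("driver = \"overlay\"\n")
	fmt.Fprintf(&b, "graphroot = %q\n", graphRoot)
	if splitFilesystem {
		fmt.Fprintf(&b, "imagestore = %q\n", imageStorePath)
	}
	return b.String()
}

func main() {
	fmt.Print(renderStorageConf(false))
	fmt.Println("---")
	fmt.Print(renderStorageConf(true))
}
```

Leaving the key out when the option is off keeps the default behaviour, where images live in the graphroot.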


### Risks and Mitigations

Kubernetes and OpenShift do not really advertise support for separate filesystems.
We also do not allow most container runtime settings to be configured. Changing configuration
in this area can break your system.

To reduce this risk, in tech preview we will propose an API to configure imagestore.

### Drawbacks

## Test Plan

TBD

## Graduation Criteria

### Dev Preview

- Ability to view if a user has configured this feature.
- Feature gates to enable this feature
- API change to streamline configuration of this feature.

### Dev Preview -> Tech Preview

Will update once we are ready to promote.

### Tech Preview -> GA

Will update once we are ready to promote.

### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

## Upgrade / Downgrade Strategy

TBD

We have a few major items to call out.

### Upgrade from feature off to feature on

Call scenario A a cluster where this feature is not enabled and CRI-O is not configured for it.

Call scenario B a cluster where this feature is enabled.

When upgrading from scenario A to scenario B, CRI-O should delete all images and repull them.
This is because the image cache moves to a different location, and images left in the old location could cause problems.

On a reboot of the node, all existing services will repull their images.

It is important to remove all images and containers before using this feature.

### Upgrade from feature on to feature on

Upgrades where the feature enablement stays the same should have no impact.
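
The transitions described so far can be summarised as a small decision function. This is only a sketch of the rules stated in this section; the `action` type and `transitionAction` name are hypothetical, and the on-to-off case is deliberately left unspecified because the proposal does not define it yet.

```go
package main

import "fmt"

// action describes what a node would need to do for a given transition.
type action string

const (
	noAction      action = "no impact"
	wipeAndRepull action = "remove images and containers, then repull"
	unspecified   action = "not specified by this proposal"
)

// transitionAction maps the feature state before and after an upgrade to the
// handling described in this section.
func transitionAction(wasEnabled, isEnabled bool) action {
	switch {
	case wasEnabled == isEnabled:
		// Feature enablement stays the same.
		return noAction
	case !wasEnabled && isEnabled:
		// Feature off to feature on: the image cache moves.
		return wipeAndRepull
	default:
		// Feature on to feature off: the downgrade path is not yet described.
		return unspecified
	}
}

func main() {
	fmt.Println("off -> on:", transitionAction(false, true))
	fmt.Println("on  -> on:", transitionAction(true, true))
	fmt.Println("on  -> off:", transitionAction(true, false))
}
```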

### Downgrading from feature on to feature off


## Version Skew Strategy

Support for this feature was merged into CRI in 4.15. However, the feature is only supported in 4.18 and above.

This is due to an issue in containers/storage around the imagestore implementation. This feature was not backported to 4.17.

## Support Procedures

Document failure modes as this will be interesting to explain.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used
to highlight and record other possible approaches to delivering the
value proposed by an enhancement, including especially information
about why the alternative was not selected.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.
