idea: "upper layer" (erofs) inside of initramfs #332

allisonkarlitskaya · 2024-09-05T15:29:34Z

This is a really vague idea that I discussed with @cgwalters and @travier today. They both said it belongs here as an issue. At this point this is little more than a raw braindump. There's a lot to think through and discuss.

The erofs produced by mkcomposefs on a reasonably complete /usr is in the double-digit MB range. I've generally seen ~50MB, and it compresses well (down to more like 10MB). The initramfs+kernel on my Silverblue system is in the low triple digits (~150MB, most of which is the initramfs).

It wouldn't be completely unreasonable, then, to have a complete static copy of the composefs "upper layer" erofs image inside of the UKI. This would completely side-step quite a lot of thorny issues around binding the UKI to the correct deployment: all you'd need is the kernel image and the digest store.
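As a rough sanity check on the sizes quoted above, the arithmetic can be sketched as follows. The figures are the approximate ones from this comment (ballpark numbers from one Silverblue system, not measurements of any particular build), and the helper name is made up for illustration.

```python
# Rough size estimate using the approximate figures quoted in this comment.
EROFS_RAW_MB = 50         # uncompressed composefs erofs for a fairly complete /usr
EROFS_COMPRESSED_MB = 10  # the same image after compression
UKI_TODAY_MB = 150        # kernel + initramfs on the Silverblue system mentioned above


def growth_percent(base_mb: float, added_mb: float) -> float:
    """Return how much adding `added_mb` grows `base_mb`, as a percentage."""
    return 100.0 * added_mb / base_mb


compressed = growth_percent(UKI_TODAY_MB, EROFS_COMPRESSED_MB)
raw = growth_percent(UKI_TODAY_MB, EROFS_RAW_MB)
print(f"compressed erofs: +{compressed:.1f}% on top of the current UKI")
print(f"uncompressed erofs: +{raw:.1f}% on top of the current UKI")
```

With these inputs the compressed image adds roughly 7% to the UKI, and the uncompressed one roughly a third.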

How we get a UKI with this erofs inside of it could go two ways:

1. generate this on the end-user system by (deterministic magic) which lets us get a UKI which is bit-for-bit the same as the one we were expecting it to be. We'd have some out-of-band signature somewhere (in some metadata that doesn't become part of the image) that we could then use for signing this.
2. push everything to the container image creation: the kernel image would be created as the last step of the image creation process. This would involve running mkcomposefs inside of the container, on the contents of the container itself, and embedding the resulting blob into the UKI, which we'd then write to the container image at a well-known path. Any signing that we might want to do as part of creating the image could happen at this point, inside of the image (or in another build stage and copied back into the final image).

The second approach has an extremely simple deployment strategy: just extract the container .tar directly into a composefs digest store (without creating the erofs). The backing store should now contain all of the files that the erofs referred to. Install the kernel image into the EFI ESP and you're done.

The second way seems wonderfully simple until you realize that there are some very serious drawbacks there:

- we're essentially creating a new container format: the metadata about which files are part of the image is stored in the .tar of the image, but now also in the erofs that we put inside of the UKI.
- which means, of course, that it's no longer possible to make casual modifications to the container to add a file or install an extra package or so: you need to regenerate the kernel image. Maybe that's not so bad?

I think the second approach could be extremely nice for specific deployment scenarios, but it's a very different flavour than what has been promised for the "FROM fedora / ADD / RUN / ..." approach to OS customization.

So that takes us back to a reality where we probably want to support the first scenario of building the composefs and assembling the UKI on the end system. That needs a lot of thinking...

This also intersects with the question of what a signature from an OS vendor on a particular kernel image means. Today it's possible to have a signed kernel boot an unsigned root filesystem. Tomorrow we seem to want to go in a direction where there are additional assurances about the root filesystem contents as well, but if it remains possible to continue booting arbitrary root filesystems with a different version of the same kernel, then this promise is a whole lot less meaningful. In fact, the entire "look how easy it is to customize your system!" bootc ideal sort of depends on being able to modify the root filesystem without needing to re-sign the kernel... @travier mentioned that we can support both scenarios with kernel variations which produce unique PCR measurements, allowing the data partition to be encrypted by a key that will only be available if we boot a "trusted" rootfs. There are some very deep product-level decisions here...


allisonkarlitskaya · 2024-09-05T15:35:03Z

One note about performance/memory trade-offs: having the erofs as part of the UKI (and then permanently stored in RAM) would mean that the entire metadata of the system partition is in RAM. ls -lR /usr would always happen without touching the disk. It's more data to load when booting the kernel image, but having that data pre-loaded as a small blob up front seems like it should probably be a net win. It would have to be measured. It also means that we have a chunk of RAM that we've "wasted"...

allisonkarlitskaya · 2024-09-05T15:39:49Z

Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.
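The pruning half of such a tool could look roughly like the sketch below. It assumes the scanning step already exists and has produced, for each installed UKI, the set of object digests its embedded erofs refers to; the function name, the image names and the shortened digests are all hypothetical.

```python
from typing import Dict, Iterable, Set


def prunable_objects(
    store: Iterable[str],
    references: Dict[str, Set[str]],
    keep: Iterable[str],
) -> Set[str]:
    """Return the digests in `store` that none of the kept images refers to."""
    live: Set[str] = set()
    for image in keep:
        live |= references[image]
    return set(store) - live


# Hypothetical, shortened digests purely for illustration.
references = {
    "uki-old": {"aa11", "bb22", "cc33"},
    "uki-new": {"bb22", "cc33", "dd44"},
}
store = {"aa11", "bb22", "cc33", "dd44"}

# Only the object that is exclusive to the removed image can be pruned.
print(sorted(prunable_objects(store, references, keep=["uki-new"])))  # ['aa11']
```

Objects shared between the old and new image stay in the store, which is what makes upgrades cheap in this model.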

travier · 2024-09-05T16:52:37Z

One part of implementing this idea is to adapt https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c to use this EROFS instead of looking at the sysroot.

travier · 2024-09-09T08:36:01Z

Here is a potential flow where we could use that feature, which would help us work around SELinux issues and remove the need for build-time commits:

Build via a Containerfile:

# "Normal" build part where you customize your image
FROM base-image as target
RUN Make changes here as needed

# Use a side image to build the composefs & UKI
FROM target as builder
RUN Rebuild SELinux policy
RUN - Do an ostree commit with the changes (i.e. we need to figure out what changed)
    - using the context from the updated SELinux policy
    - and get the full composefs EROFS for the final root
RUN Compress and append the EROFS blob to the initramfs in a pre-defined place
RUN Install ukify & Secure Boot signing tools
RUN Build a UKI with the kernel, initramfs, command line config from the container image and sign it, output to /uki

# Go back to the final image and include just the UKI
FROM target
COPY --from=builder /uki /uki

Then on the final system we would do:

1. ostree container image pull, which will import all the objects from the "target" image, including the UKI. We will just ignore the xattrs and SELinux labels.
2. Copy the UKI from the imported ostree commit to the ESP
3. Do the rename dance to get it in the right order for boot
4. Reboot

We tried something similar while prototyping: https://github.com/travier/fedora-coreos-uki/blob/main/fcos-uki/Containerfile

travier · 2024-09-09T08:47:32Z

The major change with this approach is that we clearly split the file content from the metadata and the container becomes a way to only transport object data plus a UKI which includes all the metadata. Thus the deployed rootfs becomes an object store only and we don't "care" about ostree commits anymore as we don't need to sign them or use them to regenerate the composefs metadata on the systems.

cgwalters · 2024-09-09T12:51:58Z

> Another requirement of the "UKI inside the OCI container" approach (and maybe the "UKI generated locally" approach as well): we'd probably want a tool that could scan the UKI to find out which blobs it refers to in the digest store. This is important for pruning the store when removing old images.

Yes. Combined with this comment, it argues in general for some new tooling: not too large or complex, but new tooling nevertheless. One option is to implement it in this repo as a build-time option; a variant of that is to implement it in Rust (also in this repo). Maybe something like a composefs-boot crate?

cgwalters · 2024-09-09T13:00:13Z

I chatted with @allisonkarlitskaya about this and there's a lot to like about the simplicity of this approach - I'm 100% on board with continuing investigation of this direction.

My biggest concern was that I'd also really like to build the story of using composefs for apps/extensions/configmaps etc. and this model reduces the alignment between those two approaches.

Combining, this issue also intersects strongly with #294 where I was trying hard to think of a way to bring OCI metadata under verity protection. Hmmm...I guess probably the simplest variant that would work for this is to require the UKI to always be in a distinct layer (with a special annotation like composefs.boot or something), and the manifest that gets included inside the image doesn't have that layer.
Also worth thinking about here is the related issue I was thinking about around how we store individual layers. We must support only fetching changed layers across upgrades.
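The "UKI in a distinct layer" variant above could be sketched as follows: given an OCI manifest as a plain dictionary, split out the layers carrying the boot annotation so that the manifest embedded inside the image does not list them. The annotation key is the "composefs.boot or something" floated above, and its value, the function name and the placeholder digests are assumptions, not an agreed format.

```python
import copy
from typing import List, Tuple

# Annotation name floated in the comment above; not a settled convention.
BOOT_ANNOTATION = "composefs.boot"


def split_boot_layers(manifest: dict) -> Tuple[dict, List[dict]]:
    """Return (copy of manifest without boot layers, list of boot layer descriptors)."""
    inner = copy.deepcopy(manifest)
    boot, rest = [], []
    for layer in inner.get("layers", []):
        if BOOT_ANNOTATION in layer.get("annotations", {}):
            boot.append(layer)
        else:
            rest.append(layer)
    inner["layers"] = rest
    return inner, boot


# Placeholder descriptors; real ones would carry mediaType, real digests and sizes.
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"digest": "sha256:<rootfs-layer>", "size": 1},
        {"digest": "sha256:<uki-layer>", "size": 1, "annotations": {BOOT_ANNOTATION: "uki"}},
    ],
}
inner, boot = split_boot_layers(manifest)
print([layer["digest"] for layer in inner["layers"]])
print([layer["digest"] for layer in boot])
```

The original manifest is left untouched, so the same input can still be used for the registry-facing manifest that does include the UKI layer.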

travier · 2024-09-13T09:14:47Z

In #332 (comment), I forgot that we still need to do the 3-way merge for /etc, so we still need a "deployment" of it, which makes this a bit more complex.

travier · 2024-09-13T11:21:52Z

We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

jbtrystram · 2024-09-13T11:32:28Z

> We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

(when using LUKS on the rootfs)

cgwalters · 2024-09-13T15:33:03Z

> In #332 (comment), I forgot that we still need to do the 3-way merge for /etc, so we still need a "deployment" of it, which makes this a bit more complex.

For ostree yes, though we also support etc.transient where that wouldn't be needed.

I think we could ship initramfs glue in this project such that the "mount composefs from initramfs" logic could, in theory, be very agnostic, i.e. we have:

- sysroot.mount
- composefs-mount.service (replaces /sysroot with a composefs setup, with backing objects in something native like /composefs/objects maybe? But the backing store can be configured in some way (an xattr on the cfs? a config file?))
- ostree-prepare-root.service (mounts /etc and /var the way ostree does it today using the physical root; note also the intersection with "Canonical method to find backing filesystem (and block device)" #280), but the ostree bits could obviously be replaced with something else for non-ostree consumers
- initrd-root-fs.target
- ...
- switchroot

cgwalters · 2024-09-13T15:51:01Z

> We've also realized that including the composefs EROFS file in the UKI means that it is now public, thus the file listing and metadata are public. This is not really an issue but just something to be aware of.

Instead of "public" I would say "not encrypted on disk" to be clear. "public" often implies to me "accessible to the whole Internet" but for images generated on premise and deployed to servers that are physically secured, I wouldn't say the UKIs here are "public".

That said...AFAIK there's nothing that would block someone from encrypting the erofs in the initramfs, and decrypting using e.g. a key stored in the TPM or something.

ericcurtin · 2024-09-27T08:23:42Z

As regards composefs/erofs inside of a UKI, this wouldn't work so well for CentOS Automotive Stream Distribution/Red Hat In-Vehicle OS. Two reasons.

We spent a lot of time minimising initramfs for super-fast boots, we are talking < 10M in size and < 2 seconds in boot time. Now we do have to read the whole composefs eventually for verification. During the initial read of the UKI, userspace cannot proceed with anything until the whole UKI is read, decompressed and the kernel populates the initramfs filesystem.

The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.
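To make the size constraint concrete, a budget check along these lines shows how quickly an embedded erofs eats into such a limit. Only the 32M/64M limits and the "< 10M" initramfs bound come from this comment; the erofs figure is the compressed estimate from the issue description (a Silverblue system, not an automotive image), and the kernel, dtb and cmdline sizes are purely hypothetical placeholders.

```python
def headroom_mb(limit_mb: float, parts_mb: dict) -> float:
    """Return the space left under `limit_mb` (negative when over the limit)."""
    return limit_mb - sum(parts_mb.values())


parts = {
    "kernel": 12.0,     # hypothetical size of a stripped-down kernel
    "cmdline": 0.01,    # hypothetical
    "dtb": 0.1,         # hypothetical
    "initramfs": 10.0,  # upper bound from "< 10M" above
}
with_erofs = dict(parts, erofs=10.0)  # compressed erofs estimate from the issue description

for limit in (32.0, 64.0):
    print(
        f"limit {limit:.0f}M: "
        f"{headroom_mb(limit, parts):.2f}M free without erofs, "
        f"{headroom_mb(limit, with_erofs):.2f}M with it"
    )
```

Under these made-up inputs the 32M platforms end up slightly over budget once the erofs is added, which is the kind of outcome the hard limit makes likely.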

In Automotive we can fork a little from the technique decided on here; we already do that as one of the users of composefs.

All we need to do is store a digest in initramfs to ensure what we are booting is what we intend and many of these concerns go away.

Also tagging @alexlarsson he'd likely be interested in a read here.

ericcurtin · 2024-09-27T08:31:52Z

In fact, I've discussed this with the systemd guys once or twice and they agree: initramfs is a dated filesystem, and we should keep it as small (and as irrelevant) as possible. There are more efficient ways of creating volatile throwaway verified filesystems these days (composefs, erofs, overlayfs, fs-verity, dm-verity, etc.). Also, if one is referencing an erofs inside it, you cannot unmount the initramfs.

cgwalters · 2024-09-27T09:01:23Z

We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).

The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that no extra keys/signatures are required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".

> The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.

That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)

Let's be a bit more specific: ~~how big is your initramfs today?~~ (nevermind, < 10M), How big is the composefs for it?

amnoni · 2024-09-27T09:21:19Z

+Kuznetsov, Vitaly +Daniel Berrange


ericcurtin · 2024-09-27T10:11:32Z

I recommend people take a look at the Android Boot Image and composefs implementation in cs9 auto, FWIW. Android Boot Image is a kernel+dtb+cmdline+initramfs blob; it's very similar to a UKI.

> We're not going to break C9S auto. We will continue to support the way things work today. A big advantage of composefs is flexibility - there's multiple ways to do things (at the same time of course we don't want to support too many paths).

Understood, this feedback is not intended to block any efforts.

> The advantage of the "rootfs-meta-in-initramfs" model in a nutshell is that no extra keys/signatures are required other than the Secure Boot one. But again, the existing way ostree+composefs works will clearly continue to work - and isn't specific to ostree, it's just "key in the initramfs verifies signature of digest of composefs".

I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

> > The other reason is we run on some platforms that have a hard limit of 64M/32M for kernel+cmdline+dtb+initramfs combined. We have a stripped down kernel for this purpose also.

> That said I think a general approach many use cases (including yours) should be going to is keeping the main root small anyways and having most of the bits in containers, i.e. mount real root, pivot, then go into the real root, mount further container images via composefs dynamically (verifying their signature however one wants...etc.)

We should favour containers when possible, but there are cases where we cannot use containers. I think there are more scalable solutions than this.

> Let's be a bit more specific: ~~how big is your initramfs today?~~ (nevermind, < 10M), How big is the composefs for it?

I'll build an OS image sometime for exact measurements, need to leave early today for a wedding, so it will be Monday...

There are also no definite sizes for these things in the Automotive OS, but I'll post the minimal sizes anyway. A partner may want to add something to initramfs, may want to add a camera application to composefs (these advanced camera applications can be huge and are not suitable for containers).

cgwalters · 2024-09-27T12:10:46Z

> I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

One detail here is that assuming you do the "transient key" model, you throw away reproducible builds - was mentioned in an ASG talk. The "static key" model solves that, but...hmm, I think has other problems.

Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.
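The lookup-and-verify step described here could be sketched like this. It is a minimal illustration, assuming a flat objects/<digest> layout as written above and a digest file containing a single hex string. The digest is computed in userspace following the fs-verity definition (SHA-256, 4096-byte blocks, no salt); real boot code would presumably have the kernel enforce or measure the digest instead. The function names are made up.

```python
import hashlib
import struct
from pathlib import Path

BLOCK_SIZE = 4096      # fs-verity block size assumed for this sketch
HASH_ALG_SHA256 = 1    # FS_VERITY_HASH_ALG_SHA256


def _hash_blocks(data: bytes) -> list:
    """SHA-256 each BLOCK_SIZE chunk of `data`, zero-padding the last one."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def fsverity_digest(data: bytes) -> str:
    """Compute the fs-verity file digest of `data` as a hex string."""
    if data:
        level = _hash_blocks(data)
        # Pack hashes into blocks and hash again until one root hash remains.
        while len(level) > 1:
            level = _hash_blocks(b"".join(level))
        root = level[0]
    else:
        root = bytes(32)  # the root hash of an empty file is all zeroes
    # struct fsverity_descriptor: version, algorithm, log2(block size), salt size,
    # reserved, data size, root hash[64], salt[32], reserved[144] (256 bytes total).
    descriptor = struct.pack(
        "<BBBBIQ64s32s144s", 1, HASH_ALG_SHA256, 12, 0, 0, len(data), root, b"", b""
    )
    return hashlib.sha256(descriptor).hexdigest()


def find_verified_image(digest_file: Path, objects_dir: Path) -> Path:
    """Return the image named by `digest_file`, or raise if its digest mismatches."""
    expected = digest_file.read_text().strip()
    image = objects_dir / expected
    actual = fsverity_digest(image.read_bytes())
    if actual != expected:
        raise ValueError(f"digest mismatch for {image}: got {actual}")
    return image
```

A caller in the initramfs would pass the path of the digest file shipped in the UKI and the objects directory on the physical root, then mount whatever path comes back.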

ericcurtin · 2024-09-27T12:28:20Z

> > I think it's easy to extend the trust from secure boot key to rootfs, just chain checksums/digests.

> One detail here is that assuming you do the "transient key" model, you throw away reproducible builds - was mentioned in an ASG talk. The "static key" model solves that, but...hmm, I think has other problems.

> Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.

^ This is what I mean by the "rootfs.digest" file concept... That's basically what we do in the automotive distro, it scales better...

cgwalters · 2024-09-27T12:38:26Z

> That's basically what we do in the automotive distro, it scales better...

Is it though? Aren't you using the ostree+composefs integration which does this with signature covering the ostree commit, which has the composefs digest? That's all that ostree-prepare-root.service does today...and hence it requires a key.

I think what happened is probably a conceptual overlap between the ostree commit and the composefs. Today composefs is just awkwardly glued onto the side of ostree (not a criticism, doing more starts to get hard, but now we're at that point where doing the hard things is worth it for a cleanup).

But yes we could change ostree-prepare-root.service to look for /usr/lib/ostree/composefs.meta in the initramfs which would be a pair of:

- path to composefs blob (which today is deployment specific)
- its expected digest

Hmm maybe yes the conceptual conflict was between ostree commits and the composefs blob, but if we're treating it as canonical then yeah my instinct here is:

- Invent /composefs/objects as a recommended standard thing (or well, I guess it could be /usr/lib/composefs/objects in the physical root...dunno)
- Change ostree to also link() the ostree-composefs into that directory based on its fsverity digest
- Now we don't need both a path and a digest in the initramfs, and can standardize /usr/lib/composefs/rootfs.digest per above

alexlarsson · 2024-09-27T14:24:01Z

> Anyways...wait...why don't we just have the expected fsverity digest of the composefs in the UKI as e.g. /usr/lib/composefs/rootfs.digest or something like that, and then we know to look in /objects/<digest>, verify its digest against the expected one and mount that? Why would it be any harder than that? I feel like I must be missing something...a bit sleep deprived but I can't think of any problems.

Generally this doesn't work with ostree because the UKI is stored in the ostree tree, so it becomes a recursive cycle. We break the cycle by using the one-time key.

alexlarsson · 2024-09-27T14:26:13Z

It could work in a system where the UKI and the rootfs are completely independent though.

cgwalters · 2024-09-27T14:32:25Z

> Generally this doesn't work with ostree because the UKI is stored in the ostree tree, so it becomes a recursive cycle

Yeah, I remember this. But we can also break that cycle via just excluding the UKI from the composefs. That seems quite simple to do.

jbtrystram · 2024-09-27T14:57:37Z

> But we can also break that cycle via just excluding the UKI from the composefs. That seems quite simple to do.

That is the conclusion we came to with @travier. It's okay because the UKI is signed, so not having it covered by fsverity does not matter.

allisonkarlitskaya · 2024-09-27T15:46:42Z

> It could work in a system where the UKI and the rootfs are completely independent though.

This is the situation I have in my head. I imagine that we have a system image in the form of a container and a UKI somewhere in a "special" path in that image that does not become part of the composefs, but goes directly into the EFI ESP.

allisonkarlitskaya mentioned this issue Sep 6, 2024: all: add 'copy' mount option #334 (Draft)

cgwalters added the area/booting and enhancement labels Sep 9, 2024

This was referenced Sep 16, 2024:
composefs end state "v1" goal containers/storage#2095 (Open)
support deploying a composefs directly ostreedev/ostree#3291 (Open)

cgwalters mentioned this issue Sep 26, 2024: Support specifying deployment in ostree= kernel arg directly ostreedev/ostree#3314 (Open)

cgwalters mentioned this issue Sep 27, 2024: UKI/systemd-boot tracker containers/bootc#806 (Open)
