Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

composefs integration in the overlay driver #1646

Merged
merged 7 commits into from
Jul 5, 2023

Conversation

giuseppe
Copy link
Member

ComposeFS has been integrated into the overlay driver. The EROFS filesystem is now used to mount the file system metadata. Each layer is mounted individually, but we can change this later and merge multiple layers into a single file system.

Filesystem-level integrity verification is now enabled through the implementation of fs-verity. This enhancement ensures the integrity of both the composefs blob and data files, providing an additional layer of security. Note that this implementation is currently considered a best-effort attempt since there is no control yet on how to configure it as well as there is no support from the overlay driver in the kernel (it is being worked on upstream).

It requires the composefs-from-json tool installed and accessible in the path from the https://github.com/containers/composefs project.

pkg/chunked/storage_linux.go Outdated Show resolved Hide resolved
pkg/chunked/storage_linux.go Show resolved Hide resolved
drivers/overlay/composefs_supported.go Outdated Show resolved Hide resolved
drivers/overlay/composefs_supported.go Outdated Show resolved Hide resolved
drivers/overlay/composefs_supported.go Show resolved Hide resolved
drivers/overlay/composefs_supported.go Show resolved Hide resolved
@cgwalters
Copy link
Contributor

Only images that are using the zstd:chunked and eStargz format can be
used in this way since the metadata is stored in the image itself.

Wait...why? What metadata?

@alexlarsson
Copy link
Contributor

See e.g. https://www.redhat.com/sysadmin/faster-container-image-pulls
Basically, there is a json format with file metadata and digests that is stored in the compression chunks in the oci image tar layers. This can be used to detect which files in the tarball need to be downloaded (via ranged http gest), and the same file can be used to generate the composefs image.

@giuseppe giuseppe force-pushed the chunked-libcomposefs branch 2 times, most recently from cfea72a to 8cec35c Compare June 19, 2023 15:39
@cgwalters
Copy link
Contributor

OK, yes I see this JSON format was introduced in a0ff89e

A few things here. I think a few choices in this JSON format are suboptimal:

  • using string for symlink targets requires them to be valid Unicode; I would bet money that there are container images with non-Unicode filenames.
  • using string for the xattr names has a similar conceptual problem (although non-Unicode xattr names are probably very rare and weird if they exist). But xattr values is more problematic, because file capabilities aka security.capability are not valid Unicode.
  • I've already mentioned that IMO parsing human-readable dates here is just unnecessarily inefficient

So I would still propose that this JSON format should stay as a c/storage thing, and we have a different format in composefs upstream.

(Or, c/storage could switch to a different format, because I believe this tooling in c/storage is still experimental)

@alexlarsson
Copy link
Contributor

  • But xattr values is more problematic, because file capabilities aka security.capability are not valid Unicode.

Some xattrs also contain a terminating zero byte. selinux spring to mind, although i may be misremembering.

@cgwalters
Copy link
Contributor

Another thing about the times: EROFS seems to (as you'd expect) make "ctime == mtime == atime", because for a read-only system these must all be the same; (and it further optimizes the common case of squashing all timestamps in an image to a common build time).

Another way to say this is that it doesn't support any of these times having different values, and hence if we care about this use case (as is implicit in these being different values in the JSON here) we have a problem. I somehow doubt there are container images out in the world where something actually depends on having the atime being honored even if it's encoded in the tar stream.

@cgwalters
Copy link
Contributor

Well, I don't want to spend too much time going down this rabbit hole, but it sure looks to me like the default GNU tar option won't serialize atime/ctime, and GNU tar is commonly used to make "base images". And it looks to me like the same is true of the current c/storage code; I don't see it assigning to AccessTime, though the code tries to support unpacking it at least.

Anyways so...do we really need to care about atime/ctime? I am doubtful. But regardless again, it can't be in composefs-from-json as there is no support for it.

@cgwalters
Copy link
Contributor

Another concern I have with all of this is that we're going to be passing this JSON to C code, which pretty quickly hands it off to libcomposefs (also in C) - and I don't think the current libcomposefs C library is written in a "tries hard to be resistant to malicious input" style.

And this JSON may contain malicious input.

(I see that composefs-from-json does some sandboxing on its own. But that's kind of weird because it's designed to be run from a container runtime...why can't we just have the caller do sandboxing? Among other things, the calling container runtime is in a better position to e.g. set up a full read-only mount namespace, only allowing writing to the target place)

But more generally, we should probably try to better define the security boundaries here. In general I think the chromium "Rule of 2" is a good idea. OK, in this case we are at least using the Go JSON parser first, then passing that pre-validated buffer to C, but still.

@giuseppe
Copy link
Member Author

the xattrs are encoded in base64 so we can use a string: https://github.com/containers/storage/blob/main/pkg/chunked/compressor/compressor.go#L378

and composefs-from-json handles that: https://github.com/containers/composefs/blob/main/tools/composefs-from-json.c#L347

@giuseppe giuseppe force-pushed the chunked-libcomposefs branch 2 times, most recently from ad4c0d5 to 5fb8a30 Compare June 19, 2023 20:37
@hsiangkao
Copy link
Contributor

hsiangkao commented Jun 20, 2023

do we really need to care about atime/ctime? I am doubtful.

atime and ctime are much of filesystem internal state and volatile. I'm not sure if they are useful (ctime was once considered for EROFS but ctime could be changed due to inode metadata changes, so I and Android guys decided to shift it to mtime finally which also keeps in sync with tarball. CRFS[1] also only keeps mtime). Apart from these, we also have crtime.

Anyway, if there could be further use cases, some xattrs could be used to keep that instead but I think use cases of ctime, atime are quite limited.

[1] https://github.com/google/crfs/blob/master/stargz/stargz.go#L127

@giuseppe giuseppe force-pushed the chunked-libcomposefs branch 5 times, most recently from 174a478 to 3ca6fec Compare June 20, 2023 13:19
@rhatdan
Copy link
Member

rhatdan commented Jun 21, 2023

@alexlarsson @cgwalters @nalind @mtrmac PTAL

drivers/overlay/overlay.go Outdated Show resolved Hide resolved
@nalind
Copy link
Member

nalind commented Jun 21, 2023

OK, yes I see this JSON format was introduced in a0ff89e

A few things here. I think a few choices in this JSON format are suboptimal:

* using `string` for symlink targets requires them to be valid Unicode; I would bet money that there are container images with non-Unicode filenames.

Strictly speaking, Go does not require string values to be valid UTF-8.

@cgwalters
Copy link
Contributor

cgwalters commented Jul 3, 2023 via email

@giuseppe
Copy link
Member Author

giuseppe commented Jul 3, 2023

,
unix.O_WRONLY|unix.O_CREAT|unix.O_TRUNC|unix.O_EXCL, 0o644)
I need to pass the fd to the writer-json process, it cannot be O_CLOEXEC
Go will dup the fd when passing via the Command struct extra files

I've changed it to be O_CLOEXEC and use ExtraFiles

@giuseppe giuseppe force-pushed the chunked-libcomposefs branch 2 times, most recently from e96e7bd to dd1c18d Compare July 3, 2023 19:36
}

func mountErofsBlob(blobFile, mountPoint string) error {
loop, err := loopback.AttachLoopDevice(blobFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not setting the loopback device read-only. I think we should do that. (composefs does)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it should enable direct-io to avoid caching metadata twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, ideally it should pass "noacl" if the composefs header flag that says there are no acls in the file is set (this will improve performance quite a lot).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, noacl is in the mount options below, not the loop properties.

}
defer loop.Close()

return unix.Mount(loop.Name(), mountPoint, "erofs", unix.MS_RDONLY, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. here, we should have noacl option if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not even sure if these are supported, I cannot find anywhere in c/storage the posix_acl_access string. Until it is needed, I've hardcoded ro,noacl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if an OCI layer tar has the ACL xattr set, historically that would have been extracted to the lowerdir file, and when userspace looked at the file it would have seen the ACL, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I have no idea if ACLs work in the wild at all though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've dropped it for now, let's be safe and handle the optimization later. I'd need to add the check in pkg/chunked and store somewhere that the image has no ACLs.

}
if erofsMount != "" {
if needsIDMapping {
if err := idmap.CreateIDMappedMount(erofsMount, erofsMount, idmappedMountProcessPid); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In mount.composefs I do the idmap as part of the erofs mount, using the mount_setattr syscall. Maybe its hard to do that here though. It needs the new mount api.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I need some refactoring since we already have that code in pkg/idmap, for now I've left a TODO in the code


_, _, e1 := syscall.Syscall(unix.SYS_IOCTL, uintptr(fd), uintptr(unix.FS_IOC_ENABLE_VERITY), uintptr(unsafe.Pointer(&enableArg)))
if e1 != 0 {
return fmt.Errorf("failed to enable verity for %q: %w", description, e1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should silently accept EEXIST (fs-verity already enabled for the file).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, couldn't this happen if a layer file is hard-linked from another layer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this could happen with hard link deduplication

@alexlarsson
Copy link
Contributor

I added a few comments, but overall I think this looks good. The main issue I see right now is that it depends on the composefs-from-json helper, which is currenly not installed from composefs. We have to chose how to do this. Alternatives:

  1. Install and support if in composefs
  2. Import the code into mkcomposefs and use that
  3. Call libcomposefs from containers/storage, reimplementing the json stuff.

I don't know how painful 3 is, calling C is always icky. However, we do already parse the json in the go code, so not having to do that twice is a win.

The file descriptor was not closed before, thus leaking all the opened files.

Signed-off-by: Giuseppe Scrivano <[email protected]>
so that they can be stored by their digest

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Copy link
Member Author

giuseppe commented Jul 4, 2023

I've looked into the comments that do not require touching other packages, I can take care of other stuff later with more targetted PR. The composefs integration though works only with single layered images at the moment, since multiple data layers conflict with each other, so this still needs the "data only" feature in overlay and we can avoid creating the whiteout files as part of them

@alexlarsson
Copy link
Contributor

Data only overlayfs support has landed and will be in 6.5 btw.

@alexlarsson
Copy link
Contributor

Given this is not enabled by default currently i think it makes a lot to merge this as is.

This commit introduces support for ComposeFS using the EROFS
filesystem to mount the file system metadata.

The current implementation allows each layer to be mounted
individually.

Only images that are using the zstd:chunked and eStargz format can be
used in this way since the metadata is stored in the image itself.

In future support for arbitrary images can be added.

Signed-off-by: Giuseppe Scrivano <[email protected]>
at the moment it is a best-effort implementation to enable fs-verity
for the composefs blob as well as for the data files.

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Copy link
Member Author

giuseppe commented Jul 5, 2023

I don't know how painful 3 is, calling C is always icky. However, we do already parse the json in the go code, so not having to do that twice is a win.

I've a WIP branch with a CGo implementation: https://github.com/giuseppe/storage/tree/chunked-libcomposefs-Cgo so we could replace the current composefs-from-json dependency.

I'd need to do some sandboxing and ideally re-execute Podman.

I've dropped the noacl for now, let's take care of this optimization later. The code in overlay doesn't know about the TOC file and the zstd:chunked metadata, so we need to check this information in the chunked package and then just reuse from overlay.

Given this is not enabled by default currently i think it makes a lot to merge this as is.

I am trying to reduce the number of changes so we can finally merge this PoC and iterate again from here, and also don't want to get back from PTO straight to merge conflicts :-)

@rhatdan
Copy link
Member

rhatdan commented Jul 5, 2023

LGTM

@rhatdan rhatdan merged commit f8e3fcf into containers:main Jul 5, 2023
18 checks passed
@giuseppe
Copy link
Member Author

giuseppe commented Jul 5, 2023

opened a PR to handle the noacl mount flag: #1663

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants