Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mount: add support for optionally skipping loopback device #361

Open
allisonkarlitskaya opened this issue Sep 26, 2024 · 13 comments
Open

mount: add support for optionally skipping loopback device #361

allisonkarlitskaya opened this issue Sep 26, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@allisonkarlitskaya
Copy link
Collaborator

This is cool: https://lore.kernel.org/lkml/CAMuHMdVqa2Mjqtqv0q=uuhBY1EfTaa+X6WkG7E2tEnKXJbTkNg@mail.gmail.com/T/

@allisonkarlitskaya
Copy link
Collaborator Author

The complexity of the mount code is growing a lot:

  • maybe we opened the fd by filename or got it from the user
  • maybe we get an extra fd because of the memfd copy
  • maybe we get an extra fd because of loopback devices (or in the future, maybe not)

So this storing the fd in the state thing is probably not gonna survive... There's just too many fds now...

The thing about opening vs. getting an fd from the user is pretty boring and is well handled in the two separate entrypoint APIs for each of those cases, but then we need to go on a path:

  • replace that fd with a memfd, if appropriate
  • replace that fd with a loop device if appropriate
  • mount that

plus all the cleanups of optionally closing fds if we were the ones that opened it. I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

Basically, the handling of trying to make sure we rmdir() and close() stuff as appropriate is getting a bit too complicated... particularly when mixed with various early error exits...

@alexlarsson
Copy link
Collaborator

This is definitely cool, and will help tons to avoid loads of loopback devices e.g. in the container image case. I agree that the mount code complexity is growing, in particular since we need to support legacy mount APIs, etc. But, this need only be included in the modern API mount codepath, and I believe the complexity is worth it.

@allisonkarlitskaya
Copy link
Collaborator Author

@alexlarsson what's the story behind needing to support the legacy mount APIs?

@alexlarsson
Copy link
Collaborator

@allisonkarlitskaya Its needed on e.g. rhel9 at least I belive.

@cgwalters cgwalters added the enhancement New feature or request label Sep 30, 2024
@cgwalters
Copy link
Contributor

As I understand things today, if we go with the current design in #332 then we don't need the memfd path, right?

Skipping the loopback though would be quite valuable down the line, but is IMO not critical right now again versus just fleshing out the status quo.

I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

That seems fine to me if it simplifies the code though.

@allisonkarlitskaya
Copy link
Collaborator Author

Ya, I feel like the memfd thing was an interesting idea that got us on the right path but maybe not all that useful in the end...

@allisonkarlitskaya
Copy link
Collaborator Author

ps: skipping the loopback device still doesn't let us mount in containers, unfortunately. fsconfig(3, FSCONFIG_CMD_CREATE, ...) fails with EPERM because even though we're root in our own mount namespace we're not the "real root". That's down to the call to mount_capable() in vfs_cmd_create() in fsopen.c which checks that we're either "real root" or that our chosen filesystem has FS_USERNS_MOUNT (which few do, and erofs is not among them). Incidentally, overlayfs is (which makes sense since containers depend on this)...

@allisonkarlitskaya
Copy link
Collaborator Author

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel. I wonder if we could convince erofs (and vfs) folks to trust filesystem images that come from non-mutable sources, eg:

  • sealed memfd; or
  • fs-verity enabled source

@cgwalters
Copy link
Contributor

ps: skipping the loopback device still doesn't let us mount in containers,

This subthread is I think covered by containers/storage#2099 (also xref https://github.com/cgwalters/composefs-oci-experimental?tab=readme-ov-file#mounting-composefs-as-non-root )

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel.

That's I think nowadays covered by https://lwn.net/Articles/941764/ - but no the real problem remains (mostly for more complex filesystems) that offline crafted input can result in memory corruption - the XFS maintainer mentioned things like having file blocks point to metadata as one that would start to be quite expensive to check. That's mainly a concern for writable filesystems - erofs is regularly fuzzed and is I think fairly robust but...

Anyways I think the "composefs as a service" is the best fix here.

@alexlarsson
Copy link
Collaborator

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

@hsiangkao
Copy link
Contributor

hsiangkao commented Oct 2, 2024

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

As I've said in containers/storage#2099 (comment), personally it won't be hard to enable EROFS with FS_USERNS_MOUNT (e.g. disable compression).
From the centos 7 backport, currently the core part of EROFS (un-encoded metadata/data path and xattr support) is 2150 LOC (https://github.com/erofs/kmod-erofs/tree/main/src). So on the simplicity side, I think un-encoded EROFS is simple enough.
Yet I don't think VFS maintainer will accept this (regardless of fuzzing, loc or others), also they provide some alternative ways to achieve that too.

@j1nx
Copy link

j1nx commented Oct 6, 2024

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help?
https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

@giuseppe
Copy link
Member

giuseppe commented Oct 7, 2024

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help? https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

I think that won't help with rootless idmapped mounts. We either need FUSE passthrough, or overlay in a user namespace needs to honor the .redirect extended attribute

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants