mount: add support for optionally skipping loopback device #361

allisonkarlitskaya · 2024-09-26T08:08:30Z

This is cool: https://lore.kernel.org/lkml/CAMuHMdVqa2Mjqtqv0q=uuhBY1EfTaa+X6WkG7E2tEnKXJbTkNg@mail.gmail.com/T/

allisonkarlitskaya · 2024-09-26T08:13:13Z

The complexity of the mount code is growing a lot:

maybe we opened the fd by filename or got it from the user
maybe we get an extra fd because of the memfd copy
maybe we get an extra fd because of loopback devices (or in the future, maybe not)

So this storing the fd in the state thing is probably not gonna survive... There's just too many fds now...

The thing about opening vs. getting an fd from the user is pretty boring and is well handled in the two separate entrypoint APIs for each of those cases, but then we need to go on a path:

replace that fd with a memfd, if appropriate
replace that fd with a loop device if appropriate
mount that

plus all the cleanups of optionally closing fds if we were the ones that opened it. I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

Basically, the handling of trying to make sure we rmdir() and close() stuff as appropriate is getting a bit too complicated... particularly when mixed with various early error exits...

alexlarsson · 2024-09-26T14:00:19Z

This is definitely cool, and will help tons to avoid loads of loopback devices e.g. in the container image case. I agree that the mount code complexity is growing, in particular since we need to support legacy mount APIs, etc. But, this need only be included in the modern API mount codepath, and I believe the complexity is worth it.

allisonkarlitskaya · 2024-09-27T11:51:31Z

@alexlarsson what's the story behind needing to support the legacy mount APIs?

alexlarsson · 2024-09-30T09:49:46Z

@allisonkarlitskaya Its needed on e.g. rhel9 at least I belive.

cgwalters · 2024-10-01T18:01:31Z

As I understand things today, if we go with the current design in #332 then we don't need the memfd path, right?

Skipping the loopback though would be quite valuable down the line, but is IMO not critical right now again versus just fleshing out the status quo.

I start to think that maybe we could dup() the user's fd just to simply things there (ie: we always close it) and close() each fd from a helper before we return the "next layer" one.

That seems fine to me if it simplifies the code though.

allisonkarlitskaya · 2024-10-01T18:14:57Z

Ya, I feel like the memfd thing was an interesting idea that got us on the right path but maybe not all that useful in the end...

allisonkarlitskaya · 2024-10-01T18:20:21Z

ps: skipping the loopback device still doesn't let us mount in containers, unfortunately. fsconfig(3, FSCONFIG_CMD_CREATE, ...) fails with EPERM because even though we're root in our own mount namespace we're not the "real root". That's down to the call to mount_capable() in vfs_cmd_create() in fsopen.c which checks that we're either "real root" or that our chosen filesystem has FS_USERNS_MOUNT (which few do, and erofs is not among them). Incidentally, overlayfs is (which makes sense since containers depend on this)...

allisonkarlitskaya · 2024-10-01T18:27:24Z

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel. I wonder if we could convince erofs (and vfs) folks to trust filesystem images that come from non-mutable sources, eg:

sealed memfd; or
fs-verity enabled source

cgwalters · 2024-10-01T18:48:33Z

ps: skipping the loopback device still doesn't let us mount in containers,

This subthread is I think covered by containers/storage#2099 (also xref https://github.com/cgwalters/composefs-oci-experimental?tab=readme-ov-file#mounting-composefs-as-non-root )

A lot of the caution in this "kernel shouldn't mount filesystems from untrusted sources" story is from the possibility that the user can modify the filesystem under the kernel.

That's I think nowadays covered by https://lwn.net/Articles/941764/ - but no the real problem remains (mostly for more complex filesystems) that offline crafted input can result in memory corruption - the XFS maintainer mentioned things like having file blocks point to metadata as one that would start to be quite expensive to check. That's mainly a concern for writable filesystems - erofs is regularly fuzzed and is I think fairly robust but...

Anyways I think the "composefs as a service" is the best fix here.

alexlarsson · 2024-10-02T10:22:23Z

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

hsiangkao · 2024-10-02T12:30:06Z

Yeah, EROFS is simpler than read-write OSes, but its not dead simple. The original in-kernel composefs was even simpler, and I believe something like that could eventually get fuzzed to be trustworthy for untrusted data. However, that ship has sailed, and I'm pretty sure general untrusted EROFS mounts are not gonna happen.

I wonder if we could handle unprivileged composefs mounts by just extracting the metadata from the erofs into a tmpfs. Maybe we can make that pretty fast by using io_uring to parallelize all the syscalls to create the files.

As I've said in containers/storage#2099 (comment), personally it won't be hard to enable EROFS with FS_USERNS_MOUNT (e.g. disable compression).
From the centos 7 backport, currently the core part of EROFS (un-encoded metadata/data path and xattr support) is 2150 LOC (https://github.com/erofs/kmod-erofs/tree/main/src). So on the simplicity side, I think un-encoded EROFS is simple enough.
Yet I don't think VFS maintainer will accept this (regardless of fuzzing, loc or others), also they provide some alternative ways to achieve that too.

j1nx · 2024-10-06T10:49:19Z

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help?
https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

giuseppe · 2024-10-07T14:21:11Z

Apologies for chiming in on this without to much know how on the matter, however more than interested to see rootless composefs being implemented for podman/storage.

Does this help? https://www.phoronix.com/news/FUSE-IDMAPPED-Mounts-6.12

I think that won't help with rootless idmapped mounts. We either need FUSE passthrough, or overlay in a user namespace needs to honor the .redirect extended attribute

cgwalters added the enhancement New feature or request label Sep 30, 2024

allisonkarlitskaya mentioned this issue Oct 11, 2024

better loopback handling (hiding it) #144

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mount: add support for optionally skipping loopback device #361

mount: add support for optionally skipping loopback device #361

allisonkarlitskaya commented Sep 26, 2024

allisonkarlitskaya commented Sep 26, 2024

alexlarsson commented Sep 26, 2024

allisonkarlitskaya commented Sep 27, 2024

alexlarsson commented Sep 30, 2024

cgwalters commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

cgwalters commented Oct 1, 2024

alexlarsson commented Oct 2, 2024

hsiangkao commented Oct 2, 2024 •

edited

Loading

j1nx commented Oct 6, 2024

giuseppe commented Oct 7, 2024

mount: add support for optionally skipping loopback device #361

mount: add support for optionally skipping loopback device #361

Comments

allisonkarlitskaya commented Sep 26, 2024

allisonkarlitskaya commented Sep 26, 2024

alexlarsson commented Sep 26, 2024

allisonkarlitskaya commented Sep 27, 2024

alexlarsson commented Sep 30, 2024

cgwalters commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

allisonkarlitskaya commented Oct 1, 2024

cgwalters commented Oct 1, 2024

alexlarsson commented Oct 2, 2024

hsiangkao commented Oct 2, 2024 • edited Loading

j1nx commented Oct 6, 2024

giuseppe commented Oct 7, 2024

hsiangkao commented Oct 2, 2024 •

edited

Loading