Bubblewrap not working in strata other than init #245

Open
ethan2-0 opened this issue Dec 26, 2021 · 5 comments

@ethan2-0

Bubblewrap errors when run in a stratum other than the stratum that provides init. Error message is at the bottom of the steps to reproduce. It's not clear to me why this is happening, though looking at the output of strace, it seems I'm getting EPERM on a clone syscall with flags=CLONE_NEWNS|CLONE_NEWUSER|SIGCHLD, which makes sense given the error message.

To reproduce:

In my case, my init strat is named debian, using Debian stable, and I've also created test-strat, also Debian stable. Both have bubblewrap installed.

$ brl which
debian
$ brl deref init
debian
$ bwrap --version
bubblewrap 0.4.1
$ brl version
Bedrock Linux 0.7.24 Poki
$ bwrap --dev-bind / / echo hi
hi
$ strat test-strat
$ bwrap --dev-bind / / echo hi
bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.
@cptpcrd
Contributor

cptpcrd commented Dec 26, 2021

Hmm, a quick look at the clone(2) man page shows this:

EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller's root directory does not match the root directory of the mount namespace in which it resides).

This appears to be a security feature, likely to prevent unprivileged users from using user namespaces to escape chroots. Since Bedrock uses chroots to run non-init strata, this prevents creating user namespaces outside of the init stratum.

This check also doesn't appear to have any exceptions. It might be possible to work around it by creating a new mount namespace when switching strata, but 1) I'm not 100% sure that would work and 2) if Bedrock did that by default, it would probably break things.
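
To see the failing call in isolation, here is a minimal Python sketch that attempts the same kind of operation bwrap needs, `unshare(CLONE_NEWUSER)`, in a throwaway child process and reports the resulting errno. The function names `try_new_user_namespace` and `explain` are made up for illustration, and the constant is copied from `<linux/sched.h>`. Note that EPERM alone does not prove the chroot check is responsible, since a kernel with unprivileged user namespaces disabled can return it as well.

```python
import ctypes
import errno
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def try_new_user_namespace():
    """Call unshare(CLONE_NEWUSER) in a forked child.

    Returns 0 on success, the errno value on failure, or None if the
    child did not exit normally. (Hypothetical helper, not part of Bedrock.)
    """
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        # Child: freshly forked and single-threaded, so the parent's
        # namespaces are never modified.
        code = 255
        try:
            rc = libc.unshare(CLONE_NEWUSER)
            code = 0 if rc == 0 else ctypes.get_errno()
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


def explain(code):
    """Turn the result of try_new_user_namespace() into a short message."""
    if code is None:
        return "child did not exit normally"
    if code == 0:
        return "user namespace created"
    if code == errno.EPERM:
        return ("EPERM: caller may be in a chroot environment, or "
                "unprivileged user namespaces may be disabled")
    return "failed: " + os.strerror(code)


if __name__ == "__main__":
    print(explain(try_new_user_namespace()))
```

Running this once from the init stratum and once after `strat test-strat` should, if the diagnosis above is right, show success in the first case and EPERM in the second.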

@paradigm
Member

paradigm commented Dec 27, 2021

I agree that the clone(2) EPERM item is likely the culprit. I also agree that per-stratum mount namespaces would likely fix this issue.

In the immediate future, work-arounds include:

  1. Manually running chmod u+s /path/to/bwrap as root. AFAIK bwrap is designed to be run as setuid in case the kernel has non-privileged user namespaces disabled. However, I certainly understand unnecessary setuid being undesirable.
  2. Pairing init with bwrap. This constraint is undesirable as well.

Bedrock 0.7.x relies on the common mount namespace pervasively. A ready example is brl which, which compares PID 1's mount table against another PID's to determine which stratum provides the second PID. I don't think a quick fix via a point update is viable.

The design of the upcoming 0.8.x is still somewhat fluid. I can try to incorporate per-stratum mount namespaces into it, although I can't make any promises. Off the top of my head it may introduce some design regressions:

  • strat currently only requires CAP_SYS_CHROOT. This will likely require strat be full blown setuid.
  • Maybe small strat performance hit? I think it would have to do at least one more system call.
  • This will effectively bump the Bedrock minimum kernel version to 3.8 for setns().
  • 0.7 requires a system-wide daemon ("crossfs") and a per-stratum daemon ("etcfs"). I was hoping 0.8 would drop that to just a single system-wide daemon ("brld") with at most one idle process/thread. It's not clear to me if this could still be met; it may require, at a minimum, a per-stratum thread to hold open / control the namespace. The daemon can just hold a file descriptor per namespace, no need for constant threads.
  • Creating new global directories (equivalent of bedrock.conf's share = lines) will probably require a reboot. AFAIK it's not possible to create new bind mounts across mount namespaces outside of a shared subtree mount.
  • When combined with some other planned features, this will likely end up increasing the total number of mount points.
  • I don't have any ideas for unprivileged brl which -p that don't require either setuid or querying some root process, neither of which is desirable. Unprivileged brl which -p support may be dropped. If we go with a per-stratum brld thread, both that thread and the PID in question will have the same publicly-readable mountinfo specific to the namespace. We could do something like compare awk '$5 == "/" {print$1;exit}' /proc/.../mountinfo output.
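
The mountinfo comparison in the last item can be sketched in Python as follows. This mirrors the awk one-liner: take the first mountinfo line whose fifth field (the mount point) is `/` and return its first field (the mount ID). The helper names are hypothetical, and the sketch assumes that two processes in the same per-stratum namespace report the same root mount ID while processes in different namespaces do not.

```python
def root_mount_id(mountinfo_text):
    """Return the mount ID of the first entry mounted at "/", or None.

    Equivalent in spirit to: awk '$5 == "/" {print $1; exit}'
    """
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "/":
            return fields[0]
    return None


def root_mount_id_of_pid(pid):
    """Read /proc/<pid>/mountinfo and extract its root mount ID."""
    with open("/proc/{}/mountinfo".format(pid)) as f:
        return root_mount_id(f.read())


def same_root_mount(pid_a, pid_b):
    """True if both PIDs report the same (non-missing) root mount ID."""
    id_a = root_mount_id_of_pid(pid_a)
    return id_a is not None and id_a == root_mount_id_of_pid(pid_b)


if __name__ == "__main__":
    import os
    print(root_mount_id_of_pid(os.getpid()))
```

A real `brl which -p` replacement would compare the target PID against each stratum's tracking thread, which is the part this sketch leaves out.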

While in principle having bwrap from any stratum just work is certainly desirable, it's not obvious to me if these trade-offs are worthwhile. It's also not obvious to me that it's not. I'll need to think about it.

@paradigm
Member

While not the main focus of this issue, I should point out that neither querying for the current shell then running a command like so:

$ brl which
debian
[...]
$ bwrap --dev-bind / / echo hi
hi

nor specifying a shell then running a command like so:

$ strat test-strat
$ bwrap --dev-bind / / echo hi
bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.

are guaranteed to get the command from the shell's stratum. Consider what happens in both those cases if bwrap is installed in a third stratum and not in either of those strata. Another example which may be easier to think about is:

$ strat debian
$ brl which
debian
$ grep "^NAME" /etc/os-release
NAME="Debian GNU/Linux"
$ pacman --help | head -n1
usage:  pacman <operation> [...]

Keep in mind that, despite the discussion around namespaces and chroot, Bedrock is not containers.

Rather, I recommend either querying specifically about the command in question (rather than the shell):

$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
debian
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat
$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
test-strat
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]

or just explicitly specifying which stratum's instance is desired:

$ strat debian bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]

@paradigm
Member

paradigm commented Jan 5, 2022

I've spent some time exploring the possibility of per-stratum mount namespaces. I think I've confirmed that this fixes the issue in a local hacky test. I also think I've found a way forward.

Enabling a stratum should:

  • Create a per-stratum mount namespace
  • Ensure the root directory is the root of the mount namespace
  • Ensure shared mounts are available within the namespace to share with the rest of the system.

In some hacky tests, I think I've confirmed we can do this by:

  • Ensuring all desired shared mounts are setup first.
    • Creating a new global directory outside /etc, which is implemented via shared mounts, will require restarting all strata. Since we can't restart the init stratum on-the-fly without crashing the system, this will effectively require a reboot.
    • New to Naga, /bedrock/strata will be one of these shared/global mounts by default.
  • Run any pre-stratum-enable hooks which may mount a stratum root at /bedrock/strata/<stratum-being-enabled>
    • This is a new-ish feature for Naga. The fact /bedrock/strata is shared makes me think this should work fine at this point.
  • Clone/unshare the new mount namespace.
  • Bind-mount /bedrock/strata/<stratum-being-enabled> to some <new-root> and ensure both the bind-mount and its parent mount are private mounts.
    • This is a requisite for pivot_root
  • mount --move all mounts of interest to <new-root>/<path>.
    • man mount says that with --move one cannot move a mount residing under a shared mount. However, in my testing with the above setup it does appear to work. If I confused a step and it doesn't actually work, we might be able to tower-of-hanoi things into place.
  • pivot_root <new-root> <old-root>
  • umount <old-root>
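
The steps above, from the bind-mount onward, can be written out as a dry-run plan. The Python sketch below only builds and prints the command lines; it does not execute anything, since the real sequence needs root and must run inside the freshly unshared mount namespace. The function name, the example paths (`/mnt/new-root`, `/mnt`, `old-root`), and the choice of `/proc` as a "mount of interest" are all assumptions for illustration, not the actual Naga implementation.

```python
import shlex


def plan_stratum_enable(stratum, new_root, parent_mount, moves, old_root_name):
    """Build the mount/pivot_root commands for one stratum as argv lists.

    Assumes the new mount namespace already exists (unshare(2)/clone(2))
    and that shared/global mounts were set up beforehand.
    """
    plan = [
        # Bind-mount the stratum root onto the future root.
        ["mount", "--bind", "/bedrock/strata/" + stratum, new_root],
        # pivot_root needs the new root and its parent mount to be private.
        ["mount", "--make-private", new_root],
        ["mount", "--make-private", parent_mount],
    ]
    # Relocate each mount of interest underneath the new root.
    for path in moves:
        plan.append(["mount", "--move", path, new_root + path])
    # pivot_root requires the old-root location to sit under the new root.
    plan.append(["pivot_root", new_root, new_root + "/" + old_root_name])
    # After the pivot, the old root is visible at /<old_root_name>.
    # In practice a lazy unmount may be needed if anything still uses it.
    plan.append(["umount", "/" + old_root_name])
    return plan


if __name__ == "__main__":
    steps = plan_stratum_enable(
        "test-strat", "/mnt/new-root", "/mnt", ["/proc"], "old-root")
    for argv in steps:
        print(shlex.join(argv))
```

Printing the plan first makes it easier to check the ordering constraints (private mounts before `pivot_root`, moves before the pivot) without risking the running system.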

We also need some system to track the mount namespaces, associate them with strata, and a way for strat to setns the correct namespace. I came up with three possibilities to pursue here:

  • A daemon manages a thread per stratum/namespace
    • strat could then open and setns the /proc/<daemon-pid>/task/<stratum-thread-tid>/ns/mnt paths.
    • brl enable/brl disable could manage symlinks in some global location to associate the /proc/.../mnt paths with stratum names.
    • This will require a thread per namespace. Some people may not like seeing lots of threads in process+thread lists, e.g. htop, even if they utilize zero CPU cycles and very little memory.
  • brl enable creates the namespace and bind-mounts its /proc/.../mnt file to save the namespace.
  • A daemon unshares the namespace, opens its /proc/.../mnt file, and tracks the file descriptor.
    • No need for per-stratum tracking threads or tracking mount points.
    • strat could then communicate with the daemon via a socket to get the file descriptor to setns. strat can open/setns straight from /proc/<daemon-pid>/fd/<fd>. The daemon can surface which of its file descriptors corresponds to which stratum mount namespace via a symlink to it through FUSE. FUSE can cache symlinks on the kernel side such that repeated rapid access is very fast.
    • I haven't actually confirmed the kernel will hold a namespace open so long as there is an open file descriptor, but it seems reasonable. I think I confirmed this works.
    • I haven't actually confirmed UNIX domain sockets can transfer mount namespace file descriptors, but it seems reasonable given the fact they can transfer other kinds of file descriptors. Not necessary if we read straight from /proc/<daemon-pid>/fd/<fd>
    • While I haven't actually benchmarked anything, I am worried the IPC overhead will result in a noticeable strat performance regression from Poki. IPC overhead concern was with sockets; reading /proc/<daemon-pid>/fd/<fd> resolves this.
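
The question of whether UNIX domain sockets can carry a mount namespace file descriptor can be checked without privileges. The Python sketch below opens the caller's own `/proc/self/ns/mnt`, sends the descriptor across a socket pair with `socket.send_fds` (Python 3.9+), and compares the device and inode numbers on both ends. The "daemon" and "strat" roles are played by the two ends of one socket pair in a single process, and the function names are hypothetical. The final `setns()` call is deliberately left out because it requires privileges.

```python
import os
import socket


def ns_identity(fd):
    """Identify a namespace by the (device, inode) pair of its nsfs file."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def pass_namespace_fd(ns_path="/proc/self/ns/mnt"):
    """Send a namespace fd over a UNIX socket pair.

    Returns (identity_of_sent_fd, identity_of_received_fd).
    """
    daemon_end, strat_end = socket.socketpair(socket.AF_UNIX,
                                              socket.SOCK_STREAM)
    ns_fd = os.open(ns_path, os.O_RDONLY)
    try:
        # "Daemon" side: ship the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(daemon_end, [b"mnt"], [ns_fd])
        # "strat" side: receive a new descriptor for the same open file.
        _, fds, _, _ = socket.recv_fds(strat_end, 16, 1)
        received = fds[0]
        try:
            # A privileged strat would call setns(received, CLONE_NEWNS) here.
            return ns_identity(ns_fd), ns_identity(received)
        finally:
            os.close(received)
    finally:
        os.close(ns_fd)
        daemon_end.close()
        strat_end.close()


if __name__ == "__main__":
    sent, got = pass_namespace_fd()
    print("same namespace:", sent == got)
```

Opening `/proc/<daemon-pid>/fd/<fd>` directly, as described above, avoids the socket round trip entirely; the socket variant is shown only because it was the open question.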

Sadly this retains most of the regressions I was worried about earlier; I couldn't find ways around them. However, as a bonus, it might improve the system shutdown experience. I think the kernel automatically handles unmounting mount points in namespaces with no processes/tracking-mounts/file-descriptors (https://unix.stackexchange.com/questions/212172/what-happens-if-the-last-process-in-a-namespace-exits). This will likely both slightly improve shutdown time and resolve this issue.

@paradigm
Member

paradigm commented Jan 7, 2022

After thinking about this even more, I think per-stratum namespaces would also help with:

The trade-off seems more and more in favor of per-stratum namespaces. I'm going to start planning a big refactor of the 0.8 efforts in this direction.
