rootless podman can't bind-mount allocdir #388

Open
optiz0r opened this issue Nov 10, 2024 · 5 comments

Comments

@optiz0r
Contributor

optiz0r commented Nov 10, 2024

Nomad considers filesystem permissions for the allocs directory to be outside of its own security model (https://developer.hashicorp.com/nomad/docs/concepts/security):

Access (read or write) to the Nomad data directory - Information about the allocations scheduled to a Nomad client is persisted to its data directory. This would include any secrets in any of the allocation's file systems.

To protect the secrets written into job allocation directories from unprivileged local users with access to the nomad client, restrictive permissions such as 0700 must be set on the allocs directory or one of its parents. The important part is that the "other" permission bits do not include execute (+x, octal 1), which is what allows directory traversal, since secrets are written into subdirectories with world-accessible permissions (nobody:nobody 0777).
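To make the traversal rule concrete, here is a minimal Python sketch that lists which ancestor directories would stop a user in the "other" class from reaching a path. The helper names `other_can_traverse` and `blocked_dirs` are made up for this illustration. It only looks at plain mode bits, so it ignores POSIX ACLs and the fact that root bypasses these checks.

```python
import stat
from pathlib import Path
from typing import List


def other_can_traverse(mode: int) -> bool:
    """True if the 'other' class has the execute (search) bit set."""
    return bool(mode & stat.S_IXOTH)


def blocked_dirs(root: str, leaf: str) -> List[str]:
    """Return the directories from root down to leaf's parent that a user
    in the 'other' class cannot traverse. An empty list means the path to
    leaf is reachable as far as mode bits are concerned."""
    root_p = Path(root).resolve()
    leaf_p = Path(leaf).resolve()
    rel = leaf_p.relative_to(root_p)  # raises ValueError if leaf is not under root

    # Build root, root/a, root/a/b, ... and drop the leaf itself:
    # only the ancestors need the search bit in order to stat the leaf.
    ancestors = [root_p]
    for part in rel.parts[:-1]:
        ancestors.append(ancestors[-1] / part)

    return [str(p) for p in ancestors if not other_can_traverse(p.stat().st_mode)]
```

With the allocs directory at 0700, `blocked_dirs` would report that directory even when everything beneath it is 0777, which is the protection the whole setup relies on.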

This seems to be fundamentally incompatible with rootless containers, since the unprivileged user needs to traverse into the alloc dir in order to stat it for bind-mounting into the container. Restrictive permissions yield Driver Failure errors such as the following on container startup:

rpc error: code = Unknown desc = failed to start task, could not create container: cannot create container, status code: 500: {"cause":"permission denied","message":"statfs /data/nomad/server/alloc/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4/alloc: permission denied","response":500}

One of the benefits of rootless containers and multiple sockets would be stronger isolation between users on a host. However, multiple sockets require that every user who will run containers under nomad has access to the allocs directory, and therefore to all the secrets written there for all jobs run by all users. This is sadly a dealbreaker for us, since it would allow secrets to leak across user boundaries.

The only way I can think to work around this would be nomad setting more restrictive permissions on the alloc directory itself (i.e. the one named after the job uid), e.g. setting ownership to match the podman socket owner, and 0700 permissions. Nomad itself, when running as root, would be able to bypass the restrictive permissions. POSIX ACLs on supported filesystems would be another option. I'm not sure if this can be practically implemented in the task driver alone, or if it would need support in Nomad core. At the very least, some information would need to be collected about which filesystem user the directory would need to be made accessible to. Currently the multiple-socket implementation doesn't understand which user "owns" the configured socket.
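Roughly what I have in mind, as a Python sketch rather than anything that exists in Nomad or the driver today. The function name `restrict_alloc_dir_to_socket_owner` is hypothetical, and it assumes the owner of the socket file is the user the containers run as, which is exactly the piece of information the driver does not track at the moment. Changing ownership to another user needs root.

```python
import os


def restrict_alloc_dir_to_socket_owner(alloc_dir: str, socket_path: str) -> int:
    """Give alloc_dir to whoever owns socket_path and lock it down to 0700.

    Assumption: the socket's file owner identifies the rootless podman user.
    Returns the uid that now owns alloc_dir.
    """
    owner = os.stat(socket_path)
    # Match the directory's owner and group to the socket's.
    os.chown(alloc_dir, owner.st_uid, owner.st_gid)
    # Owner-only access: no read, write or traverse for group or other.
    os.chmod(alloc_dir, 0o700)
    return owner.st_uid
```

This only covers the single alloc directory. It says nothing about what happens if tasks in the same alloc use other drivers or users.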

Alternatively, could this task driver bind-mount the alloc dir into some alternate path accessible by only the podman socket owner (e.g. beneath /run/user/UID), to bypass the more restrictive permission on the parent allocs dir?
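As a sketch of that idea, the snippet below only plans the paths and commands and does not run anything, since the real mount would need root. The function name `plan_private_bind_mount` and the `nomad-alloc` subdirectory are invented for illustration, and the example source path is a placeholder. The one real command it relies on is `mount --bind`.

```python
import os
from typing import List, Tuple


def plan_private_bind_mount(
    alloc_dir: str, alloc_id: str, uid: int, runtime_root: str = "/run/user"
) -> Tuple[str, List[List[str]]]:
    """Return the per-user target path and the commands a privileged
    process could run to expose alloc_dir there.

    The 'nomad-alloc' directory name is a hypothetical choice.
    """
    target = os.path.join(runtime_root, str(uid), "nomad-alloc", alloc_id)
    commands = [
        ["mkdir", "-p", target],                  # create the mount point
        ["mount", "--bind", alloc_dir, target],   # expose the alloc dir there
    ]
    return target, commands


target, commands = plan_private_bind_mount("/data/nomad/alloc/example", "example", 1000)
for argv in commands:
    print(" ".join(argv))
```

A bind mount exposes the same inode, so the owner and mode of the alloc dir itself still apply at the new path. Only the traversal check on the restrictive parent is avoided.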

@tgross
Member

tgross commented Nov 11, 2024

The only way I can think to work around this would be nomad setting more restrictive permissions on the alloc directory itself (i.e. the one named after the job uid), e.g. setting ownership to match the podman socket owner, and 0700 permissions.
...
Alternatively, could this task driver bind-mount the alloc dir into some alternate path accessible by only the podman socket owner (e.g. beneath /run/user/UID), to bypass the more restrictive permission on the parent allocs dir?

In order for Nomad to match the podman socket owner, it would need to know there was a socket at all, which Nomad itself doesn't -- only the task driver has visibility into that kind of thing. So ultimately it would have to happen in the task driver. We have some precedent for an alternate mount configuration in the recent exec2 driver. That driver has a different filesystem capability flag, which causes Nomad to create the alloc directory in alloc_mounts so that the driver can bind-mount it into the appropriate location. There might be promise in making that available to image-based file isolation as well.

tgross changed the title from "Multiple sockets, nomad client filesystem permissions" to "rootless podman can't bind-mount allocdir" on Nov 11, 2024
tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage on Nov 11, 2024
@optiz0r
Contributor Author

optiz0r commented Nov 12, 2024

alloc_mounts might work here as long as the landlock/unveil permissions apply to all processes, not just those spawned by the Alloc. The alloc_mounts dir would have to be world-traversable (OK provided every alloc is landlocked), and then the actual alloc dir only accessible to the user for whom the alloc is running.

I'm not that familiar with landlock: do the access grants apply to all processes started by a single uid, or do they apply to a process (tree)? With the way the podman task driver works, with a rootful process reaching out to a socket to start the container, I'm not sure the latter will be viable. So even with alloc_mounts, Nomad core is probably still going to need to know which uid(s) should have access to the alloc_dir. That'll have to be communicated to Nomad somehow, either via additional parameters in the job spec (task.user has other side effects in the docker/podman drivers, and so cannot be reused for this), or communicated back from the task driver itself.

Other complications: allocs can contain tasks using different task drivers or be run using different user accounts. A documented limitation that all tasks using this must run under the same driver/user would be fine for me at least.

@tgross
Member

tgross commented Nov 12, 2024

Oh, to be clear, I wasn't suggesting that we use Landlock for the podman driver. Landlock only locks out the process it's being called from, so that doesn't really help. My point was just that having a separate source for the allocdir would allow for the following workflow:

  • Nomad creates the allocdir in the alloc_mounts dir.
  • The task driver bind-mounts that alloc_mounts dir into the standard allocdir location, setting permissions it knows about from the plugin config.
  • Then podman processes bind-mount from the standard allocdir location, using their own permissions.

Mind, this is all in my head and I haven't actually tried implementing any of it. 😁

@optiz0r
Contributor Author

optiz0r commented Nov 13, 2024

Just that having a separate source for the allocdir would allow for the following workflow:

  • Nomad creates the allocdir in the alloc_mounts dir.
  • The task driver bind-mounts that alloc_mounts dir into the standard allocdir location, setting permissions it knows about from the plugin config.
  • Then podman processes bind-mount from the standard allocdir location, using their own permissions.

I'm not sure that works. If landlock is not being used, then the alloc_mounts dir needs to be just as protected as the normal nomad allocs dir, i.e. non-root should not be able to traverse through it. In which case neither is suitable for the rootless container. There can't be a single alternate allocs dir across all users, unless there's some external protection of some kind.

This task driver could bind-mount the alloc dir into a user-private location such as /run/user/UID. The driver does not currently understand unix identities for setting directory permissions, but could be extended to do so.

@tgross
Member

tgross commented Nov 13, 2024

I'm not sure that works.
(snip)

Bah, yeah, you're right... this sort of thing has been one of the big barriers to a rootless Nomad client.

This task driver could bind-mount the alloc dir into a user-private location such as /run/user/UID. The driver does not currently understand unix identities for setting directory permissions, but could be extended to do so.

Sounds like the way to go.

Projects
Status: Needs Roadmapping