-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seccomp: initial small additions to the rules #2040
Conversation
These have been replaced by the rt_sigaction family, and have not been compiled in on most kernels since linux v3.9 (2013) Signed-off-by: Dominique Martinet <[email protected]>
timerfd as a syscall does not seem to have ever existed, remove it from allowed syscalls list. Signed-off-by: Dominique Martinet <[email protected]>
The following syscalls have been added in recent kernels and considered for this list: - cachestat, prints information about cache misses; it is less accurate than userfaultfd so probably safe but deny it until a clear need shows up - io_pgetevents_time64: io_pgetevents is already blocked, so block this variant as well. Note these are pretty close to io_getenvents, so we should probably block that as well, but since it is currently allowed keep that where it is. - map_shadow_stack: this allows creating a new shadow stack, required for user-space threading if shadow stack verification is enabled (prctl PR_SET_SHADOW_STACK_STATUS with PR_SHADOW_STACK_ENABLE); this might be required in the future but delay this decision until someone requests it - futex_* new interface is primarily intended for io_uring which we disallow, and does not have any known user yet so likewise block until someone requests it. - quotactl_fd: this is identical to quotactl, so only allow for SYS_ADMIN like quotactl. Signed-off-by: Dominique Martinet <[email protected]>
swapcontext seems to be used for coroutines in some languages (at least ruby), enough to have been added to other major engines by an actual user. Link: moby/moby#43092 Link: systemd/systemd#9487 Link: containerd/containerd#6411 Signed-off-by: Dominique Martinet <[email protected]>
apparently harmless and used Link: systemd/systemd#25018 Link: containerd/containerd#6882 Link: moby/moby#43553 Signed-off-by: Dominique Martinet <[email protected]>
@giuseppe PTAL |
pkg/seccomp/default_linux.go
Outdated
@@ -293,7 +294,6 @@ func DefaultProfile() *Seccomp { | |||
"pidfd_send_signal", | |||
"pipe", | |||
"pipe2", | |||
"pivot_root", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't pivot_root used by container runtimes so this breaks the nested container case I guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we need this otherwise we break nested containers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, sorry I should have tested this, I was somehow convinced it wasn't used. I've removed that commit.
pkg/seccomp/default_linux.go
Outdated
Names: []string{ | ||
"reboot", | ||
}, | ||
Action: ActAllow, | ||
Args: []*Arg{}, | ||
Includes: Filter{ | ||
Caps: []string{"CAP_SYS_BOOT"}, | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this additional check in the profile since it is already in place in the kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point of these is to stack layers so if the kernel gets a check wrong we had this safety, but I just had a look at the reboot syscall definition and checking the capability is literally the first thing done so there isn't much room for mistake here;
ok, removing this.
"io_pgetevents", | ||
"io_pgetevents_time64", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets block io_getenvents,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we block io_getevents too I think programs using libaio will stop working; this is just listing a syscall that was already blocked.
I agree the hardened profile can definitely block the io_{setup,destroy,cancel,getevents,submit} family of syscalls like we have io_uring blocked, but I'm not sure we should block them here.
Perhaps for podman 6.0 defaults?
Great work @martinetd Thanks for this effort |
/approve |
the readdir syscall hasn't existed forever (wasn't present in linux 2.6.12 initial import into git), remove it and don't even bother adding it to the list of EPERM syscalls Signed-off-by: Dominique Martinet <[email protected]>
This does not deny anything new (bpf is currently allowed for sys admin) Signed-off-by: Dominique Martinet <[email protected]>
This doesn't deny anything new (perf_event_open is currently allowed for SYS_ADMIN) Signed-off-by: Dominique Martinet <[email protected]>
syscall() emulates all other syscalls, so having this allowed makes no sense as far as seccomp filters go. This is a breaking change, but this probably will not break much. Signed-off-by: Dominique Martinet <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews, I've replied/updated as pointed out.
I'll do some more tests (including podman in podman!) with this later today, but I think it should be good to go.
pkg/seccomp/default_linux.go
Outdated
@@ -293,7 +294,6 @@ func DefaultProfile() *Seccomp { | |||
"pidfd_send_signal", | |||
"pipe", | |||
"pipe2", | |||
"pivot_root", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, sorry I should have tested this, I was somehow convinced it wasn't used. I've removed that commit.
pkg/seccomp/default_linux.go
Outdated
Names: []string{ | ||
"reboot", | ||
}, | ||
Action: ActAllow, | ||
Args: []*Arg{}, | ||
Includes: Filter{ | ||
Caps: []string{"CAP_SYS_BOOT"}, | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point of these is to stack layers so if the kernel gets a check wrong we had this safety, but I just had a look at the reboot syscall definition and checking the capability is literally the first thing done so there isn't much room for mistake here;
ok, removing this.
"io_pgetevents", | ||
"io_pgetevents_time64", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we block io_getevents too I think programs using libaio will stop working; this is just listing a syscall that was already blocked.
I agree the hardened profile can definitely block the io_{setup,destroy,cancel,getevents,submit} family of syscalls like we have io_uring blocked, but I'm not sure we should block them here.
Perhaps for podman 6.0 defaults?
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe, martinetd, rhatdan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is the first part of the seccomp policy work described in #1988: the small things that probably won't be too controversial -- sorry for the delay.
I have finished the rest all the way to unshare/mount blocked as per docker policies in my seccomp-hardened branch, this one has a few more restrictions that are in place on docker so probably would not cause much trouble but still are more restrictive, so left out for now (that branch just modified the same filter, but it'll be a separate file when ready to open as a PR; I was just using it to test ; I've noticed a few more syscalls I'll want to look at closer and maybe block so there's a bit of work left first)
Each commit is small and self-contained, so please refer to commit messages for individual changes -- most of the new blockages are obsolete or never existed from what I can tell; the exceptions that might be worth discussing are:
syscall(__NR_foo, ..)
will still work, it blockssyscall(__NR_syscall, __NR_foo, ..)
)The rest should be strictly more permissive than currently unless I missed something.