A proposal for adding alpha support for seccomp to Kubernetes. Seccomp is a system call filtering facility in the Linux kernel which lets applications define limits on system calls they may make, and what should happen when system calls are made. Seccomp is used to reduce the attack surface available to applications.
Applications use seccomp to restrict the set of system calls they can make. Recently, container runtimes have begun adding features to allow the runtime to interact with seccomp on behalf of the application, which eliminates the need for applications to link against libseccomp directly. Adding support in the Kubernetes API for describing seccomp profiles will allow administrators greater control over the security of workloads running in Kubernetes.
Goals of this design:
- Describe how to reference seccomp profiles in containers that use them
This design should:
- build upon previous security context work
- be container-runtime agnostic
- allow use of custom profiles
- facilitate containerized applications that link directly to libseccomp
- As an administrator, I want to be able to grant access to a seccomp profile to a class of users
- As a user, I want to run an application with a seccomp profile similar to the default one provided by my container runtime
- As a user, I want to run an application which is already libseccomp-aware in a container, and for my application to manage interacting with seccomp unmediated by Kubernetes
- As a user, I want to be able to specify a custom seccomp profile and use it with my containers
Controlling access to seccomp profiles is a cluster administrator concern. It should be possible for an administrator to control which users have access to which profiles.
The pod security policy API extension governs the ability of users to make requests that affect pod and container security contexts. The proposed design should deal with required changes to control access to new functionality.
Many users will want to use images that make assumptions about running in the context of their chosen container runtime. Such images frequently assume that they are running under the container runtime's default seccomp settings. Therefore, it should be possible to express a seccomp profile similar to a container runtime's defaults.
As an example, all Docker Hub 'official' images are compatible with the Docker default seccomp profile, so any user who wants to run one of these images with seccomp needs the default profile to be accessible.
Some applications already link to libseccomp and control seccomp directly. It should be possible to run these applications unmodified in Kubernetes; this implies there should be a way to disable seccomp control in Kubernetes for certain containers, or to run with a "no-op" or "unconfined" profile.
Sometimes, applications that link to seccomp can use the default profile for a container runtime, and restrict further on top of that. It is important to note here that in this case, applications can only place further restrictions on themselves. It is not possible to re-grant the ability of a process to make a system call once it has been removed with seccomp.
As an example, Elasticsearch manages its own seccomp filters in its code. Currently, Elasticsearch is capable of running in the context of the default Docker profile, but if in the future Elasticsearch needed to be able to call ioperm or iopl (both of which are disallowed in the default profile), it should be possible to run Elasticsearch by delegating the seccomp controls to the pod.
Different applications have different requirements for seccomp profiles; it should be possible to specify an arbitrary seccomp profile and use it in a container. This is more of a concern for applications which need a higher level of privilege than what is granted by the default profile for a cluster, since applications that want to restrict privileges further can always make additional calls in their own code.
An example of an application that requires the use of a syscall disallowed in the Docker default profile is Chrome, which needs clone to create a new user namespace. Another example would be a program which uses ptrace to implement a sandbox for user-provided code, such as eval.in.
Docker supports the Open Container Initiative's API for seccomp, which is very close to the libseccomp API. It allows full specification of seccomp filters, with arguments, operators, and actions.
Docker allows the specification of a single seccomp filter. There are community requests for the following, tracked as issues:
- docker/22109: composable seccomp filters
- docker/21105: custom seccomp filters for builds
The rkt runtime delegates to systemd for seccomp support; there is an open issue to add support once appc supports it. The appc project has an open issue to be able to describe seccomp as an isolator in an appc pod.
The systemd seccomp facility is based on a whitelist of system calls that can be made, rather than a full filter specification.
HyperContainer does not support seccomp.
FreeBSD has a seccomp/capability-like facility called Capsicum.
lxd constrains containers using a default profile.
Issues:
- lxd/1084: add knobs for seccomp
An earlier draft of this proposal described a new global API resource that could be used to describe seccomp profiles. After some discussion, it was determined that without a feedback signal from users indicating a need to describe new profiles in the Kubernetes API, it is not possible to know whether a new API resource is warranted.
That being the case, we will not propose a new API resource at this time. If there is strong community desire for such a resource, we may consider it in the future.
Instead of implementing a new API resource, we propose that pods be able to reference seccomp profiles by name. Since this is an alpha feature, we will use annotations instead of extending the API with new fields.
In the alpha version of this feature we will use annotations to store the names of seccomp profiles. The keys will be:
container.seccomp.security.alpha.kubernetes.io/<container name>
which will be used to set the seccomp profile of a container, and:
seccomp.security.alpha.kubernetes.io/pod
which will set the seccomp profile for the containers of an entire pod. If both a pod-level annotation and a container-level annotation are present for a container, the container-level profile takes precedence.
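The precedence rule can be illustrated with a minimal Go sketch. The annotation keys come from this proposal; the function name profileFor and the bare annotation map are hypothetical and only stand in for however a runtime integration would read pod metadata.

```go
package main

import "fmt"

const (
	// Annotation keys as named in this proposal.
	podSeccompKey             = "seccomp.security.alpha.kubernetes.io/pod"
	containerSeccompKeyPrefix = "container.seccomp.security.alpha.kubernetes.io/"
)

// profileFor (hypothetical helper) returns the profile value that applies to
// the named container and whether any annotation applied at all.
// A container-level annotation takes precedence over the pod-level one.
func profileFor(annotations map[string]string, containerName string) (string, bool) {
	if p, ok := annotations[containerSeccompKeyPrefix+containerName]; ok {
		return p, true
	}
	p, ok := annotations[podSeccompKey]
	return p, ok
}

func main() {
	annotations := map[string]string{
		podSeccompKey: "runtime/default",
		containerSeccompKeyPrefix + "explorer": "localhost/example-explorer-profile",
	}
	// "explorer" has its own annotation; "sidecar" falls back to the pod level.
	for _, name := range []string{"explorer", "sidecar"} {
		profile, ok := profileFor(annotations, name)
		fmt.Printf("%s: profile=%q set=%v\n", name, profile, ok)
	}
}
```

What happens when neither annotation is present is left to the caller here, since the proposal does not specify it.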
The value of these keys should be container-runtime agnostic. We will establish a format that expresses the conventions for distinguishing between an unconfined profile, the container runtime's default, or a custom profile. Since the format of a profile is likely to be runtime dependent, we will consider profiles to be opaque to Kubernetes for now.
The value format is scoped as follows:
- runtime/default: the default profile for the container runtime
- unconfined: the unconfined profile, i.e., no seccomp sandboxing
- localhost/<profile-name>: the profile installed to the node's local seccomp profile root
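As a rough sketch, a consumer of these annotations could classify a value into the three forms listed above as follows. The type and function names are hypothetical, and rejecting unrecognized values or an empty profile name is an assumption; the proposal does not say how invalid values are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// profileKind (hypothetical type) names the three value forms in the proposal.
type profileKind int

const (
	kindRuntimeDefault profileKind = iota // "runtime/default"
	kindUnconfined                        // "unconfined"
	kindLocalhost                         // "localhost/<profile-name>"
)

// parseProfile classifies an annotation value. For localhost profiles it also
// returns the profile name. Erroring on anything else is an assumption.
func parseProfile(value string) (profileKind, string, error) {
	switch {
	case value == "runtime/default":
		return kindRuntimeDefault, "", nil
	case value == "unconfined":
		return kindUnconfined, "", nil
	case strings.HasPrefix(value, "localhost/"):
		name := strings.TrimPrefix(value, "localhost/")
		if name == "" {
			return 0, "", fmt.Errorf("empty profile name in %q", value)
		}
		return kindLocalhost, name, nil
	default:
		return 0, "", fmt.Errorf("unrecognized seccomp profile %q", value)
	}
}

func main() {
	for _, v := range []string{"runtime/default", "unconfined", "localhost/example-explorer-profile", "bogus"} {
		kind, name, err := parseProfile(v)
		fmt.Printf("value=%q kind=%d name=%q err=%v\n", v, kind, name, err)
	}
}
```

The contents of the profile itself are never inspected, which matches the intent to keep profiles opaque to Kubernetes.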
Since seccomp profile schemes may vary between container runtimes, we will treat the contents of profiles as opaque for now and avoid attempting to find a common way to describe them. It is up to the container runtime to be sensitive to the annotations proposed here and to interpret instructions about local profiles.
A new area on disk (which we will call the seccomp profile root) must be established to hold seccomp profiles. A field will be added to the Kubelet for the seccomp profile root and a knob (--seccomp-profile-root) exposed to allow admins to set it. If unset, it should default to the seccomp subdirectory of the kubelet root directory.
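The defaulting rule for the profile root might look like the sketch below. Both helper names are hypothetical. Mapping a localhost profile name directly to a file path under the root is an assumption about how a runtime would locate the profile, and /var/lib/kubelet is used only as an example kubelet root directory.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// seccompProfileRoot (hypothetical helper) returns the value of
// --seccomp-profile-root if it was set, and otherwise the "seccomp"
// subdirectory of the kubelet root directory.
func seccompProfileRoot(flagValue, kubeletRootDir string) string {
	if flagValue != "" {
		return flagValue
	}
	return filepath.Join(kubeletRootDir, "seccomp")
}

// localProfilePath (hypothetical helper) maps the name from a
// "localhost/<profile-name>" value to a path under the profile root.
// Assumption: the profile name is used as a relative path as-is.
func localProfilePath(root, profileName string) string {
	return filepath.Join(root, filepath.FromSlash(profileName))
}

func main() {
	root := seccompProfileRoot("", "/var/lib/kubelet") // flag unset: use default
	fmt.Println(root)
	fmt.Println(localProfilePath(root, "example-explorer-profile"))
}
```

A real implementation would likely also need to reject profile names that escape the root (for example via ".." segments); that check is omitted here.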
The PodSecurityPolicy type should be annotated with the allowed seccomp profiles using the key seccomp.security.alpha.kubernetes.io/allowedProfileNames. The value of this key should be a comma-delimited list.
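An admission check against this annotation could be sketched as below. The annotation key is from this proposal; the helper names, the trimming of whitespace around list entries, exact string matching, and treating a missing annotation as "nothing allowed" are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key as named in this proposal.
const allowedProfilesKey = "seccomp.security.alpha.kubernetes.io/allowedProfileNames"

// allowedProfiles (hypothetical helper) splits the comma-delimited annotation
// value into profile names, trimming whitespace and dropping empty entries.
func allowedProfiles(pspAnnotations map[string]string) []string {
	var names []string
	for _, n := range strings.Split(pspAnnotations[allowedProfilesKey], ",") {
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	return names
}

// isAllowed (hypothetical helper) reports whether the requested profile value
// appears verbatim in the policy's allowed list.
func isAllowed(pspAnnotations map[string]string, profile string) bool {
	for _, n := range allowedProfiles(pspAnnotations) {
		if n == profile {
			return true
		}
	}
	return false
}

func main() {
	psp := map[string]string{
		allowedProfilesKey: "runtime/default, localhost/example-explorer-profile",
	}
	for _, p := range []string{"runtime/default", "unconfined"} {
		fmt.Printf("%s allowed: %v\n", p, isAllowed(psp, p))
	}
}
```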
Here's an example of a pod that uses the unconfined profile:
apiVersion: v1
kind: Pod
metadata:
  name: trustworthy-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: unconfined
spec:
  containers:
    - name: trustworthy-container
      image: sotrustworthy:latest
Here's an example of a pod that uses a profile called example-explorer-profile using the container-level annotation:
apiVersion: v1
kind: Pod
metadata:
  name: explorer
  annotations:
    container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
spec:
  containers:
    - name: explorer
      image: gcr.io/google_containers/explorer:1.0
      args: ["-port=8080"]
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: "/mount/test-volume"
          name: test-volume
  volumes:
    - name: test-volume
      emptyDir: {}