This project is a fork of the Firecracker Jailer, adapted to isolate Cloud-Hypervisor. This documentation comes from the original Firecracker Jailer documentation, with some modifications. You can see the original Firecracker Jailer documentation here.
The jailer is a program designed to isolate the Cloud-Hypervisor process in order to enhance Cloud-Hypervisor's security posture. It is meant to address the security needs of Cloud-Hypervisor only and is not intended to work with other binaries. Additionally, each jailer binary should be used with a statically linked Cloud-Hypervisor binary (with the default musl toolchain) of the same version. Experimental gnu builds are not supported.
The jailer is invoked in this manner:
jailer --id <id> \
--exec-file <exec_file> \
--uid <uid> \
--gid <gid> \
[--parent-cgroup <relative_path>]
[--cgroup <cgroup>]
[--chroot-base-dir <chroot_base>]
[--netns <netns>]
[--resource-limit <resource=value>]
[--daemonize]
[--new-pid-ns]
--
[...extra arguments to pass to Cloud-Hypervisor]
- id is the unique VM identification string, which may contain alphanumeric characters and hyphens. The maximum id length is currently 64 characters.
- exec_file is the path to the Cloud-Hypervisor binary that will be exec-ed by the jailer. The filename must include the string cloud-hypervisor. This is enforced because the interaction with the jailer is Cloud-Hypervisor specific.
- uid and gid are the uid and gid the jailer switches to as it execs the target binary.
- parent-cgroup is used to allow the placement of microvm cgroups in custom nested hierarchies. By specifying this parameter, the jailer will create a new cgroup named id for the microvm in the <cgroup_base>/<parent_cgroup> subfolder. cgroup_base is the unified controller hierarchy for cgroup v2 (e.g. /sys/fs/cgroup/unified). <parent_cgroup> is a relative path within that hierarchy. For example, if --parent-cgroup all_uvms/external_uvms is specified, the jailer will write all cgroup parameters specified through --cgroup in /sys/fs/cgroup/<controller_name>/all_uvms/external_uvms/<id>. By default, the parent cgroup is named after the exec file (<exec_file_name>). If there are no --cgroup parameters specified then the jailer will move the process to the specified cgroup.
- cgroup: cgroup values can be passed to the jailer, which sets them when the microVM process is spawned. The --cgroup argument must follow this format: <cgroup_file>=<value> (e.g. cpuset.cpus=0). This argument can be used multiple times to set multiple cgroups. This is useful to avoid providing privileged permissions to another process for setting the cgroups before or after the jailer is executed. The --cgroup flag also helps to set the Cloud-Hypervisor process cgroups before the VM starts running, with no need to create the entire cgroup hierarchy manually (which requires privileged permissions).
- chroot_base represents the base folder where chroot jails are built. The default is /srv/jailer.
- netns represents the path to a network namespace handle. If present, the jailer will use this to join the associated network namespace.
- For extra security and control over resource usage, resource-limit can be used to set bounds on the process resources. The --resource-limit argument must follow this format: <resource>=<value> (e.g. no-file=1024) and can be used multiple times to set multiple bounds. The resources that can currently be limited using this argument are:
  - fsize: The maximum size in bytes for files created by the process.
  - no-file: Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
  Here is an example of how to set multiple resource limits using this argument:
  --resource-limit fsize=250000000 --resource-limit no-file=1024
- When present, the --daemonize flag causes the jailer to call setsid() and redirect all three standard I/O file descriptors to /dev/null.
- When present, the --new-pid-ns flag causes the jailer to spawn the provided binary into a new PID namespace. It makes use of the libc clone() function with the CLONE_NEWPID flag. As a result, the jailer and the process running the exec file have different PIDs. The PID of the child process is stored in the jail root directory inside <exec_file_name>.pid.
- The jailer adheres to the "end of command options" convention, meaning all parameters specified after -- are forwarded to Cloud-Hypervisor. For example, this can be paired with the --kernel Cloud-Hypervisor argument to specify the kernel or firmware to use when starting Cloud-Hypervisor via the jailer (the file path and the resources referenced within must be valid relative to a jailed Cloud-Hypervisor).
After starting, the Jailer goes through the following operations:
- Validate all provided paths and the VM id.
- Close all open file descriptors based on /proc/<jailer-pid>/fd except input, output and error.
- Clean up all environment variables received from the parent process.
- Create the <chroot_base>/<exec_file_name>/<id>/root folder, which will be henceforth referred to as chroot_dir. exec_file_name is the last path component of exec_file (for example, that would be cloud-hypervisor for /usr/bin/cloud-hypervisor). Nothing is done if the path already exists (it should not, since id is supposed to be unique).
- Copy exec_file to <chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>. This ensures the new process will not share memory with any other Cloud-Hypervisor process.
- Set resource bounds for the current process and its children through the --resource-limit argument, by calling the setrlimit() system call with the specific resource argument. If no limits are provided, the jailer bounds no-file to a maximum default value of 2048.
- Create the cgroup sub-folders. The jailer can only use cgroup v2. On most systems, this is mounted by default in /sys/fs/cgroup (should be mounted by the user otherwise). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<parent_cgroup>/<id> subfolder, and writes the current pid to <cgroup_base>/<parent_cgroup>/<id>/tasks. Also, the value passed for each <cgroup_file> is written to the file. If --node is used the corresponding values are written to the appropriate cpuset.mems and cpuset.cpus files.
- Call unshare() into a new mount namespace, use pivot_root() to switch the old system root mount point with a new one based in chroot_dir, switch the current working directory to the new root, unmount the old root mount point, and call chroot into the current directory.
- Use mknod to create a /dev/net/tun equivalent inside the jail.
- Use mknod to create a /dev/kvm equivalent inside the jail.
- Use chown to change ownership of the chroot_dir (root path / as seen by the jailed cloud-hypervisor), /dev/net/tun, /dev/kvm. The ownership is changed to the provided uid:gid.
- If --netns <netns> is present, attempt to join the specified network namespace.
- If --daemonize is specified, call setsid() and redirect STDIN, STDOUT, and STDERR to /dev/null.
- If --new-pid-ns is specified, call clone() with the CLONE_NEWPID flag to spawn a new process within a new PID namespace. The new process will assume the role of init(1) in the new namespace. The parent will store the child's PID inside <exec_file_name>.pid, while the child drops privileges and exec()s into the <exec_file_name>, as described below.
- Drop privileges by setting the provided uid and gid.
- Exec into <exec_file_name> and forward any extra arguments provided to the jailer after --, as mentioned in the Jailer Usage section.
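To illustrate the resource bound step from the list above, here is a rough Python sketch that maps the two documented resource names to setrlimit() constants. The function names are hypothetical, and applying the 2048 default whenever no-file is not given explicitly is an assumption about how the default interacts with other limits.

```python
import resource

# Mapping from the documented resource names to setrlimit() constants.
RESOURCES = {
    "fsize": resource.RLIMIT_FSIZE,
    "no-file": resource.RLIMIT_NOFILE,
}
DEFAULT_NO_FILE = 2048  # documented default when no limits are provided


def plan_limits(args: list[str]) -> dict[int, int]:
    """Turn repeated '<resource>=<value>' strings into {rlimit_constant: value}."""
    limits = {}
    for arg in args:
        name, _, value = arg.partition("=")
        if name not in RESOURCES:
            raise ValueError(f"unsupported resource: {name!r}")
        limits[RESOURCES[name]] = int(value)
    # Assumption: the default applies whenever no-file was not given explicitly.
    limits.setdefault(resource.RLIMIT_NOFILE, DEFAULT_NO_FILE)
    return limits


def apply_limits(limits: dict[int, int]) -> None:
    """Set soft and hard limits for this process; children inherit them.

    Raising a hard limit above its current value requires privileges.
    """
    for rlimit, value in limits.items():
        resource.setrlimit(rlimit, (value, value))
```

Separating planning from applying keeps the parsing testable without changing the limits of the calling process.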
Let’s assume Cloud-Hypervisor is available as /usr/bin/cloud-hypervisor, and the jailer can be found at /usr/bin/jailer. We pick the unique id e22ohyz7v9h7pwsxhxz3skfj, and we choose to run on NUMA node 0 (in order to isolate the process in the 0th NUMA node we need to set cpuset.mems=0 and cpuset.cpus equal to the CPUs of that NUMA node), using uid 123, and gid 100. For this example, we are content with the default /srv/jailer chroot base dir.
We start by running:
/usr/bin/jailer --id e22ohyz7v9h7pwsxhxz3skfj \
--cgroup cpuset.mems=0 --cgroup cpuset.cpus=$(cat /sys/devices/system/node/node0/cpulist) \
--exec-file /usr/bin/cloud-hypervisor --uid 123 --gid 100 \
--netns /var/run/netns/my_netns --daemonize \
-- --api-sock /ch-api.sock
After closing the inherited file descriptors as described in the previous section, the jailer will create the following resources (and all their prerequisites, such as the path which contains them):
- /srv/jailer/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/root/cloud-hypervisor (copied from /usr/bin/cloud-hypervisor)
We are going to refer to /srv/jailer/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/root as <chroot_dir>.
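The layout of <chroot_dir> follows directly from the chroot base, the exec file name and the id. As a small sketch (the function names chroot_dir and host_path are hypothetical), the path can be derived in Python like this, along with the host-side location of a path that Cloud-Hypervisor sees inside the jail:

```python
import os


def chroot_dir(chroot_base: str, exec_file: str, vm_id: str) -> str:
    """Return <chroot_base>/<exec_file_name>/<id>/root."""
    exec_file_name = os.path.basename(exec_file)
    return os.path.join(chroot_base, exec_file_name, vm_id, "root")


def host_path(jail_root: str, jailed_path: str) -> str:
    """Map an absolute path seen inside the jail to its location on the host."""
    return os.path.join(jail_root, jailed_path.lstrip("/"))


root = chroot_dir("/srv/jailer", "/usr/bin/cloud-hypervisor", "e22ohyz7v9h7pwsxhxz3skfj")
print(root)
# /srv/jailer/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/root
print(host_path(root, "/ch-api.sock"))
# /srv/jailer/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/root/ch-api.sock
```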
Let’s also assume that the cpuset cgroups are mounted at /sys/fs/cgroup/cpuset. The jailer will create the following subfolder (which will inherit settings from the parent cgroup):
- /sys/fs/cgroup/cpuset/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj
It’s worth noting that, whenever a folder already exists, nothing will be done, and we move on to the next directory that needs to be created. This should only happen for the common cloud-hypervisor subfolder (but, as for creating the chroot path before, we do not issue an error if folders directly associated with the supposedly unique id already exist).
The jailer then writes the current pid to /sys/fs/cgroup/cpuset/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/tasks. It also writes 0 to /sys/fs/cgroup/cpuset/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/cpuset.mems, and the corresponding CPUs to /sys/fs/cgroup/cpuset/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/cpuset.cpus.
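The cgroup bookkeeping described here amounts to creating a directory and writing a few small files. The Python sketch below mirrors that under stated assumptions: setup_cgroup is a hypothetical name, the base directory is a parameter so the code can be tried against an ordinary folder, and writing the values before the pid is an assumption made because a cpuset generally needs its CPUs and memory nodes set before it accepts tasks.

```python
import os


def setup_cgroup(cgroup_base, parent_cgroup, vm_id, settings, pid):
    """Create <cgroup_base>/<parent_cgroup>/<id> and write the values and the pid."""
    cgroup_dir = os.path.join(cgroup_base, parent_cgroup, vm_id)
    # Existing folders are not treated as an error, as noted in the text.
    os.makedirs(cgroup_dir, exist_ok=True)
    # Assumption: values go in first, then the pid is attached.
    for cgroup_file, value in settings.items():
        with open(os.path.join(cgroup_dir, cgroup_file), "w") as f:
            f.write(value)
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))
    return cgroup_dir
```

On a real cgroup mount this needs sufficient privileges; with cgroup_base set to /sys/fs/cgroup/cpuset, parent_cgroup set to cloud-hypervisor and the id from this example, it would touch the three files listed above.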
Since the --netns parameter is specified in our example, the jailer opens /var/run/netns/my_netns to get a file descriptor fd, uses setns(fd, CLONE_NEWNET) to join the associated network namespace, and then closes fd.
The --daemonize flag is also present, so the jailer opens /dev/null as RW and keeps the associated file descriptor as dev_null_fd (we do this before going inside the jail), to be used later.
Build the chroot jail. First, the jailer uses unshare() to enter a new mount namespace, and changes the propagation of all mount points in the new namespace to private using mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), as a prerequisite to pivot_root(). Another required operation is to bind mount <chroot_dir> on top of itself using mount(<chroot_dir>, <chroot_dir>, NULL, MS_BIND, NULL). At this point, the jailer creates the folder <chroot_dir>/old_root, changes the current directory to <chroot_dir>, and calls syscall(SYS_pivot_root, ".", "old_root"). The final steps of building the jail are unmounting old_root using umount2("old_root", MNT_DETACH), deleting old_root with rmdir, and finally calling chroot(".") for good measure. From now on, the process is jailed in <chroot_dir>.
Create the special file /dev/net/tun, using mknod("/dev/net/tun", S_IFCHR | S_IRUSR | S_IWUSR, makedev(10, 200)), and then call chown("/dev/net/tun", 123, 100), so Cloud-Hypervisor can use it after dropping privileges. This is required to use multiple TAP interfaces when running jailed. Do the same for /dev/kvm.
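The mknod and chown calls translate almost one to one into Python's os module. The sketch below is only an illustration: create_device is a hypothetical name, the 10:200 numbers for /dev/net/tun come from the call shown above, and 10:232 for /dev/kvm is the usual Linux device number rather than something this document states.

```python
import os
import stat

# (path inside the jail, major, minor). 10:200 for /dev/net/tun is taken from
# the text; 10:232 is the usual Linux number for /dev/kvm (assumption here).
DEVICES = [
    ("/dev/net/tun", 10, 200),
    ("/dev/kvm", 10, 232),
]
# Character device, readable and writable by the owner only.
DEVICE_MODE = stat.S_IFCHR | stat.S_IRUSR | stat.S_IWUSR


def create_device(path: str, major: int, minor: int, uid: int, gid: int) -> None:
    """Create a character device node and hand it to uid:gid (needs root)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    os.mknod(path, DEVICE_MODE, os.makedev(major, minor))
    os.chown(path, uid, gid)
```

Inside the jail, looping over DEVICES and calling create_device(path, major, minor, 123, 100) would reproduce the calls described for this example.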
Change ownership of <chroot_dir> to uid:gid so that Cloud-Hypervisor can create its API socket there.
Since the --daemonize flag is present, call setsid() to join a new session, a new process group, and to detach from the controlling terminal. Then, redirect standard file descriptors to /dev/null by calling dup2(dev_null_fd, STDIN), dup2(dev_null_fd, STDOUT), and dup2(dev_null_fd, STDERR). Close dev_null_fd, because it is no longer necessary.
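The daemonize step can be sketched in a few lines of Python, using the same system calls through the os module. The function name daemonize is hypothetical, and the guard against closing a descriptor that is itself one of the standard three is an addition for safety, not something the text describes.

```python
import os


def daemonize(dev_null_fd: int) -> None:
    """Detach from the terminal and point stdin/stdout/stderr at /dev/null."""
    # setsid() fails with EPERM if the caller is already a process group leader.
    os.setsid()
    for std_fd in (0, 1, 2):  # STDIN, STDOUT, STDERR
        os.dup2(dev_null_fd, std_fd)
    # Safety guard (not from the text): never close one of the standard fds.
    if dev_null_fd > 2:
        os.close(dev_null_fd)
```

A caller would open /dev/null with os.open("/dev/null", os.O_RDWR) before entering the jail and pass the resulting descriptor in later, matching the order described earlier.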
Finally, the jailer switches the uid to 123 and the gid to 100, and execs:
./cloud-hypervisor \
--api-sock /ch-api.sock
Now cloud-hypervisor creates the socket at /srv/jailer/cloud-hypervisor/e22ohyz7v9h7pwsxhxz3skfj/root/ch-api.sock to interact with the VM.
- The user must create hard links for (or copy) any resources which will be provided to the VM via the API (disk images, kernel images, named pipes, etc) inside the jailed root folder. Also, permissions must be properly managed for these resources; for example the user which Cloud-Hypervisor runs as must have both read and write permissions to the backing file for a RW block device.
- By default the VMs are not assigned to any NUMA node or pinned to any CPU. The user must manage any fine tuning of resource partitioning via cgroups, by using the --cgroup command line argument.
- It’s up to the user to handle cleanup after running the jailer. One way to do this involves registering handlers with the cgroup notify_on_release mechanism, while being wary about potential race conditions (the instance crashing before the subscription process is complete, for example).
- For extra resilience, the --new-pid-ns flag enables the Jailer to exec the binary file in a new PID namespace, in order to become a pseudo-init process. Alternatively, the user can spawn the jailer in a new PID namespace via a combination of clone() with the CLONE_NEWPID flag and exec().
- We run the jailer as the root user; it actually requires a more restricted set of capabilities, but that's to be determined as features stabilize.
- The jailer can only log messages to stdout/err for now, which is why the logic associated with --daemonize runs towards the end, instead of the very beginning. We are working on adding better logging capabilities.
- When passing the --daemonize option to the jailer without the --new-pid-ns option, the Cloud-Hypervisor process will have a different PID than the Jailer process and killing the Jailer will not kill the Cloud-Hypervisor process. As a workaround to get the Cloud-Hypervisor PID, the Jailer stores the PID of the child process in the jail root directory inside <exec_file_name>.pid in all cases, regardless of whether --new-pid-ns was provided. The suggested way to fetch Cloud-Hypervisor's PID when using the Jailer is to read the cloud-hypervisor.pid file present in the Jailer's root directory.
- If all the cgroup controllers are bunched up on a single mount point using the "all" option, our current program logic will complain it cannot detect individual controller mount points.
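Two of the notes above, staging resources inside the jailed root and reading the stored PID, lend themselves to a short helper script. The following Python sketch is one possible approach, not part of the jailer: stage_resource and read_jailed_pid are hypothetical names, and the PID file is assumed to contain the PID as decimal text.

```python
import errno
import os
import shutil


def stage_resource(src, jail_root, uid=None, gid=None):
    """Hard-link src into jail_root (copy across filesystems); return the jailed path."""
    name = os.path.basename(src)
    dst = os.path.join(jail_root, name)
    try:
        os.link(src, dst)
    except OSError as err:
        # Hard links cannot cross filesystems; fall back to a copy in that case.
        if err.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)
    if uid is not None and gid is not None:
        # Usually requires root. With a hard link this changes the owner of
        # the original file too, since both names share one inode.
        os.chown(dst, uid, gid)
    return "/" + name


def read_jailed_pid(jail_root, exec_file_name="cloud-hypervisor"):
    """Read the PID stored in <jail_root>/<exec_file_name>.pid.

    Assumption: the file holds the PID as decimal text.
    """
    with open(os.path.join(jail_root, exec_file_name + ".pid")) as f:
        return int(f.read().strip())
```

The value returned by stage_resource is the path to hand to Cloud-Hypervisor, because paths passed after -- must be valid relative to the jailed root.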