runc 1.2.0-rc.1 -- "There's a frood who really knows where his towel is."
This is the first release candidate for the 1.2.0 branch of runc. It includes
all patches and bugfixes included in runc 1.1 patch releases (up to and
including 1.1.12). A fair few new features have been added, and some changes
have been made which may affect users. Please help us thoroughly test this
release before we release 1.2.0.
runc now requires a minimum of Go 1.20 to compile.
NOTE: runc currently will not work properly when compiled with Go 1.22 or
newer. This is due to some unfortunate glibc behaviour that Go 1.22
exacerbates in a way that results in containers not being able to start on
some systems. See this issue for more information.
Breaking
- Several aspects of how mount options work have been adjusted in a way that
could theoretically break users that have very strange mount option strings.
This was necessary to fix glaring issues in how mount options were being
treated. The key changes are:
  - Mount options on bind-mounts that clear a mount flag are now always
    applied. Previously, if a user requested a bind-mount with only clearing
    options (such as rw,exec,dev) the options would be ignored and the
    original bind-mount options would be set. Unfortunately this also means
    that container configurations which specified only clearing mount options
    will now actually get what they asked for, which could break existing
    containers (though it seems unlikely that a user who requested a specific
    mount option would consider it "broken" to get the mount options they
    asked for). This also allows us to silently add locked mount flags the
    user did not explicitly request to be cleared in rootless mode, allowing
    for easier use of bind-mounts for rootless containers. (#3967)
  - Container configurations using bind-mounts with superblock mount flags
    (i.e. filesystem-specific mount flags, referred to as "data" in
    mount(2), as opposed to VFS generic mount flags like MS_NODEV) will
    now return an error. This is because superblock mount flags will also
    affect the host mount (as the superblock is shared when bind-mounting),
    which is obviously not acceptable. Previously, these flags were silently
    ignored so this change simply tells users that runc cannot fulfil their
    request rather than just ignoring it. (#3990)
If any of these changes cause problems in real-world workloads, please open
an issue so we can adjust the behaviour to avoid compatibility issues.
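To make the two bind-mount changes above concrete, here is a minimal Go sketch of the behaviour as described. It is not runc's implementation: the option table, the names parseOptions and checkBindMount, the error value, and the locally defined flag constants (copies of the Linux values for MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC and MS_BIND) are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Local copies of Linux mount flag values (the MS_* flags from mount(2)),
// defined here only so that the sketch builds on any platform.
const (
	msRdonly uint32 = 0x1
	msNosuid uint32 = 0x2
	msNodev  uint32 = 0x4
	msNoexec uint32 = 0x8
	msBind   uint32 = 0x1000
)

// flagOp records which VFS flag an option touches and whether it clears it.
type flagOp struct {
	flag  uint32
	clear bool
}

// optionTable is a hypothetical, deliberately small option table.
var optionTable = map[string]flagOp{
	"bind":   {msBind, false},
	"ro":     {msRdonly, false},
	"rw":     {msRdonly, true},
	"nosuid": {msNosuid, false},
	"suid":   {msNosuid, true},
	"nodev":  {msNodev, false},
	"dev":    {msNodev, true},
	"noexec": {msNoexec, false},
	"exec":   {msNoexec, true},
}

// parsedOptions separates flags to set, flags to clear, and anything that is
// not a generic VFS flag (treated here as superblock "data").
type parsedOptions struct {
	Set   uint32
	Clear uint32
	Data  []string
}

var errSuperblockData = errors.New("bind-mount cannot take superblock (data) options")

// parseOptions keeps clearing options instead of dropping them; a later
// option overrides an earlier one for the same flag.
func parseOptions(opts []string) parsedOptions {
	var p parsedOptions
	for _, o := range opts {
		op, known := optionTable[o]
		switch {
		case !known:
			p.Data = append(p.Data, o)
		case op.clear:
			p.Clear |= op.flag
			p.Set &^= op.flag
		default:
			p.Set |= op.flag
			p.Clear &^= op.flag
		}
	}
	return p
}

// checkBindMount rejects bind-mounts that carry superblock data rather than
// silently ignoring that data.
func checkBindMount(opts []string) (parsedOptions, error) {
	p := parseOptions(opts)
	if p.Set&msBind != 0 && len(p.Data) > 0 {
		return p, fmt.Errorf("%w: %s", errSuperblockData, strings.Join(p.Data, ","))
	}
	return p, nil
}

func main() {
	// Only clearing options: they are recorded, not discarded.
	p, err := checkBindMount([]string{"bind", "rw", "exec", "dev"})
	fmt.Printf("set=%#x clear=%#x err=%v\n", p.Set, p.Clear, err)

	// A filesystem-specific option on a bind-mount is reported as an error.
	_, err = checkBindMount([]string{"bind", "size=1g"})
	fmt.Println("err:", err)
}
```

The sketch tracks cleared flags separately from set flags, which is one way to tell "the user asked for rw" apart from "the user said nothing about ro".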
Added
- runc has been updated to OCI runtime-spec 1.2.0, and supports all Linux
features with a few minor exceptions. See docs/spec-conformance.md for more
details.
- runc now supports id-mapped mounts for bind-mounts (with no restrictions on
the mapping used for each mount). Other mount types are not currently
supported. This feature requires MOUNT_ATTR_IDMAP kernel support (Linux
5.12 or newer) as well as kernel support for the underlying filesystem used
for the bind-mount. See mount_setattr(2) for a list of supported filesystems
and other restrictions. (#3717, #3985, #3993)
- Two new mechanisms for reducing the memory usage of our protections against
CVE-2019-5736 have been introduced:
  - runc-dmz is a minimal binary (~8K) which acts as an additional execve
    stage, allowing us to only need to protect the smaller binary. It should
    be noted that there have been several compatibility issues reported with
    the usage of runc-dmz (namely related to capabilities and SELinux). As
    such, this mechanism is opt-in and can be enabled by running runc
    with the environment variable RUNC_DMZ=true (setting this environment
    variable in config.json will have no effect). This feature can be
    disabled at build time using the runc_nodmz build tag. (#3983, #3987)
  - contrib/memfd-bind is a helper daemon which will bind-mount a memfd copy
    of /usr/bin/runc on top of /usr/bin/runc. This entirely eliminates
    per-container copies of the binary, but requires care to ensure that
    upgrades to runc are handled properly, and requires a long-running daemon
    (unfortunately memfds cannot be bind-mounted directly and thus require a
    daemon to keep them alive). (#3987)
- runc will now use cgroup.kill if available to kill all processes in a
container (such as when doing runc kill). (#3135, #3825)
- Add support for setting the umask for runc exec. (#3661)
- libct/cg: support SCHED_IDLE for runc cgroupfs. (#3377)
- checkpoint/restore: implement --manage-cgroups-mode=ignore. (#3546)
- seccomp: refactor flags support; add flags to features, set SPEC_ALLOW by
default. (#3588)
- libct/cg/sd: use systemd v240+ new MAJOR:* syntax. (#3843)
- Support CFS bandwidth burst for CPU. (#3749, #3145)
- Support time namespaces. (#3876)
- Reduce the runc binary size by ~11% by updating
github.com/checkpoint-restore/go-criu. (#3652)
- Add --pidfd-socket to runc run and runc exec to allow for management
processes to receive a pidfd for the new process, allowing them to avoid pid
reuse attacks. (#4045)
Deprecated
- runc option --criu is now ignored (with a warning), and the option will
be removed entirely in a future release. Users who need a non-standard
criu binary should rely on the standard way of looking up binaries in
$PATH. (#3316)
- runc kill option -a is now deprecated. Previously, it had to be specified
to kill a container (with SIGKILL) which does not have its own private PID
namespace (so that runc would send SIGKILL to all processes). Now, this is
done automatically. (#3864, #3825)
- github.com/opencontainers/runc/libcontainer/user is now deprecated, please
use github.com/moby/sys/user instead. It will be removed in a future
release. (#4017)
Changed
- When the Intel RDT feature is not available, its initialization is skipped,
resulting in slightly faster runc exec and runc run. (#3306)
- runc features is no longer experimental. (#3861)
- libcontainer users that create and kill containers from a daemon process
(so that the container init is a child of that process) must now implement
a proper child reaper in case a container does not have its own private PID
namespace, as documented in container.Signal. (#3825)
- Sum anon and file from memory.stat for cgroupv2 root usage,
as the root does not have memory.current for cgroupv2.
This aligns cgroupv2 root usage more closely with cgroupv1 reporting.
Additionally, report root swap usage as sum of swap and memory usage,
aligned with v1 and existing non-root v2 reporting. (#3933)
- Add swapOnlyUsage in MemoryStats. This field reports swap-only usage.
For cgroupv1, Usage and Failcnt are set by subtracting memory usage
from memory+swap usage. For cgroupv2, Usage, Limit, and MaxUsage
are set. (#4010)
- libcontainer: container.Signal no longer takes an all argument. Whether
or not it is necessary to kill all processes in the container individually
is now determined automatically. (#3825, #3885)
- seccomp: enable seccomp binary tree optimization. (#3405)
- runc run / runc exec: ignore SIGURG. (#3368)
- Remove tun/tap from the default device allowlist. (#3468)
- runc --root non-existent-dir list now reports an error for a non-existent
root directory. (#3374)
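The memory accounting changes above come down to two small calculations: on cgroupv2, root usage is the sum of the anon and file counters in memory.stat, and on cgroupv1, swap-only usage is memory+swap usage minus memory usage. The Go sketch below illustrates both under stated assumptions: the function names rootMemoryUsage and swapOnlyFromV1 and the sample values are invented for this example and are not taken from libcontainer.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// rootMemoryUsage sums the "anon" and "file" counters found in the text of a
// cgroup v2 memory.stat file, where each line has the form "<key> <value>".
func rootMemoryUsage(memoryStat string) (uint64, error) {
	var total uint64
	seen := 0
	sc := bufio.NewScanner(strings.NewReader(memoryStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 || (fields[0] != "anon" && fields[0] != "file") {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", sc.Text(), err)
		}
		total += v
		seen++
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if seen != 2 {
		return 0, fmt.Errorf("expected anon and file entries, found %d", seen)
	}
	return total, nil
}

// swapOnlyFromV1 derives swap-only usage from cgroup v1 style counters by
// subtracting memory usage from memory+swap usage, clamping at zero.
func swapOnlyFromV1(memSwapUsage, memUsage uint64) uint64 {
	if memSwapUsage < memUsage {
		return 0
	}
	return memSwapUsage - memUsage
}

func main() {
	// Made-up sample values, not real measurements.
	sample := "anon 4096\nfile 8192\nkernel_stack 16384\n"
	usage, err := rootMemoryUsage(sample)
	fmt.Println("root usage:", usage, "err:", err)
	fmt.Println("swap only:", swapOnlyFromV1(300, 100))
}
```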
Fixed
- In case the runc binary resides on tmpfs, runc init no longer re-execs
itself twice. (#3342)
- Our seccomp -ENOSYS stub now correctly handles multiplexed syscalls on
s390 and s390x. This solves the issue where syscalls the host kernel did not
support would return -EPERM despite the existence of the -ENOSYS stub
code (this was due to how s390x does syscall multiplexing). (#3474)
- Remove tun/tap from the default device rules. (#3468)
- specconv: avoid mapping "acl" to MS_POSIXACL. (#3739)
- libcontainer: fix private PID namespace detection when killing the
container. (#3866, #3825)
- systemd socket notification: fix race where runc exited before systemd
properly handled the READY notification. (#3291, #3293)
- The -ENOSYS seccomp stub is now always generated for the native
architecture that runc is running on. This is needed to work around some
arguably specification-incompliant behaviour from Docker on architectures
such as ppc64le, where the allowed architecture list is set to null. This
ensures that we always generate at least one -ENOSYS stub for the native
architecture even with these weird configs. (#4219)
Removed
- In order to fix performance issues in the "lightweight" bindfd protection
against CVE-2019-5736, the temporary ro bind-mount of /proc/self/exe
has been removed. runc now creates a binary copy in all cases. See the above
notes about memfd-bind and runc-dmz as well as
contrib/cmd/memfd-bind/README.md for more information about how this
(minor) change in memory usage can be further reduced. (#3987, #3599, #2532,
#3931)
- libct/cg: Remove EnterPid (a function with no users). (#3797)
- libcontainer: Remove {Pre,Post}MountCmds which were never used and are
obsoleted by more generic container hooks. (#3350)
Static Linking Notices
The runc binary distributed with this release is statically linked with
the following GNU LGPL-2.1 licensed libraries, with runc acting
as a "work that uses the Library":
The versions of these libraries were not modified from their upstream versions,
but in order to comply with the LGPL-2.1 (§6(a)), we have attached the
complete source code for those libraries which (when combined with the attached
runc source code) may be used to exercise your rights under the LGPL-2.1.
However we strongly suggest that you make use of your distribution's packages
or download them from the authoritative upstream sources, especially since
these libraries are related to the security of your containers.
Thanks to the following contributors who made this release possible:
- Akihiro Suda
- Alban Crequy
- Aleksa Sarai
- Alex Jia
- Alexander Eldeib
- Andrey Tsygunka
- Austin Vazquez
- Bjorn Neergaard
- Brian Goff
- Chengen, Du
- Chethan Suresh
- Christian Happ
- Cory Snider
- CrazyMax
- Daniel, Dao Quang Minh
- Danish Prakash
- Davanum Srinivas
- Eng Zer Jun
- Eric Ernst
- Erik Sjölund
- Evan Phoenix
- Francis Laniel
- Heran Yang
- Irwin D'Souza
- Jaroslav Jindrak
- Jonas Eschenburg
- Jordan Rife
- Kailun Qin
- Kang Chen
- Kazuki Hasegawa
- Kir Kolyshkin
- Markus Lehtonen
- Masahiro Yamada
- Mikko Ylinen
- Mrunal Patel
- Peter Hunt
- Prajwal S N
- Qiang Huang
- Radostin Stoyanov
- Rodrigo Campos
- Ruediger Pluem
- Sebastiaan van Stijn
- Shengjing Zhu
- Sjoerd van Leent
- SuperQ
- TTFISH
- Tianon Gravi
- Vipul Newaskar
- Walt Chen
- Wang-squirrel
- Wei Fu
- Zheao Li
- Zoe
- cdoern
- dharmicksai
- guodong
- hang.jiang
- lengrongfu
- lifubang
- utam0k
- wineway
- yanggang
- yaozhenxiu
Signed-off-by: Aleksa Sarai