What does "Pod Sandbox" mean to Aurae? #433
Hi, just to add a tiny detail, in case that's helpful, about this:
You can "save" a namespace by bind-mounting its pseudo-file (the thing found in `/proc/<pid>/ns/`). For instance, let's create a namespace and configure a loopback interface in it, with a specific IP address so we can identify it later:
Now, "save" the namespace by bind-mounting it:
Leave the namespace, check that we're "outside" and that the process that we created is gone:
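```sh
exit               # leave (and terminate) the unshared shell
ip addr show lo    # back on the host: the 10.99.99.99 marker address is not here
```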
Then re-enter that namespace thanks to the bind-mount of the pseudo-file:
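```sh
# The namespace survived with no process in it, thanks to the bind mount.
sudo nsenter --net=/run/mynetns ip addr show lo   # the 10.99.99.99 address is still there
# Cleanup when finished:
sudo umount /run/mynetns && sudo rm /run/mynetns
```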
I don't know for sure why Kubernetes is using the "pod sandbox" concept. Perhaps because the Docker API doesn't expose anything to manipulate these network namespaces directly, and these pause containers were as good as anything else to do that. And/or perhaps the pause containers are a place as good as any to do zombie reaping (when sharing the PID namespace). I don't have a particularly strong opinion on your question, though (I'm just watching the space from the sidelines with excitement :))

Edit: I was mentioning the bind-mount possibility just in case that unlocks interesting options in your scenario, i.e. the possibility of preserving a network namespace on its own, without requiring a process - which itself might require a Cell or leak some other abstraction.
You probably already know about this, but if you plan to go the VM route, https://github.com/rust-vmm might be useful.
Thank you @jpetazzo and @gabriel-samfira! This is great feedback! @jpetazzo this is useful context, I had no idea you could "park" a namespace just by bind mounting the pseudo namespace file. This will surely come in handy one day 😉 and I also suspect this will be one of those issues that serves as hidden knowledge that folks will discover while looking for examples of how to save a namespace. The only way we will know how many people find this useful in the coming years will be by folks leaving emojis on the thread to show us. As far as the decision goes I am pretty convinced on Option 1 and will be looking at this more on Twitch today. Maybe a better set of questions:
Okay so we are going to perform a small experiment to validate my theory that we can run Option 1) as a default and fall back to Option 3).

Hypothesis

I believe it should be possible for 2 containers to share a namespace (specifically a network namespace) with Youki without making any changes to the code. I also believe we should be able to re-create the "Kubernetes Pod" experience directly in a VM with lightweight containers and some basic understanding of how chroot works.
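For reference on why no code changes should be needed: the OCI runtime spec already lets a `config.json` point a namespace at an existing one via a `path`, so a second container can (in principle) join the first container's network namespace. A minimal sketch, where the PID and path are placeholders:

```json
{
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "network", "path": "/proc/12345/ns/net" }
    ]
  }
}
```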
Results of the experiment

I was able to run an nginx container with Youki. The raw config.json:

```json
{
"ociVersion": "1.0.2-dev",
"root": {
"path": "rootfs",
"readonly": false
},
"mounts": [
{
"destination": "/var/log",
"type": "bind",
"source": "/var/log",
"options": [
"rbind",
"rw"
]
},
{
"destination": "/tmp",
"type": "tmpfs",
"source": "tmpfs"
},
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"nosuid",
"noexec",
"nodev",
"relatime",
"ro"
]
}
],
"process": {
"terminal": false,
"user": {
"uid": 0,
"gid": 0
},
"args": [
"nginx",
"-g",
"daemon off;"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_SETUID",
"CAP_SETGID",
"CAP_CHOWN",
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"effective": [
"CAP_SETUID",
"CAP_SETGID",
"CAP_CHOWN",
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"inheritable": [
"CAP_SETUID",
"CAP_CHOWN",
"CAP_SETGID",
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"permitted": [
"CAP_SETUID",
"CAP_SETGID",
"CAP_CHOWN",
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
],
"ambient": [
"CAP_SETUID",
"CAP_SETGID",
"CAP_CHOWN",
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE"
]
},
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
}
],
"noNewPrivileges": true
},
"hostname": "nginx",
"annotations": {},
"linux": {
"resources": {
"devices": [
{
"allow": true,
"type": null,
"major": null,
"minor": null,
"access": "rwm"
}
]
},
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
```
In this experiment I was able to share the network namespace with the host by removing the network namespace configuration from the config.json above.

Procedure

Run the nginx container with youki, and verify the experiment.

```sh
mkdir nginx
cd nginx
# Copy the config.json above to here
mkdir rootfs
sudo -E docker create --name nginx nginx
sudo -E docker export nginx | sudo -E tar -C rootfs -xf -
sudo -E youki run -b . nginx
netstat -tlpn | grep "80"                       # Verify that nginx is listening in the host network namespace
tail -f /var/log/nginx/*                        # Verify that the bind mount is working with the host
emacs rootfs/usr/share/nginx/html/index.html    # Edit the nginx hello world and view localhost:80
```

Then clean up with youki:

```sh
sudo -E youki kill nginx SIGKILL
sudo -E youki delete nginx
```
The result of the experiment has me convinced of our approach moving forward.

Decision

I am making the decision to pursue Option 1. All Aurae Pod Sandboxes will run as a lightweight virtual machine with a nested `auraed`.
In the event that virtualization is not available we fall back on the "flat container" model described in Option 3.

Implications

Each pod gets its own kernel. Each pod gets its own set of network devices. Each pod gets its own guest `auraed`.
What's the reasoning behind this? Do you want to enable applications to interact with auraed? The alternative would be to create a separate namespace within the VM for the pod's containers.
The main motivator is that this is what Kubernetes does today. See the CC-BY-SA licensed diagram and the referenced documentation for pod networking.
In my experience the network namespace is the "big one" that really matters for a pod. Kubernetes has always maintained that a pod should share local storage and local network. Container volumes make the storage discussion pretty simple as pods just mount volumes between each other, but the network namespace sharing is key for pods to be able to do things like run sidecars.

Should an application interact with the Aurae daemon?

As far as applications interacting with Auraed, I think the answer is yes. I think it's too early in the project to say exactly what an application will use Auraed for specifically. However I know enough about infrastructure, sidecars, and platforms to know that most app teams will want the basics (secrets, service discovery, etc). I think Aurae attempts to simplify a lot of these discussions by bringing small features into scope. As far as ports being eaten up in the same network namespace, yes that will be a consequence, and it is exactly why we have the TCP port situation we do today in Kube with pods needing to manage ports in some cases. I think this is the right thing to do, however I'm open to having my mind changed.

Do we want to use eBPF to bridge the network across the namespace?

Now networking on the pod with eBPF -- while exciting -- I think it's wrong to add too much magic there unless absolutely critically necessary. The whole point of Aurae is to be secure and boring, and sharing a network namespace in a simple and boring way without having to manage nested eBPF probes seems like the way to go. I am very traumatized from Kubernetes CNI and I don't want to go down the path of making the world more complex in exchange for some flexibility. I think a much more interesting conversation to have is to admit that the pod sandbox boundary is the network boundary in a pod, and start talking about how to map Linux network devices to the pod.
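To make the "simple and boring" sharing concrete, here is a rough sketch (the namespace name, port, and commands are stand-ins for real pod containers): two processes dropped into the same network namespace share one localhost, which is all a sidecar needs to reach the main container.

```sh
# Run as root. Create a throwaway network namespace standing in for a pod.
ip netns add pod-demo
ip netns exec pod-demo ip link set lo up

# "Main container": a trivial HTTP server bound inside the pod's namespace.
ip netns exec pod-demo python3 -m http.server 8080 &
MAIN_PID=$!
sleep 1

# "Sidecar": a second process in the same namespace reaches it over localhost.
ip netns exec pod-demo curl -s http://127.0.0.1:8080 >/dev/null \
  && echo "sidecar reached main over localhost"

# Cleanup.
kill "$MAIN_PID"
ip netns del pod-demo
```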
You could reach auraed from within the Pod via e.g.
This is an interesting topology. I am not necessarily opposed to doing something like this by default, however I do have a few questions.
Basically what I am fishing for is a supporting argument for the extra complexity of maintaining a container network namespace. Like I mentioned I am still very traumatized from the CNI discussions, and my intuition is telling me to keep things flat/simple and focus on network devices rather than complex synthetic overlay networks. I understand these overlay networks are possible, I just know they introduce a lot of complexity, risk, and overhead from a performance perspective. I want to simplify things. I want Aurae to be secure, and boring.

Maybe a better way of framing what I am asking:

Is it reasonable to have the root network namespace of the VM serve as the pod's network namespace?
You can have a flat network model if you address the nested auraed by link-local addresses only. Still, you'd have the network namespace for the pod, but it's directly routed into/out of the VM without an overlay or VLAN. So if the nested auraed doesn't need to be reachable from anything but the host auraed, we can use link-local IPv6 addresses, while routing the pod network into the VM. I understand you don't like CNIs. But how do you feel about sidecars? By not separating auraed from customers' containers you effectively inject an aurae sidecar into every Pod.
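For what it's worth, a rough sketch of what link-local-only addressing looks like in practice (the interface name and address below are made up): link-local IPv6 addresses live in fe80::/10, exist on every interface automatically, and have to be scoped to an interface when dialed.

```sh
# On the host: find the link-local address of the interface facing the pod VM
# ("vm-tap0" is just an example name).
ip -6 addr show dev vm-tap0 scope link

# Link-local addresses are only unique per link, so the zone (%interface) is required.
ping -6 -c1 fe80::d0ca:fe00:1%vm-tap0

# A client dialing a nested auraed would use the bracketed, zoned form,
# e.g. https://[fe80::d0ca:fe00:1%vm-tap0]:8080 (address and port are illustrative).
```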
If the network namespace can communicate with the auraed at the pod level (micro VM), how is this different from injecting an auraed sidecar into the pod? Is this about some notion of separation of infrastructure (auraed) from business application (pod), where it's a simpler story to tell if we have the separated network namespace? Are there any benefits to having this separated network namespace beyond a conceptual model improvement?
I don't think aurae should have the concept of pods other than as a specific integration for hosting k8s workloads. It is a feature that seems to align too closely with the scaling goals of k8s, which I don't think this project has.
How did we get here?
So I was streaming recently and started to look through the implementation details of how we are implementing the Container Runtime Interface (CRI).
Naturally this opened up a can of worms. One implementation detail led to another, and it quickly spiraled out of control. This left me spending the weekend thinking to myself about what the project should do. I want this GitHub issue to serve as an architecture decision record (ADR) for what we intend to do.
But first, some context, history, and vocabulary.
What is a "Pod Sandbox" and where did it come from?
Here is the shape that I think of when I think of what most of the industry refers to as "a pod".
Which is basically to say that it's a bounded set of containers that exist within some isolation zone. Kubernetes, for example, likes to pretend that the containers within a pod all share the same localhost, storage, network, etc.
In the context of OpenShift sandboxed containers, a pod is implemented as a virtual machine. Several containers can run in the same pod on the same virtual machine.[1]
The history of a Pod (as I understand it) is relatively simple, and makes sense given the behavior of the `clone(2)` and `clone3(2)` system calls. Basically you cannot "create" a new namespace in Linux. You can, however, execute a new process in a new namespace. So what do you do when you just want an "empty" boundary and aren't ready to start any work in your namespace yet? Or more importantly, how do you keep the namespace around if your container exits? Linux will destroy a namespace if there is no longer something executing in it.

There is some historical context that mentions that the Kubernetes Pause Container was the answer to this problem: the pod's namespaces are created by calling `clone(2)` or `clone3(2)` with a new Pause process, which then sits there to keep them alive. Thus, the paradigm of the pod sandbox was created as a way to hold a set of these containers together.
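A tiny sketch of that idea (not literally how Kubernetes does it; this just uses unshare/nsenter to show a do-nothing process keeping namespaces alive):

```sh
# Run as root. Start a do-nothing "pause" process in fresh namespaces;
# the namespaces exist for as long as it keeps running.
unshare --net --uts --ipc sleep infinity &
PAUSE_PID=$!

# A later "container" process can join the same namespaces through the pause process.
nsenter --target "$PAUSE_PID" --net --uts --ipc ip addr show

# When the pause process exits, the kernel tears the namespaces down
# (unless something like a bind mount still references them).
kill "$PAUSE_PID"
```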
Here are some more resources:
Option 1) A Pod is a VM
This is a straightforward proposal and can be a viable and powerful path for Aurae to adopt.
Basically we follow suit with OpenShift, gVisor, and Firecracker and establish a virtualization zone (basically a VM) for every pod by default.
Once the VM has been started we can delegate out to the nested `auraed` to run a container using our own RPC. The containers can share the same namespaces as the host, and we can mount volumes between them, communicate over the local network, etc. We can bake in more logic (such as network devices) as well in the future.

Implementation would look like:

1. Start a new `auraed` VM.
2. Dial the nested `auraed` over the network, and schedule a container using a new RPC such as `RunContainer()`.
.Option 2) A pod is a container, and we spritz up our cells
In this option we would need to do 2 things.
This option is attractive because it addresses the package management and supply chain concerns: everything becomes a tarball/OCI image at the end of the day.
Basically we would create a new Youki container with a nested `auraed` running as an init process. Then we can access the `auraed` RPC for cells, and send an OCI image to the cell service to un-tar the image and "install" it as we would with any package manager. This kind of violates the entire supply chain guarantee and image immutability thing that everyone seems to love about containers, so I am not sure this is a good approach. However this also feels a lot more intuitive to anyone used to systemd and bare metal machines.

This approach would involve a new RPC for the cell service that allows the user to pass a remote URL for an OCI image/tarball for the cell service to download and install. The cells would be created inside the container, and they could just do what they needed.
One thing to figure out would be the need to chroot each cell filesystem, as otherwise we have no way of preventing 2 "containers" from sharing the same files/paths/directory structure. The fact that we would need to chroot each cell (that the user would be calling a container) is a red flag.
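For the sake of discussion, roughly what "chroot each cell" would amount to (paths and names are hypothetical, not an actual Aurae layout):

```sh
# Unpack the OCI image/tarball into a per-cell root.
mkdir -p /var/lib/aurae/cells/my-cell/rootfs
tar -C /var/lib/aurae/cells/my-cell/rootfs -xf image.tar

# Processes in the cell would then be chrooted into that root,
# so two "containers" never see each other's files.
chroot /var/lib/aurae/cells/my-cell/rootfs /bin/sh -c 'ls /'
```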
Option 3) A pod is a container, and your containers are also containers
Basically we create a new `auraed` container when a user creates a new pod sandbox. We establish new namespaces for the new container. When it comes time to schedule a nested container inside the new pod sandbox we call out to the nested `auraed` and say "RunContainer", but just use the namespaces from the first pod.
This feels... wrong. I can't explain why. I believe this is how a lot of container runtimes do things now, and it just seems to be an anti-pattern, as we could actually build recursive isolation boundaries.
The Decision
I want some help talking through the decision -- we can close the issue once we have come to conviction internally.