Add Forensic Container Checkpointing KEP #1990
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
kep-number: 2008 | ||
alpha: | ||
approver: "@ehashman" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,307 @@ | ||
# KEP-2008: Forensic Container Checkpointing | ||
|
||
<!-- toc --> | ||
- [Release Signoff Checklist](#release-signoff-checklist) | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [Implementation](#implementation) | ||
- [User Stories](#user-stories) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Future Enhancements](#future-enhancements) | ||
- [Test Plan](#test-plan) | ||
- [Graduation Criteria](#graduation-criteria) | ||
- [Alpha](#alpha) | ||
- [Alpha to Beta Graduation](#alpha-to-beta-graduation) | ||
- [Beta to GA Graduation](#beta-to-ga-graduation) | ||
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) | ||
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) | ||
- [Feature Enablement and Rollback](#feature-enablement-and-rollback) | ||
- [Dependencies](#dependencies) | ||
- [Scalability](#scalability) | ||
- [Implementation History](#implementation-history) | ||
- [Drawbacks](#drawbacks) | ||
- [Alternatives](#alternatives) | ||
<!-- /toc --> | ||
|
||
## Release Signoff Checklist | ||
|
||
Items marked with (R) are required *prior to targeting to a milestone / release*. | ||
|
||
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) | ||
- [ ] (R) KEP approvers have approved the KEP status as `implementable` | ||
- [ ] (R) Design details are appropriately documented | ||
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input | ||
- [ ] (R) Graduation criteria is in place | ||
- [ ] (R) Production readiness review completed | ||
- [ ] Production readiness review approved | ||
- [ ] "Implementation History" section is up-to-date for milestone | ||
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] | ||
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes | ||
|
||
[kubernetes.io]: https://kubernetes.io/ | ||
[kubernetes/enhancements]: https://git.k8s.io/enhancements | ||
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes | ||
[kubernetes/website]: https://git.k8s.io/website | ||
|
||
## Summary | ||
|
||
Provide an interface to trigger a container checkpoint for forensic analysis. | ||
|
||
## Motivation | ||
|
||
Container checkpointing provides the functionality to take a snapshot of a | ||
running container. The checkpointed container can be transferred to another | ||
node and the original container will never know that it was checkpointed. | ||
|
||
Restoring the container in a sandboxed environment provides a means to | |
forensically analyse a copy of the container to understand whether it might | |
have been a threat. As the analysis happens on a copy of | |
the original container, a possible attacker of the original container | |
will not be aware of any sandboxed analysis. | |
|
||
### Goals | ||
|
||
The goal of this KEP is to introduce *checkpoint* and *restore* to the CRI API. | ||
This includes extending the *kubelet* API to support checkpointing single | ||
containers with the forensic use case in mind. | ||
|
||
### Non-Goals | ||
|
||
Although *checkpoint* and *restore* can be used to implement container | ||
migration, this KEP is only about enabling the forensic use case. Checkpointing | |
a pod is not part of this proposal and is left for future enhancements. | |
Review comment: Checkpointing a pod may be required for the forensic use case in the case of sandboxed (e.g. micro VM or gVisor) pods. I think it's fine to leave that out of the initial scope, but just wanted to name it.
Review comment: Right. Checkpointing a pod is not a problem and we demonstrated it in one of the previous proof of concept implementations. We just left it out for this KEP as we only wanted to focus on the things we want to do. |
||
|
||
## Proposal | ||
|
||
### Implementation | ||
|
||
For the forensic use case we want to offer the functionality to checkpoint a | ||
container out of a running Pod without stopping the checkpointed container or | ||
letting the container know that it was checkpointed. | ||
|
||
The corresponding code changes for the forensic use case can be found in the | ||
following pull request: | ||
|
||
* https://github.com/kubernetes/kubernetes/pull/104907 | ||
|
||
The goal is to introduce *checkpoint* and *restore* in a bottom-up approach. | |
As a first step we only want to extend the CRI API so that the container engine | |
can trigger a checkpoint, and to add the low level primitives to the *kubelet* | |
that trigger it. Checkpointing containers requires enabling the feature gate | |
`ContainerCheckpoint`. | |
|
||
In the corresponding pull request a checkpoint is triggered using the *kubelet* | ||
API: | ||
|
||
``` | |
curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly" | |
``` | |
Review comment: This should be seen as a highly privileged operation. How will it be authorized? This seems on par with exec permissions, but exec uses the highly overloaded
Review comment: Not sure I understand this question correctly. To do this operation the normal kubelet API access restrictions apply. I can only do it without authorization if I set
Review comment: Kubelet authorization (without At the bottom of that link, you'll see that we use a dedicated subresource for some requests (logs, stats & metrics) but everything else just uses the highly overloaded I'd like for all new Kubelet APIs to use a new dedicated subresource (e.g.
Review comment: @tallclair thanks for the feedback, for the forensic use case, the initial desire was root on the node but not root in the cluster. @adrianreber i think the net of this is we need to ensure that the test case for this function does not redirect back to proxy, but is its own dedicated subresource. for beta, we should add a section on authorization. see: |
||
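As an illustration only, the request from the curl example can be assembled programmatically. The sketch below assumes the path layout is `/checkpoint/<namespace>/<pod>/<container>`, which is read off the single example above rather than stated by the KEP; the helper names `checkpoint_url` and `checkpoint_request` are hypothetical. The request is only prepared, not sent, because sending it would need the kubelet's usual credentials.

```python
from urllib.parse import quote
from urllib.request import Request


def checkpoint_url(namespace: str, pod: str, container: str,
                   host: str = "localhost", port: int = 10250) -> str:
    """Build the kubelet checkpoint URL for one container.

    Assumption: the path is /checkpoint/<namespace>/<pod>/<container>,
    inferred from the curl example in the KEP.
    """
    # Percent-encode each segment so unusual names cannot alter the path.
    segments = [quote(part, safe="") for part in (namespace, pod, container)]
    return f"https://{host}:{port}/checkpoint/" + "/".join(segments)


def checkpoint_request(namespace: str, pod: str, container: str) -> Request:
    """Prepare (but do not send) the POST request from the curl example."""
    return Request(checkpoint_url(namespace, pod, container), method="POST")


req = checkpoint_request("default", "counters", "wildfly")
print(req.get_method(), req.full_url)
```

Keeping URL construction separate from sending makes it easy to check the path in a unit test without a running kubelet.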
|
||
For the first implementation we do not want to support restore in the | ||
*kubelet*. With the focus on the forensic use case the restore should happen | ||
outside of Kubernetes. The restore is a container engine only operation | ||
in this first step. | ||
|
||
The forensic use case is targeted to be part of the next (1.24) release. | ||
|
||
Although this KEP only adds checkpointing support to the kubelet, the | |
corresponding code pull request extends the CRI API to support both | |
*checkpoint* and *restore*. The reason to add *restore* to the CRI API without | |
implementing it in the kubelet is to make development and especially testing | |
easier on the container engine level. | |
Review comment: Is there a corresponding proposal for the implementation of the API in cri-o and containerd? I think it would be helpful to get CRI implementations on board before dictating the API.
Review comment: Yes: cri-o/cri-o#4199 We used the CRI-O implementation to checkpoint and restore containers and pods in one of the early proof of concept implementations. There is also: kubernetes-sigs/cri-tools#662 |
||
|
||
### User Stories | ||
|
||
To analyze unusual activities in a container, the container should | ||
be checkpointed without stopping the container or without the container | ||
knowing it was checkpointed. Using checkpointing it is possible to take | ||
a copy of a running container for forensic analysis. The container will | ||
continue to run without knowing a copy was created. This copy can then | ||
be restored in another (sandboxed) environment in the context of another | ||
container engine for detailed analysis of a possible attack. | ||
|
||
### Risks and Mitigations | ||
|
||
In its first implementation the risks are low as it tries to be a CRI API | ||
change with minimal changes to the kubelet and it is gated by the feature | ||
gate `ContainerCheckpoint`. | ||
|
||
## Design Details | ||
|
||
The feature gate `ContainerCheckpoint` will ensure that the API | ||
graduation can be done in the standard Kubernetes way. | ||
|
||
A kubelet API to trigger the checkpointing of a container will be | ||
introduced as described in [Implementation](#implementation). | ||
|
||
Also see https://github.com/kubernetes/kubernetes/pull/104907 for details. | ||
|
||
### Future Enhancements | ||
|
||
The initial implementation is only about checkpointing specific containers | ||
out of a pod. In future versions we probably want to support checkpointing | ||
complete pods. To checkpoint a complete pod the expectation on the container | ||
engine would be to do a pod level cgroup freeze before checkpointing the | ||
containers in the pod to ensure that all containers are checkpointed at the | ||
same point in time and that the containers do not keep running while other | ||
containers in the pod are checkpointed. | ||
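The ordering expected for a future pod-level checkpoint (freeze the pod, checkpoint each container, then resume) can be sketched as follows. This is a minimal model, not an implementation: `freeze`, `checkpoint` and `thaw` are hypothetical callables standing in for whatever the container engine would provide, and resuming the pod afterwards is an assumption that follows from the forensic goal of not stopping the workload.

```python
from typing import Callable, Iterable, List


def checkpoint_pod(containers: Iterable[str],
                   freeze: Callable[[], None],
                   checkpoint: Callable[[str], None],
                   thaw: Callable[[], None]) -> List[str]:
    """Freeze the pod, checkpoint every container, then thaw the pod."""
    done: List[str] = []
    freeze()  # all containers stop at the same point in time
    try:
        for name in containers:
            checkpoint(name)
            done.append(name)
    finally:
        thaw()  # always resume the pod, even if a checkpoint fails
    return done


# Record the order of operations with stand-in callables.
events: List[str] = []
checkpoint_pod(
    ["app", "sidecar"],
    freeze=lambda: events.append("freeze"),
    checkpoint=lambda name: events.append(f"checkpoint:{name}"),
    thaw=lambda: events.append("thaw"),
)
print(events)
```

The `try`/`finally` reflects one possible design choice: a failed checkpoint should not leave the pod frozen.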
|
||
One possible result of being able to checkpoint and restore containers and pods | ||
might be the possibility to migrate containers and pods in the future as | ||
discussed in [#3949](https://github.com/kubernetes/kubernetes/issues/3949). | ||
|
||
### Test Plan | ||
|
||
For alpha: | ||
- Unit tests available | ||
Review comment: There also should be an e2e test for this for alpha. I'm not sure how we would unit test the new endpoint.
Review comment: The reason why I went for unit tests only is that none of the CRI implementations are providing the necessary functionality yet. So we would get a "NotImplemented". I can do that. Would that work?
Review comment: Ack, it would give an extension point for the future so not critical. |
||
|
||
For beta: | ||
Review comment: please add detail on checkpoint authorization, we will need to restrict access to the kubelet api resource. on the container runtime, the actual checkpoint is stored in a location that is restricted, but prior to beta, we need clear security practices documented. |
||
- CRI API changes need to be implemented by at least one | ||
container engine | ||
- Enable e2e testing | ||
|
||
### Graduation Criteria | ||
|
||
#### Alpha | ||
|
||
- [ ] Implement the new feature gate and kubelet implementation | ||
- [ ] Ensure proper tests are in place | ||
- [ ] Update documentation to make the feature visible | ||
|
||
#### Alpha to Beta Graduation | ||
|
||
At least one container engine has to have implemented the | |
corresponding CRI APIs before e2e tests for checkpointing can be introduced. | |
|
||
- [ ] Enable the feature by default | |
- [ ] No major bugs reported in the previous cycle | ||
|
||
#### Beta to GA Graduation | ||
|
||
TBD | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
No changes are required on upgrade if the container engine supports | ||
the corresponding CRI API changes. | ||
|
||
## Production Readiness Review Questionnaire | ||
|
||
### Feature Enablement and Rollback | ||
|
||
###### How can this feature be enabled / disabled in a live cluster? | ||
|
||
- [x] Feature gate | ||
- Feature gate name: `ContainerCheckpoint` | ||
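Kubelet feature gates are passed as comma-separated `Name=bool` pairs. The sketch below shows one way a test helper might decide whether `ContainerCheckpoint` is switched on for a given gate string. The function names are hypothetical, and treating an unlisted gate as disabled is an assumption that matches the alpha stage described in this KEP.

```python
from typing import Dict


def parse_feature_gates(value: str) -> Dict[str, bool]:
    """Parse a comma-separated Name=bool list such as 'A=true,B=false'."""
    gates: Dict[str, bool] = {}
    for item in filter(None, (part.strip() for part in value.split(","))):
        name, sep, flag = item.partition("=")
        flag = flag.strip().lower()
        if not sep or flag not in ("true", "false"):
            raise ValueError(f"malformed feature gate entry: {item!r}")
        gates[name.strip()] = flag == "true"
    return gates


def checkpoint_enabled(value: str) -> bool:
    """Assumption: the gate is off unless explicitly set to true (alpha)."""
    return parse_feature_gates(value).get("ContainerCheckpoint", False)


print(checkpoint_enabled("ContainerCheckpoint=true"))
print(checkpoint_enabled("ContainerCheckpoint=false"))
```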
|
||
###### Does enabling the feature change any default behavior? | ||
|
||
No. | ||
|
||
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? | ||
|
||
Yes. By disabling the feature gate `ContainerCheckpoint` again. | ||
|
||
###### What happens if we reenable the feature if it was previously rolled back? | ||
|
||
Checkpointing containers will be possible again. | ||
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
Currently no. | ||
Review comment: Make sure you manually test this. (i.e. run kubelet with feature gate on, test feature, turn feature gate off and restart it, test that it's disabled) |
||
|
||
### Dependencies | ||
|
||
CRIU needs to be installed on the node, but on most distributions it is already | ||
a dependency of runc/crun. It does not require any specific services on the | ||
cluster. | ||
|
||
### Scalability | ||
|
||
###### Will enabling / using this feature result in any new API calls? | ||
|
||
The newly introduced CRI API call to checkpoint a container/pod will be | ||
used by this feature. The kubelet will make the CRI API calls and it | ||
will only be done when a checkpoint is triggered. No periodic API calls | ||
will happen. | ||
|
||
###### Will enabling / using this feature result in introducing new API types? | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in any new calls to the cloud provider? | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in increasing size or count of the existing API objects? | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? | ||
|
||
No. It will only affect checkpoint CRI API calls. | ||
|
||
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? | ||
|
||
During checkpointing each memory page will be written to disk. Disk usage will increase by | ||
the size of all memory pages in the checkpointed container. Each file in the container that | ||
has been changed compared to the original version will also be part of the checkpoint. | ||
Disk usage will overall increase by the used memory of the container and the changed files. | ||
The checkpoint archive written to disk can optionally be compressed. The current implementation | |
does not compress the checkpoint archive on disk. | |
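As a rough worked example of the estimate above, the uncompressed checkpoint size is approximately the container's used memory plus the size of its changed files. The function name and the sample figures are illustrative only, and the sketch ignores any archive metadata overhead.

```python
MIB = 1024 * 1024


def estimate_checkpoint_bytes(memory_bytes: int, changed_file_bytes: int) -> int:
    """Approximate the uncompressed checkpoint size on disk.

    Simplification: used memory pages plus changed files, with no
    allowance for archive metadata.
    """
    if memory_bytes < 0 or changed_file_bytes < 0:
        raise ValueError("sizes must be non-negative")
    return memory_bytes + changed_file_bytes


# Illustrative figures: 512 MiB of used memory and 40 MiB of changed files.
total = estimate_checkpoint_bytes(512 * MIB, 40 * MIB)
print(total // MIB, "MiB")  # prints "552 MiB"
```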
|
||
## Implementation History | ||
|
||
* 2020-09-16: Initial version of this KEP | ||
* 2020-12-10: Opened pull request showing an end-to-end implementation of a possible use case | ||
* 2021-02-12: Changed KEP to mention the *experimental* API as suggested in the SIG Node meeting 2021-02-09 | ||
* 2021-04-08: Added section about Pod Lifecycle, Checkpoint Storage, Alternatives and Hooks | ||
* 2021-07-08: Reworked structure and added missing details | ||
* 2021-08-03: Added the forensic user story and highlight the goal to implement it in small steps | ||
* 2021-08-10: Added future work with information about pod level cgroup freezing | ||
* 2021-09-15: Removed references to first proof of concept implementation | ||
* 2021-09-21: Mention feature gate `ContainerCheckpointRestore` | ||
* 2021-09-22: Removed everything which is not directly related to the forensic use case | ||
* 2022-01-06: Reworked based on review | ||
* 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint` | ||
|
||
## Drawbacks | ||
|
||
During checkpointing each memory page of the checkpointed container is written to disk, | |
which can slightly lower performance and increase disk IO operations during checkpoint | |
creation. | |
|
||
In the current CRI-O implementation the checkpoint archive is created so that only | |
the `root` user can access it. As the checkpoint archive contains all memory pages, | |
it can potentially contain secrets which are expected to be | |
in memory only. | |
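Because the archive may hold in-memory secrets, it can be worth verifying its ownership and permissions wherever it is stored. The following is a minimal sketch of such a check; the function name is hypothetical, and the KEP does not specify the archive's location or exact mode, only that `root` alone can access it.

```python
import os
import stat


def archive_is_private(path: str, expected_uid: int = 0) -> bool:
    """Return True if only `expected_uid` (root by default) can access `path`."""
    info = os.stat(path)
    owner_ok = info.st_uid == expected_uid
    # No permission bits may be set for group or others.
    mode_ok = (info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return owner_ok and mode_ok
```

A caller would pass the path of a checkpoint archive; `expected_uid` is a parameter so the check can also be exercised by a non-root test user.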
|
||
The current CRI-O implementation handles SELinux labels as well as seccomp and restores | |
these settings as they were before. A restored container is therefore as secure as | |
before, but it is important to be careful where the checkpoint archive is stored. | |
|
||
During checkpointing CRIU injects parasite code into the process being checkpointed. | |
On an SELinux-enabled system, access to the parasite code is limited to the | |
label of the corresponding container. On a non-SELinux system it is limited to the | |
`root` user (which can access the process in any way). | |
|
||
## Alternatives | ||
|
||
Another possibility to use checkpoint restore would be, for example, to trigger | ||
the checkpoint by a privileged sidecar container (`CAP_SYS_ADMIN`) and do the | ||
restore through an Init container. | ||
|
||
The reason to integrate checkpoint restore directly into Kubernetes and not | |
with helpers like sidecar and init containers is that checkpointing has, | |
for many years, been deeply integrated into multiple container runtimes and engines | |
and this integration has been reliable and well tested. Going another way in | |
Kubernetes would make the whole process much more complicated and fragile. | |
Using checkpoint and restore in Kubernetes other than through the existing paths of | |
runtimes and engines is not well explored and may not even be possible, because | |
checkpointing and restoring require much information that is only | |
available by working closely with runtimes and engines. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
title: Forensic Container Checkpointing | ||
kep-number: 2008 | ||
authors: | ||
- "@adrianreber" | ||
owning-sig: sig-node | ||
participating-sigs: | ||
- TBD | ||
status: implementable | ||
creation-date: 2020-09-16 | ||
last-updated: 2022-01-20 | ||
reviewers: | ||
- "@mrunalp" | ||
- "@elfinhe" | ||
approvers: | ||
- "@dchen1107" | ||
prr-approvers: | ||
- "@ehashman" | ||
|
||
# The target maturity stage in the current dev cycle for this KEP. | ||
stage: alpha | ||
|
||
# The most recent milestone for which work toward delivery of this KEP has been | ||
# done. This can be the current (upcoming) milestone, if it is being actively | ||
# worked on. | ||
latest-milestone: "v1.24" | ||
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
milestone: | ||
alpha: "v1.24" | ||
beta: "v1.25" | ||
stable: "v1.27" | ||
|
||
# The following PRR answers are required at alpha release | ||
# List the feature gate name and the components for which it must be enabled | ||
feature-gates: | ||
- name: ContainerCheckpoint | ||
components: | ||
- kubelet | ||
disable-supported: true | ||
|
||
# The following PRR answers are required at beta release | ||
metrics: | ||
- "N/A" |
Review comment: What is the scope of what's included in the checkpoint? Memory snapshot? Writeable layer snapshot? RO snapshot? What about volumes?
Review comment: The simple answer is, everything that is in the container.
Everything external (devices, mounts, volumes) is not.
Volumes are special because they can exist as unnamed volumes (I hope that is the right name), which are included in our current Podman implementation. External volumes are not included.
Everything external needs additional handling. Right now this is done manually, but it can become part of Kubernetes at some point. Whether an external resource can be migrated to a new location depends on its type.