Introduce WG Checkpoint Restore #8508

Open
wants to merge 1 commit into
base: master
5 changes: 5 additions & 0 deletions OWNERS_ALIASES
@@ -130,6 +130,11 @@ aliases:
    - mwielgus
    - soltysh
    - swatisehgal
  wg-checkpoint-restore-leads:
Contributor commented:

Is everyone on this list a kubernetes org member?

Member Author commented:

Unfortunately not. I talked with @haircommander and he would be willing to sponsor me. Still looking for a sponsor from another company.

Member commented:

https://github.com/kubernetes/community/blob/master/community-membership.md#requirements

  • Sponsored by 2 reviewers. Note the following requirements for sponsors:
    • Sponsors must have close interactions with the prospective member - e.g. code/design/proposal review, coordinating on issues, etc.
    • Sponsors must be reviewers or approvers in at least one OWNERS file within one of the Kubernetes GitHub organizations*.
    • Sponsors must be from multiple member companies to demonstrate integration across community.

Are there any other existing community members interested in helping run this effort?

    - adrianreber
    - haircommander
    - rst0git
    - viktoriaas
  wg-data-protection-leads:
    - xing-yang
    - yuxiangqian
1 change: 1 addition & 0 deletions sig-api-machinery/README.md
@@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-api-machinery:
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Structured Logging](/wg-structured-logging)


6 changes: 6 additions & 0 deletions sig-auth/README.md
@@ -62,6 +62,12 @@ subprojects, and resolve cross-subproject technical issues and decisions.
- [@kubernetes/sig-auth-test-failures](https://github.com/orgs/kubernetes/teams/sig-auth-test-failures) - Test Failures and Triage
- Steering Committee Liaison: Patrick Ohly (**[@pohly](https://github.com/pohly)**)

## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-auth:
* [WG Checkpoint Restore](/wg-checkpoint-restore)


## Subprojects

The following [subprojects][subproject-definition] are owned by sig-auth:
1 change: 1 addition & 0 deletions sig-cli/README.md
@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-cli:
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Node Lifecycle](/wg-node-lifecycle)


1 change: 1 addition & 0 deletions sig-list.md
@@ -62,6 +62,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md)
| Name | Label | Stakeholder SIGs |Organizers | Contact | Meetings |
|------|-------|------------------|-----------|---------|----------|
|[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps<br>* Autoscaling<br>* Node<br>* Scheduling<br>|* [Kevin Hannon](https://github.com/kannon92), Red Hat<br>* [Marcin Wielgus](https://github.com/mwielgus), Google<br>* [Maciej Szulik](https://github.com/soltysh), Defense Unicorns<br>* [Swati Sehgal](https://github.com/swatisehgal), Red Hat<br>|* [Slack](https://kubernetes.slack.com/messages/wg-batch)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)<br>
|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* API Machinery<br>* Auth<br>* CLI<br>* Node<br>* Scheduling<br>|* [Adrian Reber](https://github.com/adrianreber), Red Hat<br>* [Peter Hunt](https://github.com/haircommander), Red Hat<br>* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford<br>* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University<br>|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)|
|[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps<br>* Storage<br>|* [Xing Yang](https://github.com/xing-yang), VMware<br>* [Xiangqian Yu](https://github.com/yuxiangqian), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)<br>
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture<br>* Autoscaling<br>* Network<br>* Node<br>* Scheduling<br>|* [John Belamaric](https://github.com/johnbelamaric), Google<br>* [Kevin Klues](https://github.com/klueska), NVIDIA<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle<br>* etcd<br>|* [Benjamin Wang](https://github.com/ahrtr), VMware<br>* [Ciprian Hacman](https://github.com/hakman), Microsoft<br>* [Josh Berkus](https://github.com/jberkus), Red Hat<br>* [James Blair](https://github.com/jmhbnz), Red Hat<br>* [Justin Santa Barbara](https://github.com/justinsb), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)<br>
1 change: 1 addition & 0 deletions sig-node/README.md
@@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-node:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
1 change: 1 addition & 0 deletions sig-scheduling/README.md
@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-scheduling:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
36 changes: 36 additions & 0 deletions sigs.yaml
@@ -3583,6 +3583,42 @@ workinggroups:
      liaison:
        github: aojea
        name: Antonio Ojea
  - dir: wg-checkpoint-restore
    name: Checkpoint Restore
    mission_statement: >
      This working group aims to provide a central location for the community to discuss
      the integration of Checkpoint/Restore functionality into Kubernetes.

    charter_link: charter.md
Member commented:

Is the charter included in this PR?

Member Author commented:

Now it is. I didn't add it initially because the lifecycle document says the charter is added later, but looking at other WG PRs, it seems common to include a charter in the initial PR.

    stakeholder_sigs:
Member commented:

SIG Auth may have a big say in the security of this whole restoration pipeline.

@rst0git commented on Jul 19, 2025:

Thank you for pointing this out! Security is definitely an important topic that we need to discuss with sig-auth, both for the checkpoint API and the restoration pipeline. The following paper and master thesis describe our recent work on this topic:

Member Author commented:

I added SIG Auth to the list of stakeholder SIGs.

Member commented:

This is a valuable initiative. The charter mentions that the scope includes checkpointing and restoring 'workloads' and providing 'guidance for developers on checkpoint-friendly app design.' Given this focus, it's essential for SIG Apps to be involved as a key stakeholder.

@janetkuo This is a good idea, thank you so much for suggesting it!

      - API Machinery
      - Auth
      - CLI
      - Node
      - Scheduling
    label: checkpoint-restore
    leadership:
      chairs:
        - github: adrianreber
          name: Adrian Reber
          company: Red Hat
          email: [email protected]
        - github: haircommander
          name: Peter Hunt
          company: Red Hat
          email: [email protected]
        - github: rst0git
          name: Radostin Stoyanov
          company: University of Oxford
          email: [email protected]
        - github: viktoriaas
          name: Viktória Spišaková
          company: Masaryk University
          email: [email protected]
    meetings: []
    contact:
      slack: wg-checkpoint-restore
      mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
  - dir: wg-data-protection
    name: Data Protection
    mission_statement: >
37 changes: 37 additions & 0 deletions wg-checkpoint-restore/README.md
@@ -0,0 +1,37 @@
<!---
This is an autogenerated file!
Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.
To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
--->
# Checkpoint Restore Working Group

This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes.

The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group.

## Stakeholder SIGs
* [SIG API Machinery](/sig-api-machinery)
* [SIG Auth](/sig-auth)
* [SIG CLI](/sig-cli)
* [SIG Node](/sig-node)
* [SIG Scheduling](/sig-scheduling)



## Organizers

* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat
* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat
* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford
* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University

## Contact
- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
90 changes: 90 additions & 0 deletions wg-checkpoint-restore/charter.md
@@ -0,0 +1,90 @@

# WG Checkpoint Restore Charter

This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [sig-governance].

## Scope

The Checkpoint/Restore Working Group aims to solve the problem of transparently
checkpointing and restoring workloads in Kubernetes, functionality that has been
discussed for over five years. The group will deliver the design and
implementation of Checkpoint/Restore functionality in Kubernetes and serve as a
central hub for community information and discussion. This initiative addresses
a wide range of problems, including fault tolerance, improved resource
utilization, and accelerated application startup times.

### In scope

- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration,
fault tolerance, debugging, snapshotting) and gather stakeholder requirements.
- Investigate and propose Kubernetes APIs for checkpoint/restore operations.
- Work with the stakeholder SIGs to find the best integration of
checkpoint/restore functionality and APIs.
- Provide guidance for developers on checkpoint-friendly app design and
recommendations for operators on feature management.
- Work closely with relevant upstream projects (CRI-O, containerd, CRIU)
for alignment and integration.
- Revisit the existing implementations to find and remedy possible inefficiencies.
One example is the existing checkpoint archive format, which has already been
identified as a major source of slowdown.

### Out of scope

- General OS-level checkpointing outside Kubernetes pods/containers is not a
focus of this group.
- The group will not dictate internal application checkpointing logic; it focuses
on Kubernetes platform orchestration of container/pod state.

## Stakeholders

Stakeholders in this working group span multiple SIGs that own parts of the
code in core Kubernetes components and addons.

- SIG CLI
- SIG API Machinery
- SIG Node
- SIG Scheduling
- SIG Auth

## Deliverables

The list of deliverables includes the following high-level features:

- In the early stage, we mainly want to offer a well-defined location for the
community to find information, ask questions, and discuss the next steps of
enabling checkpoint and restore in Kubernetes.

Later:

- Ability to checkpoint and restore a container using kubectl
- Ability to checkpoint and restore a pod using kubectl
- Integration of container/pod checkpointing in scheduling decisions

## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance]
and opts in to updates and modifications to [wg-governance].

[wg-governance]: /committee-steering/governance/wg-governance.md

Additionally, the WG commits to:

- maintain a solid communication line between the Kubernetes groups and the
wider CNCF community
- submit a proposal to the KubeCon/CloudNativeCon maintainers track

## Timelines and Disbanding

As a first mandate, the WG will define a roadmap and tasks in the first quarter
of operation.

After that, the WG will distribute the tasks among community members to define
possible APIs and how they can be integrated into Kubernetes.

Achieving the aforementioned deliverables, also mentioned in the `In scope`
section, will allow us to decide when to disband this WG. There is no
expectation that the Working Group will be converted into a SIG in the long term.

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md