Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DEP-06: Immutable ETCD Backups #884

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

seshachalam-yv
Copy link
Contributor

@seshachalam-yv seshachalam-yv commented Oct 1, 2024

How to categorize this PR?

/area backup
/area disaster-recovery
/area security
/area compliance
/area storage
/kind enhancement

What this PR does / why we need it:
This PR adds DEP-06: Immutable ETCD Backups. The proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Add DEP-06: Immutable ETCD Backups

---------

Co-authored-by: Saketh Kalaga <[email protected]>
@seshachalam-yv seshachalam-yv requested a review from a team as a code owner October 1, 2024 11:46
@gardener-robot gardener-robot added needs/review Needs review area/backup Backup related area/compliance Compliance related area/disaster-recovery Disaster recovery related area/security Security related area/storage Storage related kind/enhancement Enhancement, improvement, extension size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Oct 1, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 1, 2024
@seshachalam-yv seshachalam-yv changed the title Add DEP-06: Immutable ETCD Backups Add DEP-06: Immutable ETCD Backups Oct 2, 2024
@seshachalam-yv seshachalam-yv changed the title Add DEP-06: Immutable ETCD Backups Add DEP-06: Immutable ETCD Backups Oct 2, 2024
@seshachalam-yv seshachalam-yv changed the title Add DEP-06: Immutable ETCD Backups Add DEP-06: Immutable ETCD Backups Oct 2, 2024
@anveshreddy18 anveshreddy18 self-assigned this Oct 3, 2024
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved

- Implement immutable backup support for ETCD clusters.
- Secure backup data against unintended or unauthorized modifications after creation.
- Ensure backups are consistently available and intact for restoration purposes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make backups consistently available that's job of storage provider, so it's wrong to mention this I guess.
I think you can mention about enhancing the garbage collection of backup-restore to work with immutable backups

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel point 3 is not something that we will be doing.
Instead yup we can mention how we manage the life cycle of these backups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other points also don't make too much sense. Only the second point is a goal. Implementing immutable backup support is a way to achieve the goal.
Only keeping point 2.

## Glossary

- **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
- **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be you can here mention about snapshot compaction DEP link: https://github.com/gardener/etcd-druid/blob/master/docs/proposals/02-snapshot-compaction.md

docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved

- **Type:** Duration
- **Default:** `24h`
- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that there should be atleast 1 full snapshot remains within the immutability period and are safeguarded against becoming mutable.

docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Oct 9, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 10, 2024
Copy link
Collaborator

@ashwani2k ashwani2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @seshachalam-yv @ishan16696 @renormalize for the proposal.
It captures thing well, but I've put some open points esp. on the structure as well some details esp. as it addresses design considerations.

docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved

- Implement immutable backup support for ETCD clusters.
- Secure backup data against unintended or unauthorized modifications after creation.
- Ensure backups are consistently available and intact for restoration purposes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel point 3 is not something that we will be doing.
Instead yup we can mention how we manage the life cycle of these backups.

docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved

### Excluding Snapshots Under Specific Circumstances

Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can happen even outside of immutable backups scenarios as well, so how is this handled there? I'm guessing currently by deleting manually the affected snapshots.
But with this new approach it should be same mechanism there as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If snapshots are mutable, this is achieved through deletion of snapshots.
The same functionality will be achieved through custom metadata tags. Will enhance the doc for this.


Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:

- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear from the doc who takes care of attaching the custom metadata flag and how its consumed? Can we describe here to avoid any unintended interpretation of the flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Human operators add these tags; will include this.


## Implementation Steps

1. **Enhance the Compaction Job:**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should create a new name for the job for Hibernated Full Snapshots and ensure that we have flags and even flow which can leverage the existing compaction feature and enhance it with additional change required for Immuatable backup snapshotting and garbage collection.

Also we cannot have a compaction job for hibernated cluster in practical terms, so it will be even more confusing to see a compaction job running for a hibernated cluster.

- Configure buckets with appropriate immutability settings before deploying ETCD clusters.
- Ensure that the immutability periods align with organizational policies.

- **Compaction Job Configuration:**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the retry threshold for this job?
What happens if it fails to run for a period of 24hrs.
What happens if druid is down?
What happens when druid comes back up esp. for failed jobs which have breached the retry threshold?
What happens if we breach the bucket retention period? Is no data to restore possible on wake-up of hibernated clusters.
Does garbage collection runs independent or in sequence only after the job takes a full snapshot on its run.

docs/proposals/06-immutable-etcd-backups.md Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Outdated Show resolved Hide resolved
docs/proposals/06-immutable-etcd-backups.md Show resolved Hide resolved
@renormalize renormalize self-assigned this Oct 10, 2024
@renormalize renormalize added this to the v0.25.0 milestone Nov 13, 2024
* The operator task framework is used to enhance the proposal in the approach which re-uploads the latest full snapshot to prolong the immutability.

---------

Co-authored-by: Seshachalam Yerasala Venkata <[email protected]>
@gardener-robot gardener-robot added the size/l Size of pull request is large (see gardener-robot robot/bots/size.py) label Nov 21, 2024
Copy link
Collaborator

@ashwani2k ashwani2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the well written DEP.
I've some suggestions in structure, naming, and usage. Please have a look.


#### ETCD Backup Configuration

Operators must ensure that the ETCD backup configuration aligns with the immutability requirements, including setting appropriate immutability periods.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we elaborate here on the ETCD backup configuration or the shoot providerConfig changes for the Backup that we propose to bring as part of this feature.
Also, may be how the same should be passed to standalone druid usage can also be mentioned.

Its important as after this section we suddenly jump into handling of hibernated clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have referred to the links for configuring immutable backup buckets for both standalone and Gardener cases. We have removed the compaction approach since both the compaction and re-uploading approaches use the operator framework and take full snapshots in the same way. The only difference is starting the embedded etcd. The more appropriate approach is re-uploading. Therefore, we have removed the compaction approach as it is redundant, doing the same thing apart from starting the embedded etcd and compacting.


#### Handling of Hibernated Clusters

When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this behavior depends on the cloud provider; refer to Comparison of Storage Provider Properties

Is there a variance with cloud provider on the preceeding statement "backups become mutable again"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Behavior after expiry of objects behaves on the cloud provider.


When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.

Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.
Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the design of the solution.


**Proposed Solution:**

Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a derived information, based on the last full snapshot time and the etcd.spec.replicas being 0.
We already have this information so why not use that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have removed the compaction approach as mentioned here.

Anyways, controller periodically creates the ExtendEtcdSnapshotImmutabilityTask if etcd.spec.backup.store.immutability.retentionType is set to "Bucket" and based on etcd.spec.backup.fullSnapshotSchedule.

- Introduce a new flag:
- **Flag:** `--hibernation-snapshot-interval`
- **Type:** Duration
- **Default:** `24h`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone sets this value to more than 24h lets say sets it to 72h, won't this already break the contract of immutability. I think we should not expose this internal detail as a config, unless you have a case at hand where this is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the responsibility of the operator that sets up etcd-druid to configure this flag correctly.
For example, if the bucket is configured to be immutable for 15 days, then the operator wanting to trigger snapshots every 3 days is fine, a new snapshot every day is unnecessary so this can be left configurable, in my opinion.

- The controller scales in the ETCD cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
- The controller creates the `EtcdSnapshotImmutabilityExtension` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.

- **`EtcdSnapshotImmutabilityExtension` specification:**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **`EtcdSnapshotImmutabilityExtension` specification:**
- **`ExtendEtcdSnapshotImmutability` specification:**

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed


- **Backward Compatibility:**
- Existing clusters without immutable buckets will continue to function without change.
- The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
- The introduction of the `ExtendEtcdSnapshotImmutability` does not affect clusters that are not hibernated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
- **Update `Etcd` CRD:**
- Add `etcd.spec.hibernation`:
Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear how will the controller will adopt the existing ETCD resources which have a hibernation schedule already in place, esp. in the context of Gardener usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have removed the section hibernation (support for specifying an intent for hibernation) as a non goal, since this will be handled with different DEP #922.


###### Disadvantages

- **Additional Complexity:** Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we see this as additional complexity, weren't we planning to anyways implement an operator-task controller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional complexity since it is a hard prerequisite; but you are right. This is the only way forward; will remove this.


- **Resource Consumption:** Starting an embedded ETCD instance periodically consumes resources.

##### Approach 2: Re-upload of the latest snapshot
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call this approach -- Re-upload of the latest snapshot while in conclusion section we have called this approach Copy backup task. Can we have one naming convention for the approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a miss while renaming the approach to "re-upload of the latest"

@shreyas-s-rao shreyas-s-rao modified the milestones: v0.25.0, v0.26.0 Nov 26, 2024
This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
- **Update `Etcd` CRD:**
- Add `etcd.spec.hibernation`:
Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernation`.

The field is called hibernation in some places and hibernated in other places. Can you check which one you wanted to have and correct the remaining to that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review. Will do this.


- Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
- **ETCD Controller Logic:**
- When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
- When hibernation is requested, by changing `etcd.spec.hibernation.enabled` to `true`:

Copy link
Contributor

@shreyas-s-rao shreyas-s-rao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@renormalize @seshachalam-yv thanks a lot for the wonderful proposal, and for detailing out ways to make backups immutable. I have a few comments/questions. PTAL, thanks!

- [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
- [EtcdCopyBackupsTask Implementation](https://github.com/gardener/etcd-druid/pull/544)

## Glossary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if the Glossary was at the beginning, after Summary section. Reason being, it should be easy for a new reader to simply read through the entire document sequentially without having to jump between sections in the document.

Copy link
Contributor Author

@seshachalam-yv seshachalam-yv Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.
We have moved and renamed to terminology after Summary section.


### Overview

We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example of the previous comment, you mention here about WORM model, but have not yet defined it anywhere, and the reader is required to scroll down to the glossary and come back to this section later. This seems slightly non-intuitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1,315 @@
---
title: Immutable ETCD Backups
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently learned that etcd must always be written in lower case characters, ie etcd, and not ETCD. There are apparently certain conventions around the way it is pronounced as well. Can we change all occurrences of ETCD to etcd if that's possible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I've already started doing so. I've been a proponent of this since the start.

For future reference, could you link the resource where you read these guidelines for me and the other maintainers as well? Thanks.


This proposal aims to enhance the reliability and integrity of ETCD backups created by `etcd-backup-restore` in ETCD clusters managed by `etcd-druid`, by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, unauthorized modifications to backup data are prevented, ensuring that backups remain intact and accessible for restoration.

The proposed solution relies on `etcd-druid` to manage ETCD backups and handle hibernation processes effectively. It leverages one of the suggested approaches to ensure backups remain immutable over extended periods. It is important to note that using `etcd-backup-restore` standalone may not be sufficient to achieve this functionality end-to-end, as the immutability handling (with respect to hibernation) is specifically managed within `etcd-druid`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hibernation processes doesn't seem to have been defined yet, and will be difficult for a non-Gardener reader to understand this term. I see that it's in the glossary, so maybe you can link to that section, or give some background about this here.


### Overview

We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually recommended to use the term the authors rather than we, to refer to the authors in third person. Can you please make that change everywhere in the proposal?

- **Failed Snapshot Before Hibernation:**

- **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
- **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior if full snapshot fails after repeated retries? Will this infinitely block etcd cluster scale-down? Or would you propose to allow it to happen, but update Etcd.status.lastErrors and have an operator look into it? If case 2, then the extend-immutability approach might not work, since it only copies the latest full snapshot and not the latest set of snapshots. Would that need to be changed then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior if full snapshot fails after repeated retries? Will this infinitely block etcd cluster scale-down?

Yes.


- **Review Retention Policies:**

- Set `maxBackups` and `maxBackupAge` in the `EtcdCopyBackupsTask` to manage storage utilization effectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might be insufficient. I would instead propose to introduce a new flag like latest-snapshot-set-only to tell the copy command to only copy over the latest set of snapshots, and not be bothered about the snapshot age or number of snapshots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned #884 (comment), we are not going to use copy command.


#### Considerations for Object-Level Immutability

Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate on these possible scenarios? Would be good to have an idea of what you have already thought of as a use case for object-level immutability, and why you feel it's less practical to support such use cases with object-level immutability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with object-level immutability all you need is to extend the immutability for the latest full snapshot. You do not really need to copy the full snapshot which is currently a hacky solution put in place because we are not going directly to object-level immutability policy.


#### Conclusion

Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make the recommendation in bold, so that it stands out as your final expert opinion ;)


Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.

##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this comparison section!

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 3, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 3, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 4, 2024
- Improve readability and clarity of the summary and motivation sections.
- Add detailed terminology definitions.
- Refine the proposal to focus on bucket-level immutability.
@gardener-robot gardener-robot added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed size/l Size of pull request is large (see gardener-robot robot/bots/size.py) labels Dec 5, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 5, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 5, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 5, 2024

## Summary

Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.
`etcd-druid` provisions etcd clusters and manage their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.```


Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.

All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is not required. The intent is to protect the integrity of backups once uploaded for a defined period of time as these represent an etcd cluster state at a given time and enables restoration. This can be done in many different ways therefore talking about one particular Secret only will not be complete. But then again your intent here is not to list all possible attack vectors which can comprise the backups and therefore i would totally avoid mentioning this here.


All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations.

To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.
Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing WORM (write-once-read-many) protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.```


To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.

This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [*immutable*](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This para is not required if you are ok with the change suggested in the above para.


This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [*immutable*](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications.

`etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality currently achieved by modifying or deleting backups, but without actually modifying or deleting these backups, since they will now be immutable for a set duration. This approach eliminates the possibility of potential data loss. `etcd-druid` will provide an end-to-end solution for achieving this functionality, as relying solely on `etcd-backup-restore` is insufficient given the scope and possible approaches to achieving this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This para is not at all clear to me.


#### Considerations for Object-Level Immutability

Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with object-level immutability all you need is to extend the immutability for the latest full snapshot. You do not really need to copy the full snapshot which is currently a hacky solution put in place because we are not going directly to object-level immutability policy.


Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.

- Enabling object-level immutability requires bucket-level immutability to be set first (applicable in S3 and ABS). In GCS, the capability to enable object-level immutability on an existing bucket is not available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also mention the plan/date by which GCS is going to enable object level immutability. If there is an issue open please provide a link here.


**Disadvantages:**

- **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this point a bit more? What is not universal?

**Disadvantages:**

- **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
- **Increased Complexity:** Requires additional logic in backup processes and tooling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the increase in complexity as compared to what we are already introducing for bucket level immutability policies?


- **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
- **Increased Complexity:** Requires additional logic in backup processes and tooling.
- **Prerequisites:** Some providers require bucket-level immutability to be set first.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a con?

@unmarshall
Copy link
Contributor

Thanks for the PR @seshachalam-yv and @renormalize. I have added comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backup Backup related area/compliance Compliance related area/disaster-recovery Disaster recovery related area/security Security related area/storage Storage related kind/enhancement Enhancement, improvement, extension needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/m Size of pull request is medium (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.