
K8SPSMDB-1219: PBM multi storage support #1843

Draft · wants to merge 1 commit into base: main
Conversation

@egegunes (Contributor) commented on Feb 24, 2025

K8SPSMDB-1219

The operator has always supported multiple storages, but PBM didn't have native
support for multiple backup storages until v2.6.0. The operator reconfigured PBM
every time the user selected a storage for their backups/restores different from
the previous one. This caused long wait periods, especially for storages with
many backups, due to the resync operation.

Another limitation was forcing users to have only one backup storage if they
wanted to enable point-in-time recovery. With multiple storages, PBM would
upload oplog chunks to whichever storage was last used by a backup/restore,
which made consistent recovery impossible.

PBM v2.6.0 added native support for multiple storages, and these changes
introduce it to our operator:

* The user can have one main storage in the PBM configuration. Any other
  storages can be added as profiles.

The main storage can be inspected with:
```
kubectl exec cluster1-rs0-0 -c backup-agent -- pbm config
```

This commit introduces a new field `main` in the storage spec:
```
storages:
  s3-us-west:
    main: true
    type: s3
    s3:
      bucket: operator-testing
      credentialsSecret: cluster1-s3-secrets
      region: us-west-2
```

If the user has only one storage configured in `cr.yaml`, the operator will
automatically use it as the main storage. If more than one storage is
configured, exactly one of them must have `main: true`; the user can't have
more than one storage with `main: true`.
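
A minimal sketch of how such a validation could look (illustrative only; the
stand-in type below contains just the relevant field and the function is not
the operator's actual code):
```go
package sketch

import "errors"

// BackupStorageSpec is a simplified stand-in for the operator's storage spec;
// only the field relevant to this sketch is included.
type BackupStorageSpec struct {
	Main bool
}

// validateMainStorage sketches the rule described above: a single storage is
// implicitly the main one; with more than one, exactly one must be main.
func validateMainStorage(storages map[string]BackupStorageSpec) error {
	if len(storages) <= 1 {
		return nil
	}
	mainCount := 0
	for _, stg := range storages {
		if stg.Main {
			mainCount++
		}
	}
	if mainCount == 0 {
		return errors.New("main backup storage is not specified")
	}
	if mainCount > 1 {
		return errors.New("only one storage can be marked as main")
	}
	return nil
}
```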

If the user changes the main storage in `cr.yaml`, the operator will configure
PBM with the new storage and start a resync.

Any other storage in `cr.yaml` will be added to PBM as a profile.

The user can list profiles using the CLI:
```
kubectl exec cluster1-rs0-0 -c backup-agent -- pbm profile list
```

When the user adds a new profile to `cr.yaml`, the operator will add it to PBM
but won't start a resync.

**`pbm config --force-resync` only starts a resync for the main storage.**

To manually resync a profile:
```
kubectl exec cluster1-rs0-0 -c backup-agent -- pbm profile sync <storage-name>
```

If the user starts a restore using a backup stored in a storage configured as
a profile, the operator will start a resync operation for that profile and
block the restore until the resync finishes.

Note: Profiles are also called external storages in the PBM documentation.

If the user has multiple storages in `cr.yaml` and changes the main storage
between them, the operator (see the sketch after this list):
1. configures PBM with the new main storage,
2. adds the old main as a profile,
3. deletes the profile for the new main storage.
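
A rough sketch of that ordering, using a hypothetical `pbmClient` interface
(this is not the PBM SDK; it only illustrates the sequence of operations):
```go
package sketch

// pbmClient is a hypothetical interface used only to illustrate the ordering
// of operations when the main storage changes.
type pbmClient interface {
	SetMainStorage(name string) error
	AddProfile(name string) error
	RemoveProfile(name string) error
}

// switchMainStorage mirrors the three steps listed above.
func switchMainStorage(c pbmClient, oldMain, newMain string) error {
	if err := c.SetMainStorage(newMain); err != nil { // 1. configure PBM with the new main storage
		return err
	}
	if err := c.AddProfile(oldMain); err != nil { // 2. add the old main as a profile
		return err
	}
	return c.RemoveProfile(newMain) // 3. delete the profile for the new main storage
}
```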

If the user configures `backupSource` in backups/restores:
* If `cr.yaml` has no storages configured, the operator configures PBM with
  the storage data in the `backupSource` field. This storage will effectively
  be the main storage until the user adds a storage to `cr.yaml`. Once a
  storage is configured, the PBM configuration will be overwritten and the
  `backupSource` storage will be gone.
* If `cr.yaml` has a storage configured, the operator adds the `backupSource`
  storage as a profile.

* Oplog chunks will only be uploaded to the main storage.

The user can use any backup as the base backup for point-in-time recovery.

* Incremental backup chains all need to be stored in the same storage.

TBD after https://github.com/percona/percona-server-mongodb-operator/pull/1836 is merged.

---

Other significant changes in operator behavior:

* The operator now automatically configures PBM on a fresh cluster.

  Before these changes, PBM was not configured until the user started a
  backup/restore after deploying a fresh cluster. Now, PBM will be configured
  directly with the main storage in `cr.yaml`, and a resync will be started
  in the background.

  There's a new field in the `PerconaServerMongoDB` status:
  `backupConfigHash`. The operator maintains a hash of the current PBM
  configuration and reconfigures PBM if the hash changes. Fields in
  `spec.backup.pitr` are excluded from the hash calculation; they're handled
  separately (see the sketch after this list).
* If `PerconaServerMongoDB` is annotated with `percona.com/resync-pbm=true`,
  the operator will start resync operations both for the main storage and for
  profiles. Resyncs for profiles are started with the equivalent of
  `pbm profile sync --all`. These resync operations run in the background and
  do not block reconciliation.
* If a backup that has the `percona.com/delete-backup` finalizer is deleted,
  the operator will only delete oplog chunks if the backup is stored in the
  main storage.
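
As a rough illustration of the `backupConfigHash` idea, here is a minimal
sketch that assumes the hash is taken over the serialized backup spec with the
PITR section zeroed out; the type names and fields below are simplified
stand-ins, not the operator's actual definitions:
```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// Simplified stand-in types; the operator's real specs have many more fields.
type PITRSpec struct {
	Enabled bool `json:"enabled"`
}

type BackupSpec struct {
	PITR     PITRSpec          `json:"pitr"`
	Storages map[string]string `json:"storages"`
}

// backupConfigHash sketches one way the hash could be computed: zero out the
// PITR section (it is handled separately) and hash the JSON-encoded spec.
func backupConfigHash(spec BackupSpec) (string, error) {
	spec.PITR = PITRSpec{} // exclude spec.backup.pitr from the calculation
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}
```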

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label Feb 24, 2025
@hors hors added this to the v1.20.0 milestone Feb 25, 2025
Comment on lines 553 to 568
if len(cr.Spec.Backup.Storages) > 1 {
	mainFound := false
	for _, stg := range cr.Spec.Backup.Storages {
		if stg.Main {
			mainFound = true
		}
	}

	if !mainFound {
		return errors.New("main backup storage is not specified")
	}
}
Contributor:
Since we can have multiple storages configured, this also means that we can have multiple storages configured as main. I think we can increase the scope of this validation and ensure that only one main storage is defined.

@@ -35,6 +35,7 @@ import (
var (
GitCommit string
GitBranch string
BuildTime string
Contributor:
Out of curiosity, why is build time needed for our log?

Contributor (author):
For consistency. We have this in the PXC operator, and it's useful in case of problems with image caching.

return name, stg, nil
}

return name, stg, errors.New("main storage not found")
Contributor:
We can write this as `return "", BackupStorageSpec{}, errors.New("main storage not found")` and drop the vars at the start of the function completely. It is more idiomatic.

AnnotationPVCResizeInProgress = "percona.com/pvc-resize-in-progress"
)

func (cr *PerconaServerMongoDB) PBMResyncNeeded() bool {
Contributor:
It is totally fine as is, but if we want to be super bulletproof, we can do this

func (cr *PerconaServerMongoDB) PBMResyncNeeded() bool {
	v, exists := cr.Annotations[AnnotationResyncPBM]
	return exists && v != ""
}

@@ -299,11 +299,20 @@ func (r *ReconcilePerconaServerMongoDB) rsStatus(ctx context.Context, cr *api.Pe
}

for _, pod := range list.Items {
if pod.DeletionTimestamp != nil {
Contributor:
Recommend we use `.IsZero()` for similar checks.
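
For reference, a minimal example of the suggested check (`metav1.Time` treats a nil pointer as zero, so this also covers `DeletionTimestamp == nil`):
```go
package sketch

import corev1 "k8s.io/api/core/v1"

// isTerminating reports whether the pod has a deletion timestamp set.
// metav1.Time's IsZero is nil-safe, so the nil case is handled as well.
func isTerminating(pod corev1.Pod) bool {
	return !pod.DeletionTimestamp.IsZero()
}
```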

@@ -1052,6 +1054,27 @@ func (b BackupSpec) IsEnabledPITR() bool {
return b.PITR.Enabled
}

func (b BackupSpec) MainStorage() (string, BackupStorageSpec, error) {
Contributor:
I would love it if we added a unit test for this function, a really basic one, mainly because getting the right main storage is critical not to break, and I wouldn't like us to rely on tests that somehow verify this together with other logic.
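
A basic table-driven test along these lines could look like this sketch. It assumes the `MainStorage()` signature shown above, that it lives in the `v1` API package together with `BackupSpec`, and that a single configured storage is implicitly treated as the main one, as stated in the PR description:
```go
package v1 // assumed: the package that defines BackupSpec and BackupStorageSpec

import "testing"

func TestMainStorage(t *testing.T) {
	tests := map[string]struct {
		spec     BackupSpec
		wantName string
		wantErr  bool
	}{
		"single storage": {
			spec: BackupSpec{Storages: map[string]BackupStorageSpec{
				"storage-1": {Type: BackupStorageS3},
			}},
			wantName: "storage-1",
		},
		"multiple storages, one main": {
			spec: BackupSpec{Storages: map[string]BackupStorageSpec{
				"storage-1": {Type: BackupStorageS3},
				"storage-2": {Type: BackupStorageS3, Main: true},
			}},
			wantName: "storage-2",
		},
		"multiple storages, no main": {
			spec: BackupSpec{Storages: map[string]BackupStorageSpec{
				"storage-1": {Type: BackupStorageS3},
				"storage-2": {Type: BackupStorageS3},
			}},
			wantErr: true,
		},
	}

	for name, tt := range tests {
		t.Run(name, func(t *testing.T) {
			gotName, _, err := tt.spec.MainStorage()
			if tt.wantErr {
				if err == nil {
					t.Fatal("expected error, got nil")
				}
				return
			}
			if err != nil {
				t.Fatalf("unexpected error: %v", err)
			}
			if gotName != tt.wantName {
				t.Fatalf("main storage = %q, want %q", gotName, tt.wantName)
			}
		})
	}
}
```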

Comment on lines 173 to 177
pbm, err := backup.NewPBM(ctx, r.client, cr)
if err != nil {
	return errors.Wrap(err, "create pbm object")
}
defer pbm.Close(ctx)
@gkech (Contributor) commented on Feb 25, 2025:
Since `pbm` is used in this scope for the first time here:

	if err := enablePiTRIfNeeded(ctx, pbm, cr); err != nil {
		return errors.Wrap(err, "enable pitr if needed")
	}

We can save some API calls, if the full backup logic needs to return, by moving the `pbm` creation after the if clause.

}
}

if err := enablePiTRIfNeeded(ctx, pbm, cr); err != nil {
Contributor:
🙌🏽 Nice, I like these new functions.


// running in separate goroutine to not block reconciliation
// until all resync operations finished
go func() {
@gkech (Contributor) commented on Feb 25, 2025:
Here we pass the cr object, but it may change in another reconciliation loop, right? I think we should handle this scenario with some form of syncing or queuing, i.e. if multiple loops occur in a short amount of time.

Contributor:
But maybe if a resync is running, new reconciliations won't start another routine.

func (b *pbmC) ResyncMainStorageAndWait(ctx context.Context) error {
	if err := b.ResyncMainStorage(ctx); err != nil {
		return errors.Wrap(err, "start resync")
	}
Contributor:
Just curious, would it be helpful to add a log message like "start main storage resync" here?

ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()

log.Info("waiting for resync to start")
@nmarukovich (Contributor) commented on Feb 26, 2025:
Would it be helpful to users if we added the time that we wait? Does it make sense in this case to make timeouts configurable?

"single storage": {
spec: BackupSpec{
Storages: map[string]BackupStorageSpec{
"storage-1": BackupStorageSpec{
Contributor:
[gofmt] reported by reviewdog 🐶

Suggested change
"storage-1": BackupStorageSpec{
"storage-1": {

"multiple storages": {
spec: BackupSpec{
Storages: map[string]BackupStorageSpec{
"storage-1": BackupStorageSpec{
Contributor:
[gofmt] reported by reviewdog 🐶

Suggested change
"storage-1": BackupStorageSpec{
"storage-1": {

Type: BackupStorageS3,
S3: BackupStorageS3Spec{},
},
"storage-2": BackupStorageSpec{
Contributor:
[gofmt] reported by reviewdog 🐶

Suggested change
"storage-2": BackupStorageSpec{
"storage-2": {

Type: BackupStorageS3,
S3: BackupStorageS3Spec{},
},
"storage-3": BackupStorageSpec{
Contributor:
[gofmt] reported by reviewdog 🐶

Suggested change
"storage-3": BackupStorageSpec{
"storage-3": {

logf "sigs.k8s.io/controller-runtime/pkg/log"

"github.com/percona/percona-backup-mongodb/pbm/config"
psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
Contributor:
[goimports-reviser] reported by reviewdog 🐶

Suggested change
psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"


destroy $namespace
log "test passed"

Contributor:
[shfmt] reported by reviewdog 🐶

Suggested change


run_mongo "use ${database}\n db.${collection}.${command}()" "$uri" "mongodb" "$suffix" \
local full_command="db.${collection}.${command}()"
if [[ ! -z ${sort} ]]; then
Contributor:
[shfmt] reported by reviewdog 🐶

Suggested change
if [[ ! -z ${sort} ]]; then
if [[ -n ${sort} ]]; then

@JNKPercona (Collaborator):
Test name Status
arbiter passed
balancer passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup failure
demand-backup-fs passed
demand-backup-eks-credentials-irsa passed
demand-backup-physical passed
demand-backup-physical-sharded failure
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade failure
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service failure
multi-storage passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users failure
version-service passed
We ran 56 out of 56.

commit: 15256b4
image: perconalab/percona-server-mongodb-operator:PR-1843-15256b42
