velero pod restarts on backup of cluster resources #3257
-
@carlisia @alaypatel07 Any ideas on how to resolve this issue?
-
Summary: Velero running on OpenShift 4.5 or 4.6. Backups that exclude secrets work. Backups that exclude nothing get stuck in InProgress and cause the Velero pod to restart.
-
@ghost74-tg What are you using for your volume storage? Is it AWS's EBS, or something else?
-
FYI: The issue shows up both with the vanilla version of Velero and with Velero installed through the OADP operator, including the OpenShift plugins.
-
@nrb We configured Velero to use a MinIO server with a volume from an external Ceph cluster. Provisioning of the volume worked fine and we are able to back up namespace-scoped resources. Only backups of cluster-scoped resources cause the Velero pod to restart.
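For reference, a minimal sketch of what such a setup typically looks like as a Velero BackupStorageLocation pointing at a MinIO endpoint (the bucket name and s3Url below are placeholders, not taken from this cluster):
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: spp-velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups        # placeholder bucket name
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://minio.example.local:9000   # placeholder MinIO endpoint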
-
Any ideas on how we could proceed with this issue?
-
Running into the same issue on another OCP 4.6.1 cluster with OADP 0.2.0 / Velero 1.5.2 installed in the namespace "spp-velero". Issuing the command:
Watching the Velero pod shows that it crashes/restarts about 30 seconds later:
Attached are the logs of the Velero pod with debug enabled. I was not able to retrieve the logs of the actual backup, though:
-
Thanks to a hint from @dymurray it turned out that Velero was crashing because it was running out of memory. The default limit of 256Mi is simply too low to process all the objects of the whole cluster. Adding this to the Velero Deployment YAML (via the OADP operator) did the trick:
velero_resource_allocation:
So doubling the memory limit to 512Mi is sufficient to keep the Velero container from running out of memory. It would probably be a good idea to raise the default to 512Mi, too?
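A rough sketch of what that velero_resource_allocation snippet might look like in the OADP-managed Velero CR (the field nesting and the CPU values are assumptions; only the 512Mi memory limit comes from this thread):
velero_resource_allocation:
  limits:
    cpu: "1"          # assumed value, not from this thread
    memory: 512Mi     # doubled from the 256Mi default
  requests:
    cpu: 500m         # assumed value, not from this thread
    memory: 256Mi
For a non-OADP install, the same limit can be set at install time with velero install --velero-pod-mem-limit 512Mi, or by editing the resources of the velero Deployment directly.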
-
What steps did you take and what happened:
[A clear and concise description of what the bug is, and what commands you ran.]
velero create backup mybackup-1 -n spp-velero
The backup remains in the status "InProgress" and never reaches the Completed state.
When monitoring the pod we see that it is restarted during the backup, which likely prevents the backup from ever completing; a quick way to check the restart reason is sketched below.
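One way to confirm why the pod is restarting (a sketch; <velero-pod-name> is a placeholder for the actual pod name) is to check the container's last termination state:
oc get pods -n spp-velero
oc describe pod <velero-pod-name> -n spp-velero
# A last state of "Terminated" with "Reason: OOMKilled" indicates the memory limit was hit.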
What did you expect to happen:
The Velero backup of cluster resources should end in the Completed state.
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
oc logs -p deployment/velero -n spp-velero
time="2020-11-05T18:44:10Z" level=info msg="Setting up backup log" backup=spp-velero/mybackup-1 controller=backup logSource="pkg/controller/backup_controller.go:512"
time="2020-11-05T18:44:10Z" level=info msg="Setting up backup temp file" backup=spp-velero/mybackup-1 logSource="pkg/controller/backup_controller.go:534"
time="2020-11-05T18:44:10Z" level=info msg="Setting up plugin manager" backup=spp-velero/mybackup-1 logSource="pkg/controller/backup_controller.go:541"
time="2020-11-05T18:44:10Z" level=info msg="Getting backup item actions" backup=spp-velero/mybackup-1 logSource="pkg/controller/backup_controller.go:545"
time="2020-11-05T18:44:10Z" level=info msg="Setting up backup store to check for backup existence" backup=spp-velero/mybackup-1 logSource="pkg/controller/backup_controller.go:551"
time="2020-11-05T18:44:10Z" level=info msg="Writing backup version file" backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:236"
time="2020-11-05T18:44:10Z" level=info msg="Including namespaces: *" backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:242"
time="2020-11-05T18:44:10Z" level=info msg="Excluding namespaces: " backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:243"
time="2020-11-05T18:44:10Z" level=info msg="Including resources: *" backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:246"
time="2020-11-05T18:44:10Z" level=info msg="Excluding resources: " backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:247"
time="2020-11-05T18:44:10Z" level=info msg="Backing up all pod volumes using restic: false" backup=spp-velero/mybackup-1 logSource="pkg/backup/backup.go:248"
time="2020-11-05T18:44:23Z" level=info msg="Getting items for group" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:76"
time="2020-11-05T18:44:23Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=pods
time="2020-11-05T18:44:23Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=pods
time="2020-11-05T18:44:24Z" level=info msg="Retrieved 215 items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=pods
time="2020-11-05T18:44:24Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=persistentvolumeclaims
time="2020-11-05T18:44:24Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=persistentvolumeclaims
time="2020-11-05T18:44:24Z" level=info msg="Retrieved 3 items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=persistentvolumeclaims
time="2020-11-05T18:44:24Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=persistentvolumes
time="2020-11-05T18:44:24Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=persistentvolumes
time="2020-11-05T18:44:24Z" level=info msg="Retrieved 3 items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=persistentvolumes
time="2020-11-05T18:44:24Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=namespaces
time="2020-11-05T18:44:24Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=namespaces
time="2020-11-05T18:44:24Z" level=info msg="Retrieved 64 items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=namespaces
time="2020-11-05T18:44:24Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=events
time="2020-11-05T18:44:24Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=events
time="2020-11-05T18:44:24Z" level=info msg="Retrieved 2148 items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=events
time="2020-11-05T18:44:25Z" level=info msg="Getting items for resource" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=secrets
time="2020-11-05T18:44:25Z" level=info msg="Listing items" backup=spp-velero/mybackup-1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=secrets
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup get -n spp-velero
NAME         STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
mybackup-1   InProgress   0        0          2020-11-05 19:44:10 +0100 CET   29d       default
[root@nevada16 install]# velero backup describe mybackup-1 -n spp-velero
Name: mybackup-1
Namespace: spp-velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.18.3+2fbd7c7
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=18+
Phase: InProgress
Errors: 0
Warnings: 0
Namespaces:
Included: *
Excluded:
Resources:
Included: *
Excluded:
Cluster-scoped: auto
Label selector:
Storage Location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hooks:
Backup Format Version: 1.1.0
Started: 2020-11-05 19:44:10 +0100 CET
Completed: <n/a>
Expiration: 2020-12-05 19:44:10 +0100 CET
Velero-Native Snapshots:
velero backup logs <backupname>
velero backup logs mybackup-1 -n spp-velero
Logs for backup "mybackup-1" are not available until it's finished processing. Please wait until the backup has a phase of Completed or Failed and try again.
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
The pod always restarts during or right after retrieving the secrets resource. This is reproducible.
A backup that excludes secrets works fine:
velero create backup mybackup-2 --exclude-resources secrets -n spp-velero
velero get backup -n spp-velero
NAME         STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
mybackup-1   InProgress   0        0          2020-11-05 19:44:10 +0100 CET   29d       default
mybackup-2   Completed    0        1          2020-11-05 19:48:35 +0100 CET   29d       default
Environment:
OCP 4.5.15, we see the same behavior on 4.5.6 and 4.6.1
Velero version (use velero version): 1.5.2 and 1.4.3
Velero features (use velero client config get features):
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2-0-g52c56ce", GitCommit:"b66f2d3a6893be729f1b8660309a59c6e0b69196", GitTreeState:"clean", BuildDate:"2020-08-10T04:49:09Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+2fbd7c7", GitCommit:"2fbd7c7", GitTreeState:"clean", BuildDate:"2020-10-09T11:41:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): CoreOS 4.5