
velero_backup_last_status indicates failed backup, while in reality the backup was successful #6809

Closed
vvanouytsel opened this issue Sep 12, 2023 · 15 comments · Fixed by #6838
Labels: Metrics (Related to prometheus metrics)

@vvanouytsel

What steps did you take and what happened:

We have a Schedule named velero-daily that takes a backup every night. When we deployed a new AMI to our cluster, all our nodes were replaced with the new AMI version, which meant that the velero pod moved to a new node. After that we noticed that the velero_backup_last_status{schedule="velero-daily"} metric indicated that the last backup had failed, while in reality the backup was successful.

What did you expect to happen:
I expected the velero_backup_last_status metric to output a 1 instead of a 0, because the backup was successful.

When querying the velero_backup_last_status metric, we get a 0 indicating that the backup failed, which triggers our alert.
Note, however, that velero_backup_last_successful_timestamp shows that the backup was actually successful, so the two metrics contradict each other.

# HELP velero_backup_last_status Last status of the backup. A value of 1 is success, 0 is failure
# TYPE velero_backup_last_status gauge
velero_backup_last_status{schedule=""} 0
velero_backup_last_status{schedule="velero-daily"} 0

# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09

The CLI shows that the backup has completed with no errors/warnings.

❯ velero get backup
NAME                          STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
velero-daily-20230912020053   Completed   0        0          2023-09-12 04:00:53 +0200 CEST   29d       default            <none>

The Backup object also shows that the backup was taken successfully.

❯ velero get backup -o yaml velero-daily-20230912020053
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{"helm.sh/hook":"post-install,post-upgrade,post-rollback","helm.sh/hook-delete-policy":"before-hook-creation"},"labels":{"app.kubernetes.io/instance":"velero","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"velero","argocd.argoproj.io/instance":"velero","helm.sh/chart":"velero-4.4.1"},"name":"velero-daily","namespace":"velero"},"spec":{"schedule":"0 2 * * *","template":{"defaultVolumesToFsBackup":false,"storageLocation":"default","ttl":"720h","volumeSnapshotLocations":["default"]}}}
    velero.io/source-cluster-k8s-gitversion: v1.27.4-eks-2d98532
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 27+
  creationTimestamp: "2023-09-12T02:00:53Z"
  generation: 24
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    argocd.argoproj.io/instance: velero
    helm.sh/chart: velero-4.4.1
    touch: me
    velero.io/schedule-name: velero-daily
    velero.io/storage-location: default
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:velero.io/source-cluster-k8s-gitversion: {}
          f:velero.io/source-cluster-k8s-major-version: {}
          f:velero.io/source-cluster-k8s-minor-version: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
          f:velero.io/schedule-name: {}
          f:velero.io/storage-location: {}
      f:spec:
        .: {}
        f:csiSnapshotTimeout: {}
        f:defaultVolumesToFsBackup: {}
        f:hooks: {}
        f:itemOperationTimeout: {}
        f:metadata: {}
        f:storageLocation: {}
        f:ttl: {}
        f:volumeSnapshotLocations: {}
      f:status:
        .: {}
        f:completionTimestamp: {}
        f:expiration: {}
        f:formatVersion: {}
        f:phase: {}
        f:progress:
          .: {}
          f:itemsBackedUp: {}
          f:totalItems: {}
        f:startTimestamp: {}
        f:version: {}
        f:volumeSnapshotsAttempted: {}
        f:volumeSnapshotsCompleted: {}
    manager: velero-server
    operation: Update
    time: "2023-09-12T02:01:27Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:touch: {}
    manager: kubectl-edit
    operation: Update
    time: "2023-09-12T07:34:28Z"
  name: velero-daily-20230912020053
  namespace: velero
  resourceVersion: "24185233"
  uid: 4bb4508b-30df-4bc5-8f22-db4af3de0938
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  hooks: {}
  itemOperationTimeout: 1h0m0s
  metadata: {}
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2023-09-12T02:01:27Z"
  expiration: "2023-10-12T02:00:53Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 2103
    totalItems: 2103
  startTimestamp: "2023-09-12T02:00:53Z"
  version: 1
  volumeSnapshotsAttempted: 19
  volumeSnapshotsCompleted: 19

Environment:

  • Velero version (use velero version):
❯ velero version
Client:
	Version: v1.11.1
	Git commit: bdbe7eb242b0f64d5b04a7fea86d1edbb3a3587c
Server:
	Version: v1.11.1
  • Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET> 
  • Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4-eks-2d98532
  • Cloud provider or hardware configuration:
    EKS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@vvanouytsel
Author

velero debug --backup velero-daily-20230912020053 resulted in the following tar.gz file:
bundle-2023-09-12-10-33-35.tar.gz

@Ahmad-Faizan
Contributor

This issue is happening to us as well. The velero_backup_last_status metric does not get updated even when the backups are successful.
The behavior is erratic, however.

@ndriessen

We are also running into this. It is very annoying, as we now get a lot of false alerts on our backups.

@yanggangtony
Contributor

yanggangtony commented Sep 13, 2023

@vvanouytsel
Hi, I searched the logs but could not find the key point about this issue.
Could you also post the schedule_controller.go logs?

The reason I ask: velero_backup_last_status is initialized to 0 in the schedule controller loop,
via c.metrics.InitSchedule(schedule.Name).
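
For context, here is a minimal sketch of the initialization pattern being described, using prometheus/client_golang. The gauge variable and the InitSchedule body are illustrative assumptions based on this thread, not the exact Velero source:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// backupLastStatus mirrors the velero_backup_last_status gauge discussed
// above: 1 means the last backup succeeded, 0 means it failed.
// (Illustrative sketch; the real gauge lives inside Velero's metrics type.)
var backupLastStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "velero_backup_last_status",
		Help: "Last status of the backup. A value of 1 is success, 0 is failure",
	},
	[]string{"schedule"},
)

// InitSchedule seeds the per-schedule series. If the seed value is 0, every
// restart of the velero server re-creates the series as "failed" until the
// next backup for that schedule completes and overwrites it.
func InitSchedule(scheduleName string) {
	backupLastStatus.WithLabelValues(scheduleName).Set(0)
}

This would explain why the metric reads 0 right after the pod moves to a new node, even though the most recent backup succeeded.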

@yanggangtony
Copy link
Contributor

I also found that the creation times of the generated backups are odd.
See the file you posted: velero-bundle/kubecapture/velero.io_v1/velero/backups-202309121033.1507.json.

The backups:

"apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-14T02:00:38Z",
                "generation": 30,
                "labels": {
 "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-14T06:48:09Z",
                "generation": 27,
  "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-24T02:00:10Z",
                "generation": 24,
    "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-24T09:53:02Z",
                "generation": 24,

@yanggangtony
Copy link
Contributor

These are not the usual cron times that the Schedule defines:

"apiVersion": "velero.io/v1",
            "kind": "Schedule",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                },
                "creationTimestamp": "2023-08-22T09:02:15Z",
                "generation": 23,
                "labels": {
                    "app.kubernetes.io/instance": "velero",
                    "app.kubernetes.io/managed-by": "Helm",
                    "app.kubernetes.io/name": "velero",
                    "argocd.argoproj.io/instance": "velero",
                    "helm.sh/chart": "velero-4.4.1"
                },
                "name": "velero-daily",
                "namespace": "velero",
                "resourceVersion": "24020310",
                "uid": "bd0844a7-e9b7-4130-851e-60d656e6ab38"
            },
            "spec": {
                "schedule": "0 2 * * *",
                "template": {
                    "defaultVolumesToFsBackup": false,
                    "storageLocation": "default",
                    "ttl": "720h",
                    "volumeSnapshotLocations": [
                        "default"
                    ]
                }
            },
            "status": {
                "lastBackup": "2023-09-12T02:00:53Z",
                "phase": "Enabled"
            }

@Lyndon-Li added the Metrics (Related to prometheus metrics) label Sep 18, 2023
@vvanouytsel
Author

@yanggangtony , unfortunately I do not have access to those logs anymore.

The weird thing is that velero_backup_last_successful_timestamp points to a backup with timestamp 1.694484087e+09, which is Tue Sep 12 2023 02:01:27 GMT+0000. So according to velero_backup_last_successful_timestamp, the Backup CRD itself, and the CLI, that backup was successful. It is only the velero_backup_last_status metric that claims the backup was not successful.

# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09

@allenxu404
Contributor

The default value of the velero_backup_last_status metric is 0, which indicates failure. Whenever the code exits without updating the metric, it therefore reports failure.

A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and only change the value to 0 if the code exits with a failure.
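
A minimal sketch of that proposal, reusing the hypothetical backupLastStatus gauge from the sketch earlier in this thread (function names are illustrative, not the exact Velero code):

// InitSchedule now seeds the series optimistically: a freshly (re)started
// server reports success for the schedule until a backup actually fails.
func InitSchedule(scheduleName string) {
	backupLastStatus.WithLabelValues(scheduleName).Set(1)
}

// RecordBackupResult is a hypothetical hook called when a backup for the
// schedule finishes; it overwrites the seeded value with the real outcome.
func RecordBackupResult(scheduleName string, succeeded bool) {
	value := 0.0
	if succeeded {
		value = 1.0
	}
	backupLastStatus.WithLabelValues(scheduleName).Set(value)
}

The trade-off, raised later in this thread, is that the metric then reads 1 even for a schedule whose backups have never started.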

@vvanouytsel
Author

@allenxu404 you might actually be on to something.
The problem I described only occurred after a restart of the velero pod.

@yanggangtony
Contributor

> The default value of the velero_backup_last_status metric is 0, which indicates failure. Whenever the code exits without updating the metric, it therefore reports failure.
>
> A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and only change the value to 0 if the code exits with a failure.

In the normal flow the code is right: the schedule creates a Backup CR, the Backup CR is watched, the backup runs, and its status is reported. Changing the default value to 1 would reduce the potential for this issue.

How about we just change the default value from 0 to 1?

@yanggangtony
Contributor

@allenxu404
What do you think about changing the default value?
If that is OK, I can take on the work for the change.

@yanggangtony
Contributor

But another issue may occur: the velero_backup_last_status metric will always be 1 if the backup never begins.

@yanggangtony
Contributor

yanggangtony commented Sep 18, 2023

> The problem I described only occurred after a restart of the velero pod.

@vvanouytsel
So that is the clue behind what I posted before about why the logs look so strange: #6809 (comment)

@allenxu404
Contributor

After briefly examining the code, it appears that the velero_backup_last_status metric is being updated correctly in all relevant sections. Therefore, the issue seems to arise when the code exits without updating the metric.

@yanggangtony Yes, I think it's better to change the default value to 1, indicating success. And it doesn't hurt that the metric indicates success when no backup has been initiated.

@yanggangtony
Contributor

@allenxu404
Thanks for re-confirming the code.
I will take the issue and change the default value to 1.
