
velero_backup_last_status indicates failed backup, while in reality the backup was successful #6809

Closed
vvanouytsel opened this issue Sep 12, 2023 · 15 comments · Fixed by #6838
Labels: Metrics (Related to prometheus metrics)

@vvanouytsel

What steps did you take and what happened:

We have a Schedule named velero-daily that takes a backup every night. When we deployed a new AMI to our cluster, all our nodes were replaced with the new AMI version, which meant that the velero pod moved to a new node. After that we noticed that the velero_backup_last_status{schedule="velero-daily"} metric indicated that the last backup had failed, while in reality the backup was successful.

What did you expect to happen:
I expected the velero_backup_last_status metric to output a 1 instead of a 0, because the backup was successful.

When querying the velero_backup_last_status metric, we get a 0 indicating that the backup failed, which triggers our alert.
Note, however, that velero_backup_last_successful_timestamp shows that the backup was actually successful, so the two metrics contradict each other.

# HELP velero_backup_last_status Last status of the backup. A value of 1 is success, 0 is failure
# TYPE velero_backup_last_status gauge
velero_backup_last_status{schedule=""} 0
velero_backup_last_status{schedule="velero-daily"} 0

# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09

The CLI shows that the backup has completed with no errors/warnings.

❯ velero get backup
NAME                          STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
velero-daily-20230912020053   Completed   0        0          2023-09-12 04:00:53 +0200 CEST   29d       default            <none>

The Backup object also shows that the backup was taken successfully.

❯ velero get backup -o yaml velero-daily-20230912020053
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{"helm.sh/hook":"post-install,post-upgrade,post-rollback","helm.sh/hook-delete-policy":"before-hook-creation"},"labels":{"app.kubernetes.io/instance":"velero","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"velero","argocd.argoproj.io/instance":"velero","helm.sh/chart":"velero-4.4.1"},"name":"velero-daily","namespace":"velero"},"spec":{"schedule":"0 2 * * *","template":{"defaultVolumesToFsBackup":false,"storageLocation":"default","ttl":"720h","volumeSnapshotLocations":["default"]}}}
    velero.io/source-cluster-k8s-gitversion: v1.27.4-eks-2d98532
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 27+
  creationTimestamp: "2023-09-12T02:00:53Z"
  generation: 24
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    argocd.argoproj.io/instance: velero
    helm.sh/chart: velero-4.4.1
    touch: me
    velero.io/schedule-name: velero-daily
    velero.io/storage-location: default
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:velero.io/source-cluster-k8s-gitversion: {}
          f:velero.io/source-cluster-k8s-major-version: {}
          f:velero.io/source-cluster-k8s-minor-version: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
          f:velero.io/schedule-name: {}
          f:velero.io/storage-location: {}
      f:spec:
        .: {}
        f:csiSnapshotTimeout: {}
        f:defaultVolumesToFsBackup: {}
        f:hooks: {}
        f:itemOperationTimeout: {}
        f:metadata: {}
        f:storageLocation: {}
        f:ttl: {}
        f:volumeSnapshotLocations: {}
      f:status:
        .: {}
        f:completionTimestamp: {}
        f:expiration: {}
        f:formatVersion: {}
        f:phase: {}
        f:progress:
          .: {}
          f:itemsBackedUp: {}
          f:totalItems: {}
        f:startTimestamp: {}
        f:version: {}
        f:volumeSnapshotsAttempted: {}
        f:volumeSnapshotsCompleted: {}
    manager: velero-server
    operation: Update
    time: "2023-09-12T02:01:27Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:touch: {}
    manager: kubectl-edit
    operation: Update
    time: "2023-09-12T07:34:28Z"
  name: velero-daily-20230912020053
  namespace: velero
  resourceVersion: "24185233"
  uid: 4bb4508b-30df-4bc5-8f22-db4af3de0938
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  hooks: {}
  itemOperationTimeout: 1h0m0s
  metadata: {}
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2023-09-12T02:01:27Z"
  expiration: "2023-10-12T02:00:53Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 2103
    totalItems: 2103
  startTimestamp: "2023-09-12T02:00:53Z"
  version: 1
  volumeSnapshotsAttempted: 19
  volumeSnapshotsCompleted: 19

Environment:

  • Velero version (use velero version):
❯ velero version
Client:
	Version: v1.11.1
	Git commit: bdbe7eb242b0f64d5b04a7fea86d1edbb3a3587c
Server:
	Version: v1.11.1
  • Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET> 
  • Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4-eks-2d98532
  • Cloud provider or hardware configuration:
    EKS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@vvanouytsel
Author

velero debug --backup velero-daily-20230912020053 resulted in the following tar.gz file:
bundle-2023-09-12-10-33-35.tar.gz

@Ahmad-Faizan
Contributor

This issue is happening to us as well. The velero_backup_last_status metric does not get updated even when the backups are successful.
The behavior is erratic, however.

@ndriessen

We are also running into this. It is very annoying, as we now get a lot of false alerts on our backups.

@yanggangtony
Contributor

yanggangtony commented Sep 13, 2023

@vvanouytsel
Hi, I searched the logs but could not find the key point about this issue.
Could you also post the schedule_controller.go logs?

The reason I ask: velero_backup_last_status is initialized to 0 in the schedule controller loop,
via c.metrics.InitSchedule(schedule.Name).
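
For context, here is a minimal sketch of the initialization pattern being described, using prometheus/client_golang. The gauge variable and the InitSchedule body are illustrative assumptions based on this thread, not the exact Velero source:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// backupLastStatus mirrors the velero_backup_last_status gauge discussed
// above: 1 means the last backup succeeded, 0 means it failed.
// (Illustrative sketch; the real gauge lives inside Velero's metrics type.)
var backupLastStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "velero_backup_last_status",
		Help: "Last status of the backup. A value of 1 is success, 0 is failure",
	},
	[]string{"schedule"},
)

// InitSchedule seeds the per-schedule series. If the seed value is 0, every
// restart of the velero server re-creates the series as "failed" until the
// next backup for that schedule completes and overwrites it.
func InitSchedule(scheduleName string) {
	backupLastStatus.WithLabelValues(scheduleName).Set(0)
}

This would explain why the metric reads 0 right after the pod moves to a new node, even though the most recent backup succeeded.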

@yanggangtony
Copy link
Contributor

I also found that the creation times of the generated backups are odd.
See the file you posted: velero-bundle/kubecapture/velero.io_v1/velero/backups-202309121033.1507.json.

The backups:

"apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-14T02:00:38Z",
                "generation": 30,
                "labels": {
 "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-14T06:48:09Z",
                "generation": 27,
  "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-24T02:00:10Z",
                "generation": 24,
    "apiVersion": "velero.io/v1",
            "kind": "Backup",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                    "velero.io/source-cluster-k8s-gitversion": "v1.27.4-eks-2d98532",
                    "velero.io/source-cluster-k8s-major-version": "1",
                    "velero.io/source-cluster-k8s-minor-version": "27+"
                },
                "creationTimestamp": "2023-08-24T09:53:02Z",
                "generation": 24,

@yanggangtony
Copy link
Contributor

These are not the usual cron times that the Schedule defines:

"apiVersion": "velero.io/v1",
            "kind": "Schedule",
            "metadata": {
                "annotations": {
                    "helm.sh/hook": "post-install,post-upgrade,post-rollback",
                    "helm.sh/hook-delete-policy": "before-hook-creation",
                },
                "creationTimestamp": "2023-08-22T09:02:15Z",
                "generation": 23,
                "labels": {
                    "app.kubernetes.io/instance": "velero",
                    "app.kubernetes.io/managed-by": "Helm",
                    "app.kubernetes.io/name": "velero",
                    "argocd.argoproj.io/instance": "velero",
                    "helm.sh/chart": "velero-4.4.1"
                },
                "name": "velero-daily",
                "namespace": "velero",
                "resourceVersion": "24020310",
                "uid": "bd0844a7-e9b7-4130-851e-60d656e6ab38"
            },
            "spec": {
                "schedule": "0 2 * * *",
                "template": {
                    "defaultVolumesToFsBackup": false,
                    "storageLocation": "default",
                    "ttl": "720h",
                    "volumeSnapshotLocations": [
                        "default"
                    ]
                }
            },
            "status": {
                "lastBackup": "2023-09-12T02:00:53Z",
                "phase": "Enabled"
            }

@Lyndon-Li added the Metrics (Related to prometheus metrics) label Sep 18, 2023
@vvanouytsel
Author

@yanggangtony , unfortunately I do not have access to those logs anymore.

The weird thing is that velero_backup_last_successful_timestamp points to a backup with timestamp 1.694484087e+09, which is Tue Sep 12 2023 02:01:27 GMT+0000. So according to velero_backup_last_successful_timestamp, the Backup CRD itself, and the CLI, that backup was successful. It is only the velero_backup_last_status metric that claims the backup was not successful.

# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09

@allenxu404
Contributor

The default value of the velero_backup_last_status metric is 0, which indicates failure. Whenever the code exits without updating the metric, it therefore reports failure.

A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and only change the value to 0 if the code exits with a failure.
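
A minimal sketch of that proposal, reusing the hypothetical backupLastStatus gauge from the sketch earlier in this thread (function names are illustrative, not the exact Velero code):

// InitSchedule now seeds the series optimistically: a freshly (re)started
// server reports success for the schedule until a backup actually fails.
func InitSchedule(scheduleName string) {
	backupLastStatus.WithLabelValues(scheduleName).Set(1)
}

// RecordBackupResult is a hypothetical hook called when a backup for the
// schedule finishes; it overwrites the seeded value with the real outcome.
func RecordBackupResult(scheduleName string, succeeded bool) {
	value := 0.0
	if succeeded {
		value = 1.0
	}
	backupLastStatus.WithLabelValues(scheduleName).Set(value)
}

The trade-off, raised later in this thread, is that the metric then reads 1 even for a schedule whose backups have never started.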

@vvanouytsel
Author

@allenxu404 you might actually be on to something.
The problem I described only occurred after a restart of the velero pod.

@yanggangtony
Contributor

> The default value of the velero_backup_last_status metric is 0, which indicates failure. Whenever the code exits without updating the metric, it therefore reports failure.
>
> A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and only change the value to 0 if the code exits with a failure.

In the normal flow the code is right: the schedule creates a Backup CR, the Backup CR is watched, the backup runs, and its status is reported. Changing the default value to 1 would reduce the potential for this issue.

How about we just change the default value from 0 to 1?

@yanggangtony
Contributor

@allenxu404
What do you think about changing the default value?
If that is OK, I can take on the work for the change.

@yanggangtony
Contributor

But another issue may occur: the velero_backup_last_status metric will always be 1 if the backup never begins.

@yanggangtony
Contributor

yanggangtony commented Sep 18, 2023

> The problem I described only occurred after a restart of the velero pod.

@vvanouytsel
So that is the clue behind what I posted before about why the logs look so strange: #6809 (comment)

@allenxu404
Contributor

After briefly examining the code, it appears that the velero_backup_last_status metric is being updated correctly in all relevant sections. Therefore, the issue seems to arise when the code exits without updating the metric.

@yanggangtony Yes, I think it's better to change the default value to 1, indicating success. And it doesn't hurt that the metric indicates success when no backup has been initiated.

@yanggangtony
Contributor

@allenxu404
Thanks for re-confirming the code.
I will take the issue and change the default value to 1.
