
Error VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation #8444

Open · erichevers opened this issue Nov 22, 2024 · 14 comments
Labels: area/datamover, Needs info (Waiting for information)

@erichevers

erichevers commented Nov 22, 2024

What steps did you take and what happened:
I did a restore of a backup, made from a PVC on Rook Ceph on cluster prod01, to cluster dr01 with:

  • velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-2
    The backup was made using --snapshot-move-data to S3-compatible storage.

What did you expect to happen:
The restore to succeed, but I get the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  error preparing volumesnapshots.snapshot.storage.k8s.io/vaicloud-dev/cephfs-pvc-snapshot: rpc error: code = Unknown desc = VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation

I have set the requested annotation on RBD and CephFS, on both clusters. Also, the volumes that need to be restored use the rook-ceph-block StorageClass, not CephFS as the failure message indicates, so I'm wondering why this restore fails with a reference to CephFS.
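If the annotation mentioned above refers to the velero.io/csi-volumesnapshot-class selector, note that Velero's CSI support normally expects it as a label on the VolumeSnapshotClass rather than an annotation. A minimal sketch, using the snapshot class names that appear later in this thread (treat them as assumptions):

  kubectl label volumesnapshotclass csi-rbdplugin-snapclass velero.io/csi-volumesnapshot-class="true" --overwrite
  kubectl label volumesnapshotclass csi-cephfsplugin-snapclass velero.io/csi-volumesnapshot-class="true" --overwrite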

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help
bundle-2024-11-22-16-16-21.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): Version 1.15.0 on both clusters
  • Velero features (use velero client config get features): features: EnableCSI
  • Kubernetes version (use kubectl version): 1.30.0 on the prod01 cluster (backup) and 1.31.1 on the dr01 (restore)
  • Kubernetes installer & version: RKE2
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

This is not expected.
Since the restore's referenced backup already enabled the SnapshotMoveData flag, the restore should not use the CSI plugin to restore the volume data.
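As a quick sanity check (a sketch, assuming the default velero namespace and the snapshotMoveData field on the Backup CR), one can confirm that data movement was actually enabled on the referenced backup:

  kubectl -n velero get backup vaicloud-dev-backup22112024-2 -o jsonpath='{.spec.snapshotMoveData}{"\n"}'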

@blackpiglet
Contributor

@erichevers
Could you help collect the debug bundle of the restore-referenced backup?
IMO, the error happened because the restore tried to restore the VolumeSnapshot CR.
This is not expected.

  • If the VolumeSnapshot was created during the backup, the correct behavior is for it to be deleted during the backup, so something must have gone wrong there.
  • If the VolumeSnapshot existed before Velero took the backup, that is also unexpected: the VolumeSnapshot should have been updated with the needed information by the CSI BIA.
time="2024-11-22T15:01:11Z" level=info msg="Executing item action for volumesnapshots.snapshot.storage.k8s.io" logSource="pkg/restore/restore.go:1321" restore=velero/restore-test
time="2024-11-22T15:01:11Z" level=info msg="Starting VolumeSnapshotRestoreItemAction" cmd=/velero logSource="pkg/restore/actions/csi/volumesnapshot_action.go:78" pluginName=velero restore=velero/restore-test

@erichevers
Author

Hi @blackpiglet ,
Thanks for looking into this. The debug backup logs are here:
bundle-2024-11-25-10-16-20.tar.gz

@blackpiglet
Contributor

Thanks for collecting the debug bundle.

There were three VolumeSnapshots included in the backup, and they were not created by the backup.

  snapshot.storage.k8s.io/v1/VolumeSnapshot:
    - vaicloud-dev/cephfs-pvc-snapshot
    - vaicloud-dev/velero-vaicloud-mq-volume-rsjnj
    - vaicloud-dev/velero-vaicloud-postgresql-volume-bgfhz

Velero also ran the VolumeSnapshot BackupItemAction against them.

The only reason the restore failed to restore the VolumeSnapshots is that the backup-included VolumeSnapshots didn't have a Status, or the Status didn't contain the SnapshotHandle, while the backup was running.
Could you please check the content of those VolumeSnapshots?
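A sketch of one way to check them, following each VolumeSnapshot to its VolumeSnapshotContent and printing the snapshot handle (names as in this backup):

  for vs in cephfs-pvc-snapshot velero-vaicloud-mq-volume-rsjnj velero-vaicloud-postgresql-volume-bgfhz; do
    content=$(kubectl -n vaicloud-dev get volumesnapshot "$vs" -o jsonpath='{.status.boundVolumeSnapshotContentName}')
    echo "$vs -> ${content:-<none>}"
    [ -n "$content" ] && kubectl get volumesnapshotcontent "$content" -o jsonpath='{.status.snapshotHandle}{"\n"}'
  done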

@erichevers
Author

erichevers commented Nov 25, 2024

Hi @blackpiglet ,

On the prod01 cluster I've checked the VolumeSnapshots and indeed there are three:

kubectl get volumesnapshots
NAME                                      READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
cephfs-pvc-snapshot                       false        cephfs-pvc                                                         csi-cephfsplugin-snapclass                                                                     41d
velero-vaicloud-mq-volume-rsjnj           true         vaicloud-mq-volume                                   2Gi           csi-rbdplugin-snapclass      snapcontent-131fbc1b-92f6-4980-87e9-997d4aef74c3   3d7h           3d7h
velero-vaicloud-postgresql-volume-bgfhz   true         vaicloud-postgresql-volume                           30Gi          csi-rbdplugin-snapclass      snapcontent-89dd62b0-7dae-4297-83cb-dc4d1b97db86   3d7h           3d7h

I don't know where the CephFS snapshot is coming from, but the describe shows that it is in a failed state.

kubectl describe volumesnapshot cephfs-pvc-snapshot
Status:
  Error:
    Message:     Failed to create snapshot content with error snapshot controller failed to update cephfs-pvc-snapshot on API server: cannot get claim from snapshot
    Time:        2024-10-15T13:35:37Z
  Ready To Use:  false

I also looked at the other two snapshots and they came from another backup.
I've deleted all three VolumeSnapshots and started a new backup:
velero backup create vaicloud-dev-backup22112024-4 --include-namespaces vaicloud-dev --snapshot-move-data

Snapshots got created and removed, as they should be.
Then I moved over to the dr01 cluster and did a restore:

velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-4

This time there was no message about the VolumeSnapshot as in the original case. However, the restore job stays in:
WaitingForPluginOperations
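While a restore sits in WaitingForPluginOperations, the data mover progress can usually be followed through the restore's async operations and its DataDownload CRs (a sketch, assuming Velero 1.15 defaults in the velero namespace):

  velero restore describe restore-test --details
  kubectl -n velero get datadownloads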

Below is the debug logfile of the restore:
bundle-2024-11-25-21-09-59.tar.gz

Regards

@blackpiglet
Contributor

From the log, I think the restore worked as expected.
How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because the data mover restore needs to create a temporary pod and PVC to host the restored data.
The restore time also depends on the amount of restored volume data.
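The temporary objects mentioned above can be observed directly while the restore is running; a minimal sketch, assuming the default velero namespace (the exact names of the temporary pods and PVCs are generated by the data mover):

  kubectl -n velero get pods,pvc --watch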

@blackpiglet
Contributor

blackpiglet commented Nov 26, 2024

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

@blackpiglet blackpiglet assigned blackpiglet and unassigned Lyndon-Li Nov 26, 2024
@erichevers
Author

erichevers commented Nov 26, 2024

From the log, I think the restore worked as expected. How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because the data mover restore needs to create a temporary pod and PVC to host the restored data. The restore time also depends on the amount of restored volume data.

Hi @blackpiglet ,
I just checked and the job failed after the standard 4-hour timeout, with the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-postgresql-volume, PV: pvc-59e5c022-2214-4ef4-a24b-f8afff278041
                   fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-mq-volume, PV: pvc-b6de9f16-8a9d-41ae-ae2e-a5fc715377c0

And the pods are still in Pending.

Regards

@blackpiglet
Contributor

@erichevers
I found a similar issue: #7866.
Could you check the status of the PVCs vaicloud-postgresql-volume and vaicloud-mq-volume?
IMO, they did not end up in the Bound phase after being created by the restore.

@erichevers
Author

erichevers commented Nov 26, 2024

@blackpiglet ,
The PVCs are in Pending.

kubectl describe pvc vaicloud-postgresql-volume -n vaicloud-dev
gives:
Name:          vaicloud-postgresql-volume
Namespace:     vaicloud-dev
StorageClass:  rook-ceph-block
Status:        Pending
Volume:        
Labels:        velero.io/backup-name=vaicloud-dev-backup22112024-4
               velero.io/restore-name=restore-test
               velero.io/volume-snapshot-name=velero-vaicloud-postgresql-volume-jkmhl
Annotations:   backup.velero.io/must-include-additional-items: true
               velero.io/csi-volumesnapshot-class: csi-rbdplugin-snapclass
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       vaicloud-db-7846b4c4cd-25k8w
Events:
  Type    Reason                Age                     From                                                                                                        Message
  ----    ------                ----                    ----                                                                                                        -------
  Normal  ExternalProvisioning  3m53s (x2923 over 12h)  persistentvolume-controller                                                                                 Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning          27s (x205 over 12h)     rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-54b4855f96-b95cx_0ee214d7-199b-4fcc-8748-dfa6b513df21  External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

Regards

@blackpiglet blackpiglet added Needs info Waiting for information and removed Needs investigation labels Nov 26, 2024
@blackpiglet
Contributor

Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

To me, the error seems to be related to Rook Ceph not creating a volume for the PVC in time.
Could you check whether there are any error logs in the Rook Ceph pods?
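A sketch of where to look, assuming the default rook-ceph namespace and the provisioner deployment named in the PVC events above (container names may differ per Rook version):

  kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-provisioner --tail=100
  kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-rbdplugin --tail=100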

@blackpiglet
Contributor

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

Created a new issue #8460 to address this comment.

@erichevers
Author

erichevers commented Nov 26, 2024

Hi @blackpiglet ,

I did a quick test. I deleted the pending PVC vaicloud-mq-volume and ran
kubectl apply -f pvc4test.yaml -n vaicloud-dev
on this file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vaicloud-mq-volume
  namespace: vaicloud-dev
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 2Gi

The new PVC got Bound and the pod was immediately in Running.
I don't see any difference between the PVC in prod01 and the new one in dr01:

Prod01: kubectl describe pvc vaicloud-mq-volume -n vaicloud-dev
Name:          vaicloud-mq-volume
Namespace:     vaicloud-dev
StorageClass:  rook-ceph-block
Status:        Bound
Volume:        pvc-b6de9f16-8a9d-41ae-ae2e-a5fc715377c0
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               velero.io/csi-volumesnapshot-class: csi-rbdplugin-snapclass
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       vaicloud-mq-54d4469b99-qfrzm
Events:        <none>

kubectl config use-context dr01                        
Switched to context "dr01".

kubectl describe pvc vaicloud-mq-volume -n vaicloud-dev
Name:          vaicloud-mq-volume
Namespace:     vaicloud-dev
StorageClass:  rook-ceph-block
Status:        Bound
Volume:        pvc-1db2cbaa-347b-4ab1-8f0b-9a803c28a393
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       vaicloud-mq-54d4469b99-qfrzm
Events:
  Type    Reason                 Age    From                                                                                                        Message
  ----    ------                 ----   ----                                                                                                        -------
  Normal  ExternalProvisioning   4m35s  persistentvolume-controller                                                                                 Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning           4m35s  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-54b4855f96-b95cx_0ee214d7-199b-4fcc-8748-dfa6b513df21  External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-mq-volume"
  Normal  ProvisioningSucceeded  4m35s  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-54b4855f96-b95cx_0ee214d7-199b-4fcc-8748-dfa6b513df21  Successfully provisioned volume pvc-1db2cbaa-347b-4ab1-8f0b-9a803c28a393

So to me it seems that Rook Ceph is working correctly.

I did a new restore test and now I saw the following in the events:
failed to provision volume with StorageClass "rook-ceph-block": claim Selector is not supported

Could this be an issue?

Regards

@blackpiglet
Contributor

The claim Selector is not supported message is not an issue.
This is expected behavior for the Velero data mover restore.
Please check the DataDownloads created by the timed-out restore. If the DataDownloads took longer than the timeout setting (4 hours), then the restore failure is expected. Enlarging the timeout can resolve it.

It would be better to collect the timed-out restore's debug bundle to investigate what happened exactly.
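If the 4-hour default is indeed the limit being hit, the item operation timeout can be enlarged on the restore; a sketch, assuming the --item-operation-timeout flag available in recent Velero releases (double-check with velero restore create --help):

  velero restore create restore-test-2 --include-namespaces vaicloud-dev \
    --from-backup vaicloud-dev-backup22112024-4 --item-operation-timeout 8h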
