failed to restore volume with StorageClass, claim Selector is not supported #7946

Closed
soostdijck opened this issue Jun 27, 2024 · 23 comments

@soostdijck

What steps did you take and what happened:
We have a setup where the NFS CSI driver creates the PVs dynamically once the PVCs are created/restored. This is done by specifying the correct storage classes.

However, when Velero backs up the PVCs, it adds a selector that breaks the PV creation by the NFS driver:

  selector:
    matchLabels:
      velero.io/dynamic-pv-restore: <pvc-name>.87z5b

What did you expect to happen:
We expect the restore to happen without Velero adding extra selectors that break the dynamic PV creation.

The output of the following commands will help us better understand what's going on:

velero backup create BACKUP_NAME --include-namespaces NAMESPACE --snapshot-move-data --snapshot-volumes --include-resources pvc

velero restore create --from-backup BACKUP_NAME

Environment:

Velero helm chart 6.4.x, Velero version 1.13.2
Kubernetes version 1.27

Note: this is a duplicate of this issue on the Helm chart, but I think it belongs here.

@kaovilai
Member

> it adds a selector that breaks the PV creation by the NFS driver

Sounds like a faulty NFS driver if it can't handle a labelSelector added by the user (or by Velero).

@soostdijck
Author

soostdijck commented Jun 27, 2024

> > it adds a selector that breaks the PV creation by the NFS driver
>
> Sounds like a faulty NFS driver if it can't handle a labelSelector added by the user (or by Velero).

It seems very unlikely to me that something as large and common as csi nfs would be "faulty".

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 28, 2024

This is done on purpose, as expected by the Velero data mover restore workflow.
After the Velero data mover restore completes, the restored PV will be bound to this PVC. In other words, this PVC can only be bound by the Velero data mover restore.

If you don't see the binding happen, it means the data mover restore has not completed.
In that case, you can inspect the corresponding DataDownload CR to see the progress by running kubectl get datadownload -n velero
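
As a rough illustration of the check described above, the following Python sketch summarizes DataDownload phases from the JSON printed by `kubectl get datadownload -n velero -o json`. It assumes that each item reports its state under `status.phase` and that `Completed` marks a finished restore. The function names and the sample data are hypothetical and are not part of Velero.

```python
import json


def summarize_datadownloads(kubectl_json: str) -> dict:
    """Map each DataDownload name to its phase ("Unknown" if no status yet).

    Assumption: the phase is reported under status.phase.
    """
    doc = json.loads(kubectl_json)
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in doc.get("items", [])
    }


def unfinished(phases: dict) -> list:
    """Return the names of DataDownloads that are not in the Completed phase."""
    return sorted(name for name, phase in phases.items() if phase != "Completed")


# Hypothetical sample standing in for real kubectl output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "restore-a"}, "status": {"phase": "Completed"}},
        {"metadata": {"name": "restore-b"}, "status": {"phase": "InProgress"}},
        {"metadata": {"name": "restore-c"}},  # no status reported yet
    ]
})

print(unfinished(summarize_datadownloads(SAMPLE)))
```

With real data, the string passed to `summarize_datadownloads` would be the captured output of the kubectl command rather than `SAMPLE`.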

@kaovilai
Member

@soostdijck can you link docs that indicate a label selector cannot be added?

@StellaV

StellaV commented Jun 28, 2024

Hi @Lyndon-Li and @kaovilai

Thanks for the quick replies!

I think there's some confusion about how we use the NFS driver. We do not back up the PVs, as we use a storage class that dynamically creates them when a PVC is added. This is where it goes wrong. The dynamic PVs cannot be created due to the selector added by Velero, resulting in the error "failed to restore volume with StorageClass, claim Selector is not supported".

Here's an example of how we did it:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Delete=false
  name: sc-example
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.com
  share: /
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.2
  - nolock
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-example
  labels:
    velero.io/include-in-backup: "true"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: sc-example
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsc-example
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: nfs.csi.k8s.io
parameters:
  server: nfs.example.com
  share: /
deletionPolicy: Delete

I hope this makes the issue a bit clearer.

Regards,
Stella

@Lyndon-Li
Contributor

> "failed to restore volume with StorageClass, claim Selector is not supported"

As I mentioned here, this is expected if you are running data mover restore.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 28, 2024

> We do not back up the PV's, as we use a storage class that dynamically creates them when a PVC is added

Velero automatically selects which PVCs and PVs to back up. Depending on the backup method, the PV object is sometimes backed up and sometimes not. With the data mover backup you are using, the PV object is NOT backed up, while the PVC object is.

@StellaV

StellaV commented Jul 1, 2024

@Lyndon-Li ,

Velero does by default select everything. But we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV. But what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero.

@Lyndon-Li
Contributor

Lyndon-Li commented Jul 1, 2024

> what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero

This will happen after the Velero data mover restore completes. During the restore process, a PV will be created by the NFS driver and will finally be bound to the restored PVC after the data is restored to the PV.

Therefore, if the PVC is not restored successfully, check whether the DataDownload has completed successfully.

@StellaV

StellaV commented Jul 1, 2024

> > what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero
>
> This will happen after the Velero data mover restore completes. During the restore process, a PV will be created by the NFS driver and will finally be bound to the restored PVC after the data is restored to the PV.
>
> Therefore, if the PVC is not restored successfully, check whether the DataDownload has completed successfully.

That's exactly what I also expected to happen, but I get the "claim Selector is not supported" error instead. The DataDownload step is not even reached.

I see a similar issue here, which is the next driver we needed to test with Velero :)

@Lyndon-Li
Contributor

Lyndon-Li commented Jul 1, 2024

> we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV

This (backing up and restoring only the PVC, without the pod) is not related to the provisioning method (dynamic or static); it is related to the volumeBindingMode of the PVC's StorageClass. Specifically, if the binding mode is Immediate, everything works well.
But if the binding mode is WaitForFirstConsumer, the restore will never complete until the PVC is mounted by a pod; see issue #7561. This is a design constraint of WaitForFirstConsumer in Kubernetes: the PV is not provisioned until a pod that uses the PVC is scheduled.

This applies to PVC-only restores only; normal restores (PVCs with pods) don't have this problem.

@StellaV

StellaV commented Jul 1, 2024

> > we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV
>
> This (backing up and restoring only the PVC, without the pod) is not related to the provisioning method (dynamic or static); it is related to the volumeBindingMode of the PVC's StorageClass. Specifically, if the binding mode is Immediate, everything works well. But if the binding mode is WaitForFirstConsumer, the restore will never complete until the PVC is mounted by a pod; see issue #7561. This is a design constraint of WaitForFirstConsumer in Kubernetes: the PV is not provisioned until a pod that uses the PVC is scheduled.
>
> This applies to PVC-only restores only; normal restores (PVCs with pods) don't have this problem.

That makes perfect sense. We have the bindingMode set to Immediate (see the YAML snippet I added earlier; this is almost the exact code we used). So this should not be an issue.

@Lyndon-Li
Contributor

OK, then the expected behavior is that the PVC is restored successfully. If that is not the case for you, please share the Velero log bundle with us by running velero debug

@edhunter665

We have the same issue using the vSphere CSI driver csi.vsphere.vmware.com.
If bindingMode is set to Immediate, the restore fails (partially). Everything but the PV and PVC gets restored.
If bindingMode is set to WaitForFirstConsumer, the whole restore works fine.

@Lyndon-Li
Contributor

@edhunter665 This doesn't look like the original problem, so please open another issue and attach more details and the Velero log bundle.

@datacore-tilangovan

We have faced a similar issue; the binding mode is set to Immediate too.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2024-07-24T09:04:05Z"
  name: mayastor-3-thin
  resourceVersion: "124695"
  uid: 0123e979-cfe8-47c5-91e3-4c3317128443
parameters:
  protocol: nvmf
  repl: "3"
  thin: "true"
provisioner: io.openebs.csi-mayastor
reclaimPolicy: Delete
volumeBindingMode: Immediate

Events:
  Type     Reason                Age                From  Message
  ----     ------                ----               ----  -------
  Normal   ExternalProvisioning  10s (x6 over 70s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "io.openebs.csi-mayastor" or manually created by system administrator
  Normal   Provisioning          7s (x7 over 70s)   io.openebs.csi-mayastor_worker-velero-3_95efc4de-4d47-45a9-94ca-82333425cfdc  External provisioner is provisioning volume for claim "test/ms-volume-claim"
  Warning  ProvisioningFailed    7s (x7 over 70s)   io.openebs.csi-mayastor_worker-velero-3_95efc4de-4d47-45a9-94ca-82333425cfdc  failed to provision volume with StorageClass "mayastor-3-thin": claim Selector is not supported

@yaraskm

yaraskm commented Jul 29, 2024

I'm encountering the same issue as @datacore-tilangovan , with the difference being that we're using the HSPC CSI driver. Otherwise, the log output is the same and the volumes are never created. Our PVCs are created using Immediate as well.

I'm going to try creating an additional StorageClass for restores that sets volumeBindingMode to WaitForFirstConsumer as @edhunter665 alluded to, while doing restores only.

@yaraskm

yaraskm commented Jul 30, 2024

> I'm encountering the same issue as @datacore-tilangovan , with the difference being that we're using the HSPC CSI driver. Otherwise, the log output is the same and the volumes are never created. Our PVCs are created using Immediate as well.
>
> I'm going to try creating an additional StorageClass for restores that sets volumeBindingMode to WaitForFirstConsumer as @edhunter665 alluded to, while doing restores only.

I was able to make this flow work, with the help of a small script.

After switching from Immediate to WaitForFirstConsumer, the restore operation no longer fails immediately. Instead, it waits for the PVCs being restored to be mounted by a Pod, up to the expected timeout. Since my backups only contain PVCs, I wrote a small Python utility that uses the Kubernetes SDK to:

  1. List all PVCs in a target namespace that are in a Pending state, with the label velero.io/dynamic-pv-restore
  2. Create a Job that tries to mount all of these PVCs
  3. When the Pod actually starts, it simply echoes a message and exits

Using this flow, I was able to successfully restore a backup that contained only PVCs, created using the CSI Snapshotter + data movement functionality.
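
The utility itself was not posted, so the following is only a minimal sketch of the three steps listed above, under stated assumptions. It works on the JSON from `kubectl get pvc -n <namespace> -o json` instead of calling the Kubernetes SDK, and it treats a PVC as a restore target if `velero.io/dynamic-pv-restore` appears either in its labels or in its selector's matchLabels. The function names, the Job name, the `busybox` image, and the sample data are hypothetical.

```python
import json

RESTORE_LABEL = "velero.io/dynamic-pv-restore"


def pending_restore_pvcs(pvc_list: dict) -> list:
    """Step 1: names of Pending PVCs that carry the Velero restore label.

    Assumption: the key may appear in metadata.labels or in
    spec.selector.matchLabels, so both places are checked.
    """
    names = []
    for pvc in pvc_list.get("items", []):
        labels = pvc.get("metadata", {}).get("labels") or {}
        selector = (pvc.get("spec", {}).get("selector") or {}).get("matchLabels") or {}
        pending = pvc.get("status", {}).get("phase") == "Pending"
        if pending and (RESTORE_LABEL in labels or RESTORE_LABEL in selector):
            names.append(pvc["metadata"]["name"])
    return names


def build_mount_job(namespace: str, pvc_names: list,
                    job_name: str = "pvc-mount-helper",
                    image: str = "busybox") -> dict:
    """Steps 2 and 3: a Job whose Pod mounts every PVC, echoes, and exits."""
    volumes = [
        {"name": f"vol-{i}", "persistentVolumeClaim": {"claimName": name}}
        for i, name in enumerate(pvc_names)
    ]
    mounts = [
        {"name": f"vol-{i}", "mountPath": f"/mnt/vol-{i}"}
        for i in range(len(pvc_names))
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "mount",
                        "image": image,
                        "command": ["sh", "-c", "echo 'PVCs mounted'"],
                        "volumeMounts": mounts,
                    }],
                    "volumes": volumes,
                }
            },
        },
    }


# Hypothetical sample standing in for real kubectl output.
SAMPLE_PVCS = {
    "items": [
        {"metadata": {"name": "data-a"},
         "spec": {"selector": {"matchLabels": {RESTORE_LABEL: "data-a.87z5b"}}},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "data-b", "labels": {RESTORE_LABEL: "data-b.x1y2z"}},
         "spec": {}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "data-c", "labels": {RESTORE_LABEL: "data-c.q9w8e"}},
         "spec": {}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "data-d"}, "spec": {}, "status": {"phase": "Pending"}},
    ]
}

job = build_mount_job("test", pending_restore_pvcs(SAMPLE_PVCS))
print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.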

@Lyndon-Li
Contributor

Though I don't have enough information to confirm, I think the failures mentioned in this issue are most likely caused by issue #7898.
So please check the Velero logs or the DataUpload/DataDownload CR status. If you see a message similar to the example below, the problem will be fixed by #7898:

 message: 'found a dataupload openshift-adp/backup20-llp79 with expose error: Pod
  is unschedulable: 0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims.
  preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling...
  mark it as cancel'

The "claim Selector is not supported" message on the PVC is not an error, but expected behavior.

For volume-only restores, Velero data mover restore currently supports only Immediate volumes. There may be a better way to handle WaitForFirstConsumer volumes, which will be implemented in future releases; see #8044.

@yaraskm

yaraskm commented Aug 2, 2024

Thanks for the thorough explanation @Lyndon-Li ! Do you have a rough ETA on the v1.14.1 release with the fix?

@Lyndon-Li
Contributor

A tentative ETA is by the end of Aug.


github-actions bot commented Oct 5, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Oct 5, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 19, 2024