Can't delete ZFS snapshot when used by PVC at one point #286

Open

Kidswiss opened this issue Aug 20, 2024 · 3 comments

Kidswiss commented Aug 20, 2024

Hi

I've been playing around with the operator a bit. I rely heavily on snapshots for my backups and noticed that they no longer get cleaned up. I'm using 2 replicas for the disks.

Some more information:

  • The snapshots are successfully deleted from one host (kerrigan03)
  • The other host still contains the snapshot
  • It's possible to simply delete the snapshot manually with zfs destroy
  • Deleting the snapshot manually and then restarting the satellite pod solves the issue (see the sketch below the snapshot list)
  • The snapshot was in use by a PVC
  • The PVC and the VolumeSnapshot were deleted at more or less the same time; this feels like a race condition
  • k3s 1.30.1
linstor snapshot list
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | SnapshotName                                  | NodeNames              | Volumes  | CreatedOn           | State    |
|===============================================================================================================================================================|
| pvc-471a5dd7-1f5b-4d36-9eb2-849faff9f0ca | snapshot-0b5673c2-d829-4965-9d01-d8edd6a39e43 | kerrigan02, kerrigan03 | 0: 1 GiB | 2024-08-20 09:12:48 | DELETING |
| pvc-488b72a9-7753-41ac-b03e-df7c311e4b12 | snapshot-765e2075-d352-402b-a085-c8fd2f501a20 | kerrigan02, kerrigan03 | 0: 1 GiB | 2024-08-20 09:18:31 | DELETING |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
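For reference, the manual workaround from the list above looks roughly like this (dataset names taken from the error report below; the satellite pod name and namespace are placeholders that depend on your deployment):

# on the node that still holds the snapshot (here: kerrigan02)
zfs destroy zfspv-pool/linstor/pvc-488b72a9-7753-41ac-b03e-df7c311e4b12_00000@snapshot-765e2075-d352-402b-a085-c8fd2f501a20

# then restart the LINSTOR satellite pod for that node
# (placeholder pod name; look it up with 'kubectl get pods')
kubectl -n piraeus-datastore delete pod <satellite-pod-on-kerrigan02>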

Error report:


root@kerrigan02:/var/log/linstor-satellite# cat ErrorReport-66C45CF9-C43D0-000001.log
ERROR REPORT 66C45CF9-C43D0-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.28.0
Build ID:                           959382f7b4fb9436fefdd21dfa262e90318edaed
Build time:                         2024-07-11T10:21:06+00:00
Error time:                         2024-08-20 09:18:57
Node:                               kerrigan02
Thread:                             DeviceManager

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69

Error message:                      Failed to delete zfs snapshot

Error context:
        An error occurred while processing snapshot 'snapshot-765e2075-d352-402b-a085-c8fd2f501a20' of resource 'pvc-488b72a9-7753-41ac-b03e-df7c311e4b12'
ErrorContext:
  Details:     Command 'zfs destroy zfspv-pool/linstor/pvc-488b72a9-7753-41ac-b03e-df7c311e4b12_00000@snapshot-765e2075-d352-402b-a085-c8fd2f501a20' returned with exitcode 1. 

Standard out: 


Error message: 
cannot destroy 'zfspv-pool/linstor/pvc-488b72a9-7753-41ac-b03e-df7c311e4b12_00000@snapshot-765e2075-d352-402b-a085-c8fd2f501a20': snapshot has dependent clones
use '-R' to destroy the following datasets:
zfspv-pool/linstor/pvc-158db1b3-a10a-4424-af3e-6149c63adc72_00000




Call backtrace:

    Method                                   Native Class:Line number
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:69
    genericExecutor                          N      com.linbit.linstor.storage.utils.Commands:103
    genericExecutor                          N      com.linbit.linstor.storage.utils.Commands:63
    delete                                   N      com.linbit.linstor.layer.storage.zfs.utils.ZfsCommands:104
    deleteSnapshotImpl                       N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:469
    deleteSnapshotImpl                       N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:70
    deleteSnapshot                           N      com.linbit.linstor.layer.storage.AbsStorageProvider:810
    processSnapshotVolumes                   N      com.linbit.linstor.layer.storage.AbsStorageProvider:392
    processSnapshot                          N      com.linbit.linstor.layer.storage.StorageLayer:333
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:949
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:967
    processSnapshot                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:919
    processSnapshots                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:610
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:211
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:331
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1204
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:778
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:672
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

I'll gladly provide more information if necessary.
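For reference, the underlying ZFS restriction is easy to reproduce outside of LINSTOR (illustrative pool and dataset names, not from this cluster):

zfs create zfspv-pool/demo
zfs snapshot zfspv-pool/demo@snap                  # take a snapshot
zfs clone zfspv-pool/demo@snap zfspv-pool/restore  # clone it, as a restored PVC would be
zfs destroy zfspv-pool/demo@snap                   # fails: snapshot has dependent clones
zfs destroy zfspv-pool/restore                     # delete the clone first
zfs destroy zfspv-pool/demo@snap                   # now succeeds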

EDIT:

I can reproduce the issue with https://kubestr.io/:

kubestr csicheck -s piraeus-storage-replicated -v piraeus-snapshots

Let it fail and check the volume snapshots on the cluster.
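After the failed kubestr run, the leftover objects can be inspected like this (assuming the standard CSI snapshot CRDs are installed):

kubectl get volumesnapshots -A       # VolumeSnapshots stuck in deletion
kubectl get volumesnapshotcontents   # their backing VolumeSnapshotContents
linstor snapshot list                # LINSTOR snapshots stuck in DELETING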

Kidswiss changed the title from "Can't delete ZFS snapshot when using cloned PVCs" to "Can't delete ZFS snapshot when used by PVC at one point" on Aug 20, 2024
WanzenBug (Member) commented

I guess there is a potential race condition between deleting the cloned PVC and the snapshot. ZFS does not allow deleting snapshots that are still used by a cloned volume. Because the CSI driver receives separate requests for deleting the volume and the snapshot, the driver might try to delete the snapshot first, which then fails. After the volume is deleted, I guess LINSTOR should try to delete the snapshot again.
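In ZFS terms, the two orderings look like this (illustrative dataset names):

# failing order: snapshot first, while the cloned volume still exists
zfs destroy pool/vol@snap   # cannot destroy: snapshot has dependent clones
zfs destroy pool/clone

# working order: delete the cloned volume first, then (re)try the snapshot
zfs destroy pool/clone
zfs destroy pool/vol@snap   # succeeds once the clone is gone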

WanzenBug transferred this issue from piraeusdatastore/piraeus-operator on Aug 26, 2024
Kidswiss (Author) commented

Yeah, makes sense.

Usually, K8s operators retry every reconcile loop until the desired state is reached.

However, I don't know whether this is the operator's concern or whether it should be handled upstream in LINSTOR itself.

WanzenBug (Member) commented

> if this should be handled upstream in LINSTOR itself

Ideally, yes. There might be some trickery with zfs promote that could even make this work the same as LVM for most cases.
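For reference, a minimal sketch of the zfs promote trick hinted at here, assuming the same clone setup as above (illustrative names): promote reverses the clone/origin dependency, so the snapshot moves to the promoted clone and the original dataset becomes the dependent one.

zfs clone pool/vol@snap pool/restore   # pool/restore depends on pool/vol@snap
zfs promote pool/restore               # reversed: the snapshot is now pool/restore@snap
                                       # and pool/vol is the dependent clone
zfs destroy pool/vol                   # the original volume can now be deleted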
