
[Draft] Improve CSI Snapshotting Performance #6860

Draft: wants to merge 7 commits into main

Conversation

@anshulahuja98 (Collaborator) commented Sep 22, 2023

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #6165

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@github-actions bot added the Area/Design (Design Documents) label on Sep 22, 2023
Signed-off-by: Anshul Ahuja <[email protected]>
Signed-off-by: Anshul Ahuja <[email protected]>
@codecov bot commented Sep 22, 2023

Codecov Report

Merging #6860 (5d81317) into main (81057b9) will increase coverage by 0.83%.
Report is 235 commits behind head on main.
The diff coverage is 71.05%.

@@            Coverage Diff             @@
##             main    #6860      +/-   ##
==========================================
+ Coverage   60.19%   61.02%   +0.83%     
==========================================
  Files         242      255      +13     
  Lines       25670    27066    +1396     
==========================================
+ Hits        15451    16516    +1065     
- Misses       9143     9368     +225     
- Partials     1076     1182     +106     
| Files | Coverage | Δ |
| --- | --- | --- |
| pkg/backup/pvc_snapshot_tracker.go | 89.09% <100.00%> | +0.41% ⬆️ |
| pkg/backup/item_backupper.go | 68.66% <69.44%> | -1.04% ⬇️ |

... and 80 files with indirect coverage changes


## Implementation
## For Approach 2
- Current code flow: `backupItem` in backup.go is invoked for each resource -> this further invokes `itembackupper.backupItem` -> `backupItemInternal`
Collaborator Author:

Before looking at code, first read through the impl

### Approach 1: Add support for VolumeGroupSnapshot in Velero.
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots for multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
Collaborator Author:

The current code sample is based on this.
It is based on the principles of approach 3, and approach 1 can be further implemented on top of the current setup.
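For reference, Approach 1 above relies on the alpha VolumeGroupSnapshot API, which groups PVCs with a label selector rather than an explicit PVC list. A minimal sketch of creating one through the dynamic client follows; the `groupsnapshot.storage.k8s.io/v1alpha1` group/version and the class name are assumptions taken from the Kubernetes 1.27 alpha announcement and may not match a given cluster.

```go
package snapshotgroup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// createVolumeGroupSnapshot requests a crash-consistent snapshot of every PVC
// in the namespace that matches the given labels. The GVR and the
// volumeGroupSnapshotClassName are illustrative assumptions only.
func createVolumeGroupSnapshot(ctx context.Context, client dynamic.Interface, ns, name string, matchLabels map[string]string) error {
	gvr := schema.GroupVersionResource{
		Group:    "groupsnapshot.storage.k8s.io", // assumed alpha API group
		Version:  "v1alpha1",
		Resource: "volumegroupsnapshots",
	}

	// Convert the selector into JSON-compatible types for the unstructured object.
	selector := map[string]interface{}{}
	for k, v := range matchLabels {
		selector[k] = v
	}

	vgs := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": gvr.Group + "/" + gvr.Version,
		"kind":       "VolumeGroupSnapshot",
		"metadata":   map[string]interface{}{"name": name, "namespace": ns},
		"spec": map[string]interface{}{
			"volumeGroupSnapshotClassName": "example-vgs-class", // hypothetical class name
			"source": map[string]interface{}{
				"selector": map[string]interface{}{"matchLabels": selector},
			},
		},
	}}

	_, err := client.Resource(gvr).Namespace(ns).Create(ctx, vgs, metav1.CreateOptions{})
	return err
}
```

The essential piece is `spec.source.selector`: the group is defined by labels, which is why the design doc talks about labeling PVCs for grouping.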

itemFiles = append(itemFiles, additionalItemFiles...)
}
wg.Wait()
close(additionalItemFilesChannel)
Collaborator Author:

I'll refine the channel handling and code accuracy later; for now, consider this only as a draft representation of the approach.
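The pattern the draft is reaching for looks roughly like the sketch below: fan out one goroutine per additional item, wait for all of them, then close and drain the channels. `backupAdditionalItem` and `itemFile` are placeholder names, not Velero's actual functions or types.

```go
package backupsketch

import (
	"fmt"
	"sync"
)

type itemFile struct{ path string } // stand-in for a backed-up item file

// backupAdditionalItemsParallel fans out one goroutine per additional item,
// waits for all of them, then drains the result and error channels.
func backupAdditionalItemsParallel(items []string, backupAdditionalItem func(string) (itemFile, error)) ([]itemFile, error) {
	var wg sync.WaitGroup
	resultCh := make(chan itemFile, len(items)) // buffered so goroutines never block
	errCh := make(chan error, len(items))

	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			f, err := backupAdditionalItem(item)
			if err != nil {
				errCh <- fmt.Errorf("backing up additional item %s: %w", item, err)
				return
			}
			resultCh <- f
		}(item)
	}

	// Close only after every goroutine has finished, so the range loops terminate.
	wg.Wait()
	close(resultCh)
	close(errCh)

	var files []itemFile
	for f := range resultCh {
		files = append(files, f)
	}
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	if len(errs) > 0 {
		return files, fmt.Errorf("%d additional item backup(s) failed: %v", len(errs), errs)
	}
	return files, nil
}
```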

@sseago (Collaborator) left a comment:

I actually think Approach 3 may be the cleanest. It doesn't require adding further special case code to the backup workflow. Create a new Pod action plugin in the CSI plugin which essentially does the work of the CSI PVC plugin for each PVC mounted by the pod (those for which the CSI plugin would match, not the FS backup volumes). Refactor the PVC plugin so that common functionality needed by this and pod plugin is shared, and before returning each PVC as an additional item, set an annotation on the PVC that the PVC CSI plugin will use to know it should ignore that PVC and not snapshot it.
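A rough sketch of the shape described above, using stand-in types instead of Velero's real plugin interfaces; the annotation key and the `snapshotPVC`/`isCSIBacked`/`annotatePVC` helpers are hypothetical placeholders for logic that would be shared with the existing PVC plugin.

```go
package podaction

import (
	"context"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// csiSkipAnnotation is a hypothetical annotation the PVC-level CSI plugin
// would check in order to skip PVCs already snapshotted by the pod plugin.
const csiSkipAnnotation = "velero.io/csi-volumesnapshot-handled"

// resourceID is a stand-in for Velero's additional-item identifier.
type resourceID struct {
	Namespace, Name string
}

// executePodAction snapshots all CSI-backed PVCs mounted by the pod in parallel,
// annotates them so the PVC plugin ignores them, then returns them as additional items.
func executePodAction(ctx context.Context, pod *corev1.Pod,
	isCSIBacked func(ns, claim string) bool,
	snapshotPVC func(ctx context.Context, ns, claim string) error,
	annotatePVC func(ctx context.Context, ns, claim, key, value string) error,
) ([]resourceID, error) {

	var claims []string
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil && isCSIBacked(pod.Namespace, vol.PersistentVolumeClaim.ClaimName) {
			claims = append(claims, vol.PersistentVolumeClaim.ClaimName)
		}
	}

	var wg sync.WaitGroup
	errCh := make(chan error, len(claims))
	for _, claim := range claims {
		wg.Add(1)
		go func(claim string) {
			defer wg.Done()
			if err := snapshotPVC(ctx, pod.Namespace, claim); err != nil {
				errCh <- fmt.Errorf("snapshotting PVC %s/%s: %w", pod.Namespace, claim, err)
				return
			}
			// Mark the PVC so the PVC-level CSI plugin knows to skip it.
			if err := annotatePVC(ctx, pod.Namespace, claim, csiSkipAnnotation, "true"); err != nil {
				errCh <- err
			}
		}(claim)
	}
	wg.Wait()
	close(errCh)

	for err := range errCh {
		return nil, err // surface the first failure; aggregation is discussed further down
	}

	additional := make([]resourceID, 0, len(claims))
	for _, claim := range claims {
		additional = append(additional, resourceID{Namespace: pod.Namespace, Name: claim})
	}
	return additional, nil
}
```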


## Approach 3: Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- This would lead to code and logic duplication across CSI Plugin and the pod plugin.
Collaborator:

We could mitigate the duplication by pulling shared functionality into functions called by both plugins.

@anshulahuja98 (Collaborator Author):

> I actually think Approach 3 may be the cleanest. It doesn't require adding further special case code to the backup workflow. Create a new Pod action plugin in the CSI plugin which essentially does the work of the CSI PVC plugin for each PVC mounted by the pod (those for which the CSI plugin would match, not the FS backup volumes). Refactor the PVC plugin so that common functionality needed by this and pod plugin is shared, and before returning each PVC as an additional item, set an annotation on the PVC that the PVC CSI plugin will use to know it should ignore that PVC and not snapshot it.

Will try to experiment with this route as well. Need to see in the code how it fits together.

Anshul Ahuja added 3 commits October 16, 2023 12:02
Signed-off-by: Anshul Ahuja <[email protected]>
Signed-off-by: Anshul Ahuja <[email protected]>
for itemFilesFromChannel := range additionalItemFilesChannel {
itemFiles = append(itemFiles, itemFilesFromChannel)
}
for err := range errChannel {
Contributor:

I think this is going to eat some errors

Collaborator:

@shawn-hurley ahh yes, we were talking about this on slack earlier. Before this change, velero backs up each additionalItem in turn, and upon the first error, it's returned as an error, and the others aren't attempted. Now that we're doing them in parallel, all will start, so it's possible that more than one will error out.

Since the failing additionalItem should log an error for its own failure, the full error list shouldn't have anything missing. That said, rather than just returning the error for the additional item, we probably want a more descriptive error here anyway, since err here doesn't reference the current item at all. Perhaps logic along these lines: if errChannel isn't empty, return an error with the message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel)".
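A minimal sketch of that aggregation, assuming the error channel has already been closed by the producer side:

```go
package backupsketch

import (
	"fmt"
	"strings"
)

// combineAdditionalItemErrors drains a closed error channel and, if anything
// failed, returns one error that names the current item, as suggested above.
func combineAdditionalItemErrors(currentItem string, errCh <-chan error) error {
	var msgs []string
	for err := range errCh {
		msgs = append(msgs, err.Error())
	}
	if len(msgs) == 0 {
		return nil
	}
	return fmt.Errorf("one or more additional items for %s failed: %s",
		currentItem, strings.Join(msgs, "; "))
}
```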

Collaborator Author:

"Perhaps logic along the lines of this? If errChannel isn't empty return an error with message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel)."
I can take care of that.
If that's enough to address this concern

Collaborator:

@anshulahuja98 That addresses the concern on my end. I think that should make sure that no errors are swallowed here. Net effect is if a pod has 2 PVCs and both PVC backups fail, then each PVC's backup error should show up as a PVC error, and then the pod will fail with the combined error message.

Contributor:

I don't know if it's reasonable to put them all in one message; with, say, 5-10 errors the message will become unwieldy in the logs.

Can we create a single error, like you said, and then log every other error?

Or is that what was proposed and I missed it?

Contributor:

A goroutine could be pulling errors off the error channel as the backups run, so you can see the log entries as items fail.

Then the returned error just indicates whether the error-handling goroutine saw any failures.

Thoughts?

Collaborator Author:

That works too. Makes sense to me.
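A sketch of that variant: a goroutine logs each failure as it arrives, and the caller only returns a short summary error. The logger is logrus (which Velero uses), but the wiring and names here are illustrative only.

```go
package backupsketch

import (
	"fmt"
	"sync"

	"github.com/sirupsen/logrus"
)

// drainAndLogErrors logs each error from errCh as it arrives and counts them,
// so the caller can return a single summary error instead of one huge message.
func drainAndLogErrors(log logrus.FieldLogger, errCh <-chan error, currentItem string) func() error {
	var wg sync.WaitGroup
	var count int

	wg.Add(1)
	go func() {
		defer wg.Done()
		for err := range errCh {
			count++
			log.WithError(err).Errorf("additional item backup failed for %s", currentItem)
		}
	}()

	// The returned closure is called after errCh has been closed by the producer.
	return func() error {
		wg.Wait()
		if count > 0 {
			return fmt.Errorf("%d additional item backup(s) for %s failed; see logs for details", count, currentItem)
		}
		return nil
	}
}
```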

- Pass a list of PVCs to CSI Plugin for Approach 1?

## How to Group Snapshots
- PVCs which are being used by a pod/deployment should be snapshotted together.
Contributor:

Practically, any set of PVCs could be correlated and so need to be grouped, not only PVCs from the same pod/deployment.
More accurately, we could say PVCs from one application may need to be snapshotted together, because the group snapshot is really about achieving data consistency across the application, similar to pre/post hooks. However, there is no real entity for an application in Kubernetes, so in order to achieve group snapshot support, more details need to be considered.

Moreover, we also need to consider how to support data movement of grouped snapshots.

So I suggest we either work through the details and create a complete solution for group snapshots, or we drop the topic from this PR and leave it to a separate PR in the future.

Contributor:

Not personally opposed to waiting on the grouping of PVCs, but I did want to ask one question.

If we have the snapshot that is taken in a consistent way, do we need to change the data movement? I assume that this process could happen at slightly different times because the data itself should be consistent, and the movement of the bytes shouldn't impact the consistency, or am I missing something?

@Lyndon-Li (Contributor) commented Oct 19, 2023:

The data movement doesn't impact the consistency.
On the other hand, a group snapshot does impact the data mover's snapshot exposure and manipulation. For example, the snapshot-associated objects are different (VolumeGroupSnapshot and VolumeGroupSnapshotContent), so without changes the data movement cannot support it.

Contributor:

Oh, I agree. Once those APIs go GA, or rather before they go GA or beta, we should support them.

I guess I am confused about how this impacts the current idea.

The way I see this design, and I could be way off, is that when we have no information about the system other than that a pod is using multiple PVCs via CSI, we should trigger the CSI backups in a way that leads to consistent data, as well as speeds up the backup time.

Is it not a good goal, given the above (and ignoring all the work around groups and such), to do this incremental thing that helps in multiple ways? @anshulahuja98 is the above understanding what you are going for?

Collaborator Author:

Yes, correct @shawn-hurley.
We can decouple VolumeGroupSnapshot from this design since it's not even beta yet.

Contributor:

See these comments: to me, the current change won't help to improve consistency or performance in the NORMAL case. At present, with the help of BIAv2, there are two phases for CSI snapshot creation:

  1. Create the snapshot and wait for the snapshot handle to appear. This phase runs sequentially across PVCs. If we dig into the process of snapshot creation on the storage side, it should be very fast in the NORMAL case.
  2. Wait for the snapshot to become ReadyToUse. Here, data is moved on the storage side, so it takes time. This phase runs in parallel.

Now, let's see the differences with and without the current changes regarding consistency and performance improvements in the NORMAL case:

  • For consistency: I think the idea is to have the PVCs in one pod snapshotted as close together as possible. Because phase 1 is very fast and phase 2 is asynchronous, PVCs from one pod are actually already snapshotted very close together; between them are only simple resource backups. If we simply make the backup workflow async as in the current changes, I don't know how much more difference it makes. Moreover, don't forget the capability of the CSI driver; both the CSI driver and the storage may have limits on how many snapshots can run together.
  • For performance: since phase 2, the most time-consuming part, is already async, I don't know how much performance improvement there is in making the main backup workflow async once more.

You may argue that the claim that phase 1 is very fast is not always true for all storages. I agree, but I would rather regard those as flaws of the storage itself, because technically this can be very fast.

Anyway, we at least need to do some tests to prove how much improvement there is for consistency and performance in various environments; then we can come back and consider the current changes.
Forgive my caution on these changes: they make the primary workflow very different, and many unexpected problems could come with that. I have listed some, but I cannot tell all. So if we don't know the benefits for sure, I don't think it is a good bargain.
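For readers unfamiliar with the two-phase split described above, a simplified sketch follows. It polls an unstructured VolumeSnapshotContent for `status.snapshotHandle` (phase 1) and `status.readyToUse` (phase 2); the field paths follow the CSI snapshot API, but the helpers are illustrative and not the plugin's actual code.

```go
package csisketch

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// getVSC is a placeholder for fetching the VolumeSnapshotContent as unstructured.
type getVSC func(ctx context.Context, name string) (*unstructured.Unstructured, error)

// waitForSnapshotHandle is phase 1: block until the storage side has accepted
// the snapshot and reported a handle. In the normal case this returns quickly.
func waitForSnapshotHandle(ctx context.Context, get getVSC, name string, interval time.Duration) error {
	return pollUntil(ctx, interval, func() (bool, error) {
		vsc, err := get(ctx, name)
		if err != nil {
			return false, err
		}
		handle, found, err := unstructured.NestedString(vsc.Object, "status", "snapshotHandle")
		return found && err == nil && handle != "", nil
	})
}

// waitForReadyToUse is phase 2: wait for the snapshot data to be fully cut.
// Under BIAv2 this part already runs asynchronously, outside the main workflow.
func waitForReadyToUse(ctx context.Context, get getVSC, name string, interval time.Duration) error {
	return pollUntil(ctx, interval, func() (bool, error) {
		vsc, err := get(ctx, name)
		if err != nil {
			return false, err
		}
		ready, found, err := unstructured.NestedBool(vsc.Object, "status", "readyToUse")
		return found && err == nil && ready, nil
	})
}

// pollUntil is a minimal polling helper used by both phases.
func pollUntil(ctx context.Context, interval time.Duration, done func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := done()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for snapshot: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```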

Contributor:

My understanding is that aws-ebs, as well as Azure disks, can take a long time to snapshot. I may be recalling wrong.

Can someone verify my recollection? I can't find any documentation.

Contributor:

The speed of snapshot-taking is not only related to the storage provider.
It's been reported that snapshot handle creation for large GCP persistent disks takes more than 10 minutes, but IMO those are not common cases for most storage providers.

- Invoke current CSI plugin's Execute() in parallel for a group of PVCs.
- Perf implications of parallel calls?

## Approach 3: Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
Contributor:

For the same consideration of data consistency across an application, we cannot assume that the pre/post plugin operates on a pod basis. Since we are changing the workflow dramatically, we need to target the ultimate solution instead of building on the current situation.

Collaborator Author:

I don't think we even have a tracking issue for application-level pre/post hooks.
If that's not even on the near-term roadmap, I would suggest we assume the current flow and optimize it.

If an app-consistent option comes in the future, we can try to accommodate it in a similar way. I am not sure how this blocks the current PR in any way.

<!-- ## Background -->

## Goals
- Reduce the time to take CSI snapshots.
@Lyndon-Li (Contributor) commented Oct 19, 2023:

I agree that there are cases where the call ends up in an unexpected wait, but I also think they are not entirely Velero's fault; they are mostly down to the specific CSI drivers. Otherwise, the drivers should detect those cases and fail earlier, and then Velero would fail earlier too.
Therefore, for this single goal, I don't think a dramatic workflow change in Velero is a good bargain.

Collaborator Author:

We can special-case the code to only apply to CSI snapshots,
and further we will make it configurable with a default parallelism of 1. This won't lead to any drastic workflow change.
If users have CSI drivers which are more performant, they can increase the parallel count.

In yesterday's community meeting we even discussed that, based on how this plays out in the real world, we can apply the same parallel approach to other parts of the backup/restore flow to enhance Velero's performance.


## Goals
- Reduce the time to take CSI snapshots.
- Ensure current behaviour of pre and post hooks in context of PVCs attached to a pod.
Contributor:

Pre/post hooks exist to achieve better data consistency. But ultimately, consistency is not on a pod basis but on an application basis.
Therefore, the current approach of having hooks on a pod basis is not the ultimate solution.

Collaborator Author:

Currently the Velero code works that way, at a pod level.
I don't see any tracking issue to introduce app-level consistency. If that's not even on the roadmap, I'd suggest decoupling it from this perf enhancement.

### Approach 1: Add support for VolumeGroupSnapshot in Velero.
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots for multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
Contributor:

Some details I can come up with, and I think there may be more:

  1. There may be combinations of different backup types. For example, some volumes may be backed up by fs-backup and others by CSI snapshots, and we don't even know which type a volume goes to before the workflow for that backup type is actually launched (considering the opt-in/opt-out options for fs-backup).
  2. Pods are not the only resources that result in PVC backups; other resources may as well, for example VirtualMachine resources for KubeVirt or other k8s-managed VM solutions. So I am afraid that, in order to get all PVCs, a PodAction plugin is not enough.

Collaborator Author:

The aim is not to cover all PVCs; the scope of the current PR is to optimize for CSI snapshots.
Other scenarios can be looked into based on perf requirements.

@Lyndon-Li (Contributor) commented Oct 19, 2023

@anshulahuja98 Let me reply here centrally; this answers your comments above.

Actually, I am somewhat confused about what the current PR is going to achieve:
1. Do we want a design to support group snapshots?
As mentioned above, there are far more details to cover, and the current Velero workflow is also not quite ready for group snapshots. So I don't think the approaches in this PR could support it.

2. Do we want to solve the performance problem in the unexpected cases?
On the one hand, we regard these as uncommon cases that don't happen normally; on the other hand, we are changing the generic workflow of resource traversal in order to solve them. Of course, we can say that we only want to target pod PVCs for CSI snapshots; however, since we are changing the generic workflow, we have to consider that it should work for all cases.
For example, we want to collect the PVCs before really backing them up, but without going deeply into the specific backup workflow, we have no way to tell if a PVC should be handled by fs-backup, CSI snapshot, or CSI snapshot data movement.
Moreover, we are actually changing the generic workflow dramatically --- the PVCs are first enumerated/collected and then backed up. I am afraid there are more unpredictable cases.

Finally, if this goal is the primary thing we focus on, I think we may have other ways with smaller impact; we can discuss.

3. Do we want to introduce a new way to group PVCs on a pod basis?
If so, what problem do we want to solve through this approach?

  • It doesn't bring better data consistency; the existing workflow already guarantees quiescence and un-quiescence at the pod level.
  • It doesn't represent the future; as mentioned regarding application-scope consistency, doing it on a pod basis is not enough.
  • It doesn't improve performance in the normal case; there, we have already moved the most time-consuming part into async operations under BIAv2.

- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots for multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
- Invoke current CSI plugin's Execute() in parallel for a group of PVCs.
Contributor:

I think in practice, the idea is to group CSI PVCs in some way (for example, group the PVCs per pod or namespace, with a maximum of n entries per group), and to back up the PVCs in each group in parallel, regardless of whether it's via the CSI plugin or not.

At the code level, the backupper may evolve to support item_groups, and it would call backupItem in parallel for the entries in one group; will that work? This concept of item_groups may help us improve the parallelism within one backup in the future.

Collaborator Author:

If you check the code changes in the current PR, it effectively solves for item_group by invoking backupItem in parallel for additionalItems.

In future this can be extended to other resources in a similar way, where we write a BIA to group items and then back them up in parallel.
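A sketch of what such an item_group could look like at the backupper level, with a configurable limit so that parallelism defaults to 1; `backupItem` here is a stand-in for the real method.

```go
package backupsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// backupItemGroup backs up every item in one group with at most `parallel`
// in-flight backups. With parallel == 1 the behaviour matches today's
// sequential workflow, which keeps the change opt-in.
func backupItemGroup(ctx context.Context, items []string, parallel int, backupItem func(context.Context, string) error) error {
	if parallel < 1 {
		parallel = 1
	}
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(parallel)

	for _, item := range items {
		item := item // capture loop variable
		g.Go(func() error {
			return backupItem(ctx, item)
		})
	}
	// Wait returns the first non-nil error; individual failures should also be
	// logged by backupItem itself so nothing is lost.
	return g.Wait()
}
```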

@shawn-hurley (Contributor):

@anshulahuja98 Will this speed up all plugin calls, or just the ones for pods?

Is there a way to run all the plugins for a given resource X asynchronously? Considering that plugins can add significant time to the backup of a particular resource, I wonder if this could help many folks who have third-party plugins.

@sseago (Collaborator) commented Nov 6, 2023

@shawn-hurley The problem with running plugin actions async is that the primary action of item action plugins is to modify the content of the resource that's being backed up or restored. So for the backup case, if an item has 3 plugins that apply to it, the Execute calls of the 3 plugins are called sequentially, each getting the resource as modified by the previous plugin. After this, the final modified resource content is saved in the backup.

Note that this design doesn't call for running any plugin Execute calls asynchronously (even for pods). What happens asynchronously is the backup of any additional items returned by plugin calls (which represent other items that must be backed up right now, before continuing with the other items). If a plugin returns 5 items that all need to be backed up now, with this design they will all be backed up in parallel, with the current item's backup func waiting until they're all done. For each of these in-parallel items being backed up, their own plugins will be called sequentially, just as with any other item.

Also, this is not necessarily only going to happen for pods -- any resource with a plugin that returns a list of additional items will have those backed up in parallel before continuing -- although at the moment, I think pods may be the only type for which a plugin is likely to return multiple items -- PVCs in this case.

@sseago (Collaborator) commented Nov 6, 2023

@shawn-hurley Note that a plugin can have an asynchronous component to its action -- that's where backup item operations fit in (and they are used for CSI snapshotting) -- but the plugin itself doesn't launch a direct asynchronous action. It simply starts an action that might take a while and runs in the background under the control of something else (i.e., create a VolumeSnapshot and let the CSI infrastructure do its thing), and the plugin returns with the operation ID. The only other plugin involvement from that point forward is that Velero will later call the plugin's Progress method (passing in the operation ID), so that the plugin can tell Velero whether the operation is complete or not yet done.
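To make the contract concrete, here is an abstract outline of that flow using stand-in types; the real BIAv2 interface lives in Velero's plugin framework and carries more parameters, so treat this purely as a conceptual sketch.

```go
package biav2sketch

import "errors"

// operationProgress is a stand-in for the progress report Velero polls for.
type operationProgress struct {
	Completed bool
	Err       string
}

// asyncSnapshotAction models the shape described above: execute only kicks off
// the long-running work and returns an operation ID; progress is called later
// with that ID to check whether the background work has finished.
type asyncSnapshotAction struct {
	operations map[string]func() operationProgress // opID -> status lookup
}

func newAsyncSnapshotAction() *asyncSnapshotAction {
	return &asyncSnapshotAction{operations: map[string]func() operationProgress{}}
}

// execute starts the snapshot (startSnapshot stands in for creating the
// VolumeSnapshot) and returns immediately with an operation ID.
func (a *asyncSnapshotAction) execute(pvcName string, startSnapshot func(string) (string, func() operationProgress, error)) (string, error) {
	opID, status, err := startSnapshot(pvcName)
	if err != nil {
		return "", err
	}
	a.operations[opID] = status
	return opID, nil
}

// progress is what Velero would call periodically until Completed is true.
func (a *asyncSnapshotAction) progress(opID string) (operationProgress, error) {
	status, ok := a.operations[opID]
	if !ok {
		return operationProgress{}, errors.New("unknown operation ID")
	}
	return status(), nil
}
```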

@shawn-hurley (Contributor):

When, say, three plugins are run sequentially, are they always run in the same order? What determines that order?

@sseago (Collaborator) commented Nov 6, 2023

@shawn-hurley The order is always the same with a given set of plugins, but it's best not to rely on it, since when one plugin is written, you're never sure what other plugins will be added later. But the order is as follows. First, Velero gets a list of binaries that include plugins -- first is "velero", to capture item actions included in Velero itself -- then Velero looks in the plugin dir in the pod and grabs the plugin binaries in filename order. So, for example, in my current cluster:

$ ls /plugins
Dockerfile  velero-plugin-for-aws  velero-plugin-for-csi  velero-plugins

This means internal plugin actions are returned first, then AWS plugin actions, then CSI, then openshift velero plugin actions. The AWS plugin doesn't contain BIAs, so in practice it's internal actions, then CSI, then openshift (but if the binary we built were called "openshift-velero-plugin", as it probably should be, then those actions would run before CSI). Since we don't have any interdependencies between CSI actions and openshift actions, that won't matter in practice.

Then, for each plugin binary, a list of registered plugin actions is returned, sorted by the name each action was registered under.

In other words, if you're a plugin author, you can control what order plugins within your own binary are called, and you know they'll be called after any internal actions, but it's best not to make assumptions about whether there are other plugins registered and, if so, when their actions run.

@blackpiglet (Contributor):

Another possible solution is integrating the CSI plugin into the Velero code base; then we don't need to consider how to make the plugins work in parallel.
Goroutines are enough to do that.

@anshulahuja98 (Collaborator Author):

> Another possible solution is integrating the CSI plugin into the Velero code base; then we don't need to consider how to make the plugins work in parallel. Goroutines are enough to do that.

I think I discussed this with @sseago, but that would require a lot more effort, since the current CSI data mover leverages the plugin format of returning additional items, etc.

The item itself would be more expensive to light up, given we'd need to duplicate a bunch of plugin logic in core code.
But yes, it would make life easier for concurrency of snapshots.

@blackpiglet (Contributor):

> Another possible solution is integrating the CSI plugin into the Velero code base; then we don't need to consider how to make the plugins work in parallel. Goroutines are enough to do that.
>
> I think I discussed this with @sseago, but that would require a lot more effort, since the current CSI data mover leverages the plugin format of returning additional items, etc.
>
> The item itself would be more expensive to light up, given we'd need to duplicate a bunch of plugin logic in core code. But yes, it would make life easier for concurrency of snapshots.

Indeed, integrating the CSI plugin requires quite some effort, but it is not just duplicating the CSI plugin code into the Velero server code base. If we choose to go this way, there will be no CSI plugin in future releases.

@blackpiglet (Contributor) commented Jan 25, 2024

I checked whether it's possible to skip the waiting on some specific errors.
IMO, this should work, but it's not a general solution.
Of course, the CSI plugin code can easily quit polling the VSC when the VSC's Status.Error.Message contains some special information, but the error message is returned by the vendor-provided CSI snapshot driver. It's not easy to know what the error message content is, and the CSI external-snapshotter has logic to check whether the CSI driver's error is a final error. If it's a final error, the external-snapshotter will delete the VSC on the next reconciliation. If that happens, the Velero CSI plugin will quit polling due to a not-found error.

From the error-handling perspective, I think the vendor-provided CSI driver should mark the error as final when it's not reasonable to retry the snapshot creation.

https://github.com/kubernetes-csi/external-snapshotter/blob/fc49f3258b050c7c0f9f0ea5470b5cd51c5707f5/pkg/sidecar-controller/snapshot_controller.go#L341-L347
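Such a check would look roughly like the sketch below: while polling the VSC, treat certain vendor error strings as terminal and stop waiting early. The substrings are deployment-specific configuration, which is exactly why this is a workaround rather than a general solution.

```go
package csisketch

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// checkForTerminalError inspects a VolumeSnapshotContent's status.error.message
// and reports whether it matches one of the configured "give up now" substrings.
// The substrings are vendor/deployment specific, so this does not generalize
// well across CSI drivers.
func checkForTerminalError(vsc *unstructured.Unstructured, terminalSubstrings []string) error {
	msg, found, err := unstructured.NestedString(vsc.Object, "status", "error", "message")
	if err != nil || !found || msg == "" {
		return nil // no error reported yet; keep polling
	}
	for _, s := range terminalSubstrings {
		if strings.Contains(msg, s) {
			return fmt.Errorf("giving up on VolumeSnapshotContent %s: terminal CSI error: %s", vsc.GetName(), msg)
		}
	}
	return nil // non-terminal error; the external-snapshotter may still retry
}
```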

@anshulahuja98 (Collaborator Author):

> I checked whether it's possible to skip the waiting on some specific errors. IMO, this should work, but it's not a general solution. Of course, the CSI plugin code can easily quit polling the VSC when the VSC's Status.Error.Message contains some special information, but the error message is returned by the vendor-provided CSI snapshot driver. It's not easy to know what the error message content is, and the CSI external-snapshotter has logic to check whether the CSI driver's error is a final error. If it's a final error, the external-snapshotter will delete the VSC on the next reconciliation. If that happens, the Velero CSI plugin will quit polling due to a not-found error.
>
> From the error-handling perspective, I think the vendor-provided CSI driver should mark the error as final when it's not reasonable to retry the snapshot creation.
>
> https://github.com/kubernetes-csi/external-snapshotter/blob/fc49f3258b050c7c0f9f0ea5470b5cd51c5707f5/pkg/sidecar-controller/snapshot_controller.go#L341-L347

We have actually done this exact workaround in our downstream consumption.
That is why this particular PR has not been that high a priority.

Again, to call out: this is still a workaround, and scaling it to multiple vendors might be tricky.

@blackpiglet (Contributor):

@anshulahuja98
In vmware-tanzu/velero-plugin-for-csi#226 (comment), when discussing async operation error handling, I proposed a way to retry on error with a retry limit.
Maybe we can also use that here.

Labels: Area/Design (Design Documents)
Successfully merging this pull request may close these issues: Improve CSI Snapshotting Performance
6 participants