Fix #8132: managed fields failure during Restore #8133

Open · wants to merge 1 commit into base: main from managed-fields-fix

Conversation

@mpryc (Contributor) commented Aug 20, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This commit addresses issue #8132, where an error randomly appears in the logs during the restore operation.

The error occurs due to a race condition when attempting to patch managed fields on an object that has been modified in the cluster. The error message indicates that the operation cannot be fulfilled because the object has been modified, suggesting that changes should be applied to the latest version.

To resolve this, a retry mechanism has been implemented in the restore process when encountering this error, ensuring that managed fields are properly restored without the error message appearing in the logs.

Does your change fix a particular issue?

Fixes #8132

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.
    • not needed

@mpryc (Contributor, Author) commented Aug 20, 2024

/kind changelog-not-required

@github-actions bot added the kind/changelog-not-required label (PR does not require a user changelog; often for docs, website, or build changes) on Aug 20, 2024
@anshulahuja98 (Collaborator)

I would prefer if this is backported to 1.13 as well

@blackpiglet (Contributor) commented Aug 20, 2024

@mpryc
I suggest adding a changelog file for this PR.

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 37.03704% with 17 lines in your changes missing coverage. Please review.

Project coverage is 59.07%. Comparing base (86963bf) to head (b091d49).
Report is 140 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/restore/restore.go | 37.03% | 13 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8133      +/-   ##
==========================================
+ Coverage   58.99%   59.07%   +0.08%     
==========================================
  Files         364      364              
  Lines       30270    30310      +40     
==========================================
+ Hits        17858    17906      +48     
+ Misses      10965    10959       -6     
+ Partials     1447     1445       -2     

☔ View full report in Codecov by Sentry.

@kaovilai (Member)

Potential for scope overlap with #8063

@mpryc (Contributor, Author) commented Aug 21, 2024

@mpryc I suggest adding a changelog file for this PR.

Added. I don't know how to remove the kind/changelog-not-required label once it has been added; possibly the owner of the repo can do that.

@mpryc (Contributor, Author) commented Aug 21, 2024

Potential for scope overlap with #8063

I am not convinced about that. #8063 is about retrying status updates, while this fix is about a race condition on a particular object during the restore operation when applying managed fields. A general design rework of the restore operation could cover both, but that's a bigger chunk of work.

In short, this fix addresses the situation where, after the object is first created in the cluster at:

createdObj, restoreErr = resourceClient.Create(obj)

and before the patch for managed fields is calculated at:

withoutManagedFields := createdObj.DeepCopy()

there are a number of operations on the in-cluster object, including the status update, so it can happen that the object being patched no longer represents the current cluster version. This is of course done to save API calls to the cluster, and we should only retry such an operation when there is a real error.

I believe this is not really what you are looking into within #8063, as we are explicitly retrying on the object conflict and not on other problems such as an unreachable cluster API:

https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict
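
For illustration, here is a minimal sketch of that idea, using the plain client-go dynamic client instead of Velero's resourceClient wrapper and a two-way JSON merge patch in place of Velero's generatePatch helper; the function name, parameters, and exact patch computation are assumptions for the sketch, not the PR's actual diff:

```go
package restoreexample

import (
	"context"
	"encoding/json"

	jsonpatch "github.com/evanphx/json-patch"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

// patchManagedFieldsWithRetry re-reads the live object and retries the
// managed-fields patch whenever the API server returns a conflict error;
// any other error is returned immediately.
func patchManagedFieldsWithRetry(
	ctx context.Context,
	client dynamic.ResourceInterface,
	name string,
	backedUpManagedFields []metav1.ManagedFieldsEntry,
) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch the current cluster state so the patch is not computed
		// against a stale in-memory copy of the object.
		liveObj, err := client.Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}

		withoutManagedFields := liveObj.DeepCopy()
		liveObj.SetManagedFields(backedUpManagedFields)

		origJSON, err := json.Marshal(withoutManagedFields)
		if err != nil {
			return err
		}
		modJSON, err := json.Marshal(liveObj)
		if err != nil {
			return err
		}
		patchBytes, err := jsonpatch.CreateMergePatch(origJSON, modJSON)
		if err != nil {
			return err
		}

		_, err = client.Patch(ctx, name, types.MergePatchType, patchBytes, metav1.PatchOptions{})
		return err // a conflict here triggers another attempt with a fresh Get
	})
}
```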

@sseago (Collaborator) commented Aug 21, 2024

@kaovilai I agree with @mpryc. This seems to be a completely different retry that's needed: not Velero CRs but restored resources, and not related to API server outages but to conflict errors, so both the error being matched and the backoff/duration profile are completely different.

@sseago removed the kind/changelog-not-required label (PR does not require a user changelog; often for docs, website, or build changes) on Aug 21, 2024
@sseago (Collaborator) commented Aug 21, 2024

@mpryc I removed the changelog-not-required label

@sseago closed this on Aug 21, 2024
@sseago reopened this on Aug 21, 2024
@@ -0,0 +1 @@
Random race condition in the restore with managed fields
Review comment on the changelog file (Collaborator):

@mpryc changelog needs to have PR number, not issue number -- s/8132/8133/ in the filename.

Also, I'd rephrase the changelog to describe the fix rather than the bug (since it will appear in release notes):
"Fixed race condition for conflicts on patching managed fields" or soething like that.

Reply (Contributor, Author):

Changed.

This commit addresses issue vmware-tanzu#8132, where an error randomly
appears in the logs during the restore operation.

The error occurs due to a race condition when attempting
to patch managed fields on an object that has been modified
in the cluster. The error message indicates that the operation
cannot be fulfilled because the object has been modified,
suggesting that changes should be applied to the latest version.

To resolve this, a retry mechanism has been implemented in the restore
process when encountering this error, ensuring that managed fields
are properly restored without the error message appearing in the logs.

Signed-off-by: Michal Pryc <[email protected]>
@mpryc force-pushed the managed-fields-fix branch from 097a9f8 to b091d49 on August 21, 2024 20:26
@ywk253100 (Contributor)

@mpryc The patch API should not report the error mentioned in issue #8132. The cause of #8132 would be that the resourceVersion field is included when doing the patch (the resourceVersion changes in the time window between when the object is created and when it is patched).

So a more reasonable solution would be to make sure the resourceVersion is not included in the patch data, e.g. make the resourceVersion the same when computing the patch data.
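
A minimal sketch of that alternative (illustrative only; the names mirror the restore code, and in the current code withoutManagedFields is a deep copy of createdObj, so the two resourceVersions already match):

```go
withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())

// Force both inputs to carry the same resourceVersion so the field can never
// show up in the generated patch data, and the API server has no version
// precondition to enforce on the patch.
createdObj.SetResourceVersion(withoutManagedFields.GetResourceVersion())

patchBytes, err := generatePatch(withoutManagedFields, createdObj)
```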

@ywk253100 (Contributor) left a review comment:

See my comments

@mpryc (Contributor, Author) commented Aug 28, 2024

So a more reasonable solution would be to make sure the resourceVersion is not included in the patch data, e.g. make the resourceVersion the same when computing the patch data.

@ywk253100 won't this possibly introduce other issues? There is a risk of applying a patch based on an out-of-sync version of the object. This can lead to unintended modifications, conflicts, or even overwriting data that has been updated since the original resourceVersion was retrieved. Kubernetes relies on an optimistic concurrency control model, which we may be breaking by assuming we always have the correct version in our cache.

@ywk253100 (Contributor)

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data, and the only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error. Only a patch that includes resourceVersion, or an update, could report such an error.
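
As a standalone illustration of that point (this is not Velero's generatePatch; it just builds a two-way JSON merge patch between two objects whose only difference is managedFields):

```go
package main

import (
	"encoding/json"
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	createdObj := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "ConfigMap",
		"metadata": map[string]interface{}{
			"name":            "demo",
			"resourceVersion": "12345",
		},
	}}

	// Mirror the restore code: keep a copy without managed fields, then set
	// the backed-up managed fields on the created object.
	withoutManagedFields := createdObj.DeepCopy()
	createdObj.SetManagedFields([]metav1.ManagedFieldsEntry{{Manager: "velero"}})

	origJSON, _ := json.Marshal(withoutManagedFields)
	modJSON, _ := json.Marshal(createdObj)
	patchBytes, _ := jsonpatch.CreateMergePatch(origJSON, modJSON)

	// The diff is confined to metadata.managedFields; resourceVersion is
	// identical in both objects and therefore absent from the patch.
	fmt.Println(string(patchBytes))
}
```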

@mpryc (Contributor, Author) commented Aug 30, 2024

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data, and the only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error. Only a patch that includes resourceVersion, or an update, could report such an error.

@ywk253100 Any attempt to patch an object in the cluster, including fields other than resourceVersion such as status, can result in a conflict if the patch is generated from an outdated in-memory version of the object.

When the patch is applied, it may conflict with the current state of the object, leading to a conflict error. To address this, it's important to implement a retry mechanism. client-go/util/retry is defined specifically for such scenarios and allows the operation to be retried against the latest version of the object when a conflict occurs.

I believe the correct fix is actually to perform a retry on conflict, which ensures that only this scenario is taken into consideration:

https://github.com/kubernetes/client-go/blob/02a19c375c491042890be396d94a26a69da89563/util/retry/util.go#L68-L75

@ywk253100 (Contributor)

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data, and the only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error. Only a patch that includes resourceVersion, or an update, could report such an error.

@ywk253100 Any attempt to patch an object in the cluster, including fields other than resourceVersion such as status, can result in a conflict if the patch is generated from an outdated in-memory version of the object.

When the patch is applied, it may conflict with the current state of the object, leading to a conflict error. To address this, it's important to implement a retry mechanism. client-go/util/retry is defined specifically for such scenarios and allows the operation to be retried against the latest version of the object when a conflict occurs.

I believe the correct fix is actually to perform a retry on conflict, which ensures that only this scenario is taken into consideration:

https://github.com/kubernetes/client-go/blob/02a19c375c491042890be396d94a26a69da89563/util/retry/util.go#L68-L75

Hi @mpryc If you look at the code I pasted, the only difference between the two objects is the ManagedFields, so the patch data should only contain the changed ManagedFields part, right? Do you mean this kind of patch can also cause a conflict?
Per my understanding, it cannot. Correct me if I'm wrong.
And the Kubernetes API doc says

Patches will never cause optimistic locking failures, and the last write will win. Patches are recommended when the full state is not read before an update, or when failing on optimistic locking is undesirable. When patching complex types, arrays and maps, how the patch is applied is defined on a per-field basis and may either replace the field's current value, or merge the contents into the current value.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/

@mpryc (Contributor, Author) commented Sep 2, 2024

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data, and the only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

Hi @mpryc If you look at the code I pasted, the only difference between the two objects is the ManagedFields, so the patch data should only contain the changed ManagedFields part, right?

@ywk253100 I agree with you. While the chances of a conflict error due to the generated patch are minimal, they are still possible, as evidenced by the errors we see in our logs.

I can't think of any other scenarios that could be causing this specific error. Implementing retries on this patch seems to be a reasonable approach to address the issue.

The error "the object has been modified; please apply your changes to the latest version and try again" in the logs is clearly linked to this line and most likely results from a race condition. Although I can't think of any other possible causes, the problem persists. If you have suggestions for a better fix, I'm open to testing them:

if _, err = resourceClient.Patch(name, patchBytes); err != nil {

@ywk253100 (Contributor) commented Sep 3, 2024

@mpryc
I checked the Kubernetes API server code, and it seems only PATCH calls that include resourceVersion, and PUT calls, can cause the conflict error. If that's right, a PATCH without resourceVersion should not cause the error.

If the PATCH operation could cause the conflict error, we would need to add the retry logic for every PATCH call.
So let's spend a bit more time confirming the root cause before merging the PR, WDYT?

BTW, in which versions of Velero and Kubernetes did you see the issue? Is it reproducible?
I saw the failed resource is a namespace; is this issue specific to namespaces or not?

@mpryc (Contributor, Author) commented Sep 3, 2024

@ywk253100
That was visible on Velero 1.14. It's hard for me to check what the generated patch looked like and whether it had the resourceVersion, as this is a random error visible in our logs after a CI run, so more data is lost. If needed, I could build a custom Velero and collect more data. Would the original object, the patched object, and the generated patch be sufficient for further debugging?

@ywk253100 (Contributor)

@ywk253100 That was visible on Velero 1.14. It's hard for me to check what the generated patch looked like and whether it had the resourceVersion, as this is a random error visible in our logs after a CI run, so more data is lost. If needed, I could build a custom Velero and collect more data. Would the original object, the patched object, and the generated patch be sufficient for further debugging?

@mpryc That's OK for the first step of further debugging.

BTW, it seems the managedFields patching logic will not apply to the namespace resource in the normal restore process; it is skipped here: https://github.com/vmware-tanzu/velero/blob/v1.14.1/pkg/restore/restore.go#L739. The only way the patching logic can apply to a namespace is when processing the additionalItems returned by a RestoreItemAction: https://github.com/vmware-tanzu/velero/blob/v1.14.1/pkg/restore/restore.go#L1393

Is there any RestoreItemAction in your deployment that returns the namespace in its additionalItems?

Please also check in your further debugging whether the conflict error is reported only for the namespace resource.

@weshayutin (Contributor)

@shubham-pampattiwar please have a look for @mpryc when you have a moment.
Thank you

@shubham-pampattiwar (Collaborator)

@mpryc can you please help answer @ywk253100's concerns/questions?


Successfully merging this pull request may close these issues.

Random race condition in the restore with managed fields
8 participants