restore: fix retry revert stuck in reverting #156149
Pull Request Overview
This PR fixes a bug where restore jobs would get stuck in an infinite retry loop when attempting to revert after being paused during cleanup. The fix introduces a flag to track whether descriptors have already been dropped during cleanup, preventing redundant drop attempts that would fail due to missing descriptors.
Key Changes:
- Added `dropped_descs_on_fail` flag to `RestoreDetails` proto to track cleanup state
- Modified `dropDescriptors` to skip cleanup if descriptors were already dropped
- Added testing knob and comprehensive test to verify the fix
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/jobs/jobspb/jobs.proto | Added dropped_descs_on_fail boolean field to track descriptor cleanup state |
| pkg/backup/restore_job.go | Implemented early return in dropDescriptors when cleanup already completed and persisted the flag after successful cleanup |
| pkg/sql/exec_util_backup.go | Added AfterRevertRestoreDropDescriptors testing knob for pause injection during cleanup |
| pkg/backup/restore_test.go | Added regression test verifying restore jobs can recover from pauses during revert cleanup |
Previously, if a reverting restore job was paused after dropping its descriptors and then resumed, the restore job would get stuck in a retry loop due to missing descriptors.
This commit adds a field in the restore details that can be set once descriptors have been dropped during restore cleanup. This allows us to avoid attempting to drop descriptors if they have already been dropped.
Fixes: cockroachdb#156019
Release note: Restore no longer gets stuck in a retry loop when reverts are attempted twice.
    return nil
    }

    jobInfo := jobs.InfoStorageForJob(txn, r.job.ID())
Instead of adding a checkpoint, how difficult would it be to make the cleanup logic idempotent so it is safe to run twice? Making the code naturally idempotent is my preferred solution for job retries.
My original implementation was to make the code naturally idempotent. It required adding `if pgerr.GetPGCode(err) == pgcode.UndefinedX` checks in 4 or 5 places.
Since it required more reasoning to determine where to place these checks, it felt more susceptible to bugs if we were to ever come back and add more cleanup logic, which is why I went with the checkpoint approach.
I'm open to being convinced otherwise though.
Looking through the code I think you are right. This would probably need to be rewritten to be cleanly idempotent.
LGTM
TFTR! bors r=jeffswenson
btw, I don't think we need to backport this change. It's not a trivial backport and it's not a severe bug.