Skip to content

Conversation

@kev-cao
Copy link
Contributor

@kev-cao kev-cao commented Oct 25, 2025

Previously, if a reverting restore job was paused after dropping its descriptors and then resumed, the restore job would get stuck in a retry loop due to missing descriptors.

This commit adds a field in the restore details that can be set once descriptors have been dropped during restore cleanup. This allows us to avoid attempting to drop descriptors if they have already been dropped.

Fixes: #156019

Release note: Restore no longer gets stuck in a retry loop when reverts are attempted twice.

@kev-cao kev-cao requested review from a team as code owners October 25, 2025 01:38
@kev-cao kev-cao requested review from jeffswenson and removed request for a team October 25, 2025 01:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler msbutler self-requested a review October 25, 2025 02:31
@kev-cao kev-cao force-pushed the restore/retry-revert branch from a82786a to c2ecaee Compare October 25, 2025 02:46
@kev-cao kev-cao requested a review from Copilot October 25, 2025 02:46
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where restore jobs would get stuck in an infinite retry loop when attempting to revert after being paused during cleanup. The fix introduces a flag to track whether descriptors have already been dropped during cleanup, preventing redundant drop attempts that would fail due to missing descriptors.

Key Changes:

  • Added dropped_descs_on_fail flag to RestoreDetails proto to track cleanup state
  • Modified dropDescriptors to skip cleanup if descriptors were already dropped
  • Added testing knob and comprehensive test to verify the fix

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/jobs/jobspb/jobs.proto Added dropped_descs_on_fail boolean field to track descriptor cleanup state
pkg/backup/restore_job.go Implemented early return in dropDescriptors when cleanup already completed and persisted the flag after successful cleanup
pkg/sql/exec_util_backup.go Added AfterRevertRestoreDropDescriptors testing knob for pause injection during cleanup
pkg/backup/restore_test.go Added regression test verifying restore jobs can recover from pauses during revert cleanup

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@kev-cao kev-cao force-pushed the restore/retry-revert branch 7 times, most recently from f09f957 to 5d27c3d Compare October 28, 2025 14:30
@kev-cao kev-cao added backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 labels Oct 28, 2025
Previously, if a reverting restore job was paused after dropping its
descriptors and then resumed, the restore job would get stuck in a retry
loop due to missing descriptors.

This commit adds a field in the restore details that can be set once
descriptors have been dropped during restore cleanup. This allows us to
avoid attempting to drop descriptors if they have already been dropped.

Fixes: cockroachdb#156019

Release note: Restore no longer gets stuck in a retry loop when reverts
are attempted twice.
@kev-cao kev-cao force-pushed the restore/retry-revert branch from 5d27c3d to c763374 Compare October 28, 2025 16:49
return nil
}

jobInfo := jobs.InfoStorageForJob(txn, r.job.ID())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a checkpoint, how difficult would it be to make the cleanup logic idempotent so it is safe to run twice? Making the code naturally idempotent is my preferred solution for job retries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original implementation was to make the code naturally idempotent. It required adding if pgerr.GetPGCode(err) == pgcode.UndefinedX checks in 4 or 5 places.

Since it required more reasoning to determine where to place these checks, it felt more susceptible to bugs if we were to ever come back and add more cleanup logic, which is why I went with the checkpoint approach.

I'm open to being convinced otherwise though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the code I think you are right. This would probably need to be rewritten to be cleanly idempotent.

Copy link
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kev-cao
Copy link
Contributor Author

kev-cao commented Oct 28, 2025

TFTR!

bors r=jeffswenson

@msbutler
Copy link
Collaborator

btw, i dont think we need to backport this change. It's not a trival backport and it's not a severe bug.

@msbutler msbutler removed backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 labels Oct 28, 2025
@craig
Copy link
Contributor

craig bot commented Oct 28, 2025

@craig craig bot merged commit de81162 into cockroachdb:master Oct 28, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

restore: do not fail on missing descriptors during restore cleanup

4 participants