restore: fix retry revert stuck in reverting #156149

kev-cao · 2025-10-25T01:38:08Z

Previously, if a reverting restore job was paused after dropping its descriptors and then resumed, the restore job would get stuck in a retry loop due to missing descriptors.

This commit adds a field in the restore details that can be set once descriptors have been dropped during restore cleanup. This allows us to avoid attempting to drop descriptors if they have already been dropped.

Fixes: #156019

Release note: Restore no longer gets stuck in a retry loop when reverts are attempted twice.
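The checkpoint idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `restoreDetails`, `catalog`, and the in-memory flag are hypothetical stand-ins. Only the field name (`dropped_descs_on_fail`) and the function name `dropDescriptors` come from the change itself. The real change persists the flag in the job's restore details; the mechanism for that is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// restoreDetails is a hypothetical stand-in for the persisted restore details.
// Only the flag's name mirrors the dropped_descs_on_fail field from the PR.
type restoreDetails struct {
	DroppedDescsOnFail bool
}

// catalog is a toy descriptor store mapping descriptor name to existence.
type catalog map[string]bool

var errMissingDescriptor = errors.New("descriptor not found")

// drop removes a descriptor and fails if it is already gone, which is the
// failure mode that caused the retry loop.
func (c catalog) drop(name string) error {
	if !c[name] {
		return fmt.Errorf("drop %q: %w", name, errMissingDescriptor)
	}
	delete(c, name)
	return nil
}

// dropDescriptors skips cleanup entirely once the checkpoint flag is set,
// so a resumed revert never touches descriptors that no longer exist.
func dropDescriptors(details *restoreDetails, c catalog, names []string) error {
	if details.DroppedDescsOnFail {
		return nil // cleanup already completed on an earlier attempt
	}
	for _, n := range names {
		if err := c.drop(n); err != nil {
			return err
		}
	}
	// Assumption: in a real job this flag would be persisted together with
	// the drops so the two cannot diverge.
	details.DroppedDescsOnFail = true
	return nil
}

func main() {
	details := &restoreDetails{}
	c := catalog{"t1": true, "t2": true}
	names := []string{"t1", "t2"}
	fmt.Println(dropDescriptors(details, c, names)) // first revert attempt
	fmt.Println(dropDescriptors(details, c, names)) // resumed after a pause
}
```

Without the flag check, the second call would fail on the first missing descriptor, which is the stuck-in-reverting behavior this PR addresses.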

Copilot

Pull Request Overview

This PR fixes a bug where restore jobs would get stuck in an infinite retry loop when attempting to revert after being paused during cleanup. The fix introduces a flag to track whether descriptors have already been dropped during cleanup, preventing redundant drop attempts that would fail due to missing descriptors.

Key Changes:

Added `dropped_descs_on_fail` flag to `RestoreDetails` proto to track cleanup state
Modified `dropDescriptors` to skip cleanup if descriptors were already dropped
Added testing knob and comprehensive test to verify the fix

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
pkg/jobs/jobspb/jobs.proto	Added `dropped_descs_on_fail` boolean field to track descriptor cleanup state
pkg/backup/restore_job.go	Implemented early return in `dropDescriptors` when cleanup already completed and persisted the flag after successful cleanup
pkg/sql/exec_util_backup.go	Added `AfterRevertRestoreDropDescriptors` testing knob for pause injection during cleanup
pkg/backup/restore_test.go	Added regression test verifying restore jobs can recover from pauses during revert cleanup
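As a rough illustration of how a testing knob can inject a pause right after descriptors are dropped, consider the sketch below. Only the knob name `AfterRevertRestoreDropDescriptors` comes from the PR. Its `func() error` signature, the `testingKnobs` struct, the `revert` function, and `errPauseRequested` are assumptions made for this example and may not match the actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// testingKnobs is a hypothetical container for test hooks. The field name
// matches the knob added in the PR; its signature is assumed.
type testingKnobs struct {
	AfterRevertRestoreDropDescriptors func() error
}

// errPauseRequested stands in for whatever a test returns to pause the job.
var errPauseRequested = errors.New("pause requested by test")

// revert simulates revert cleanup: it drops descriptors at most once,
// then gives the test a chance to interrupt before the revert finishes.
func revert(knobs *testingKnobs, dropped *bool) error {
	if !*dropped {
		*dropped = true // stands in for dropping descriptors and recording it
	}
	if knobs != nil && knobs.AfterRevertRestoreDropDescriptors != nil {
		if err := knobs.AfterRevertRestoreDropDescriptors(); err != nil {
			return err // the job stops here and is later resumed
		}
	}
	return nil
}

func main() {
	dropped := false
	calls := 0
	knobs := &testingKnobs{
		AfterRevertRestoreDropDescriptors: func() error {
			calls++
			if calls == 1 {
				return errPauseRequested // interrupt only the first attempt
			}
			return nil
		},
	}
	fmt.Println(revert(knobs, &dropped)) // first attempt is interrupted
	fmt.Println(revert(knobs, &dropped)) // resumed attempt completes
}
```

A regression test built this way exercises exactly the window the bug lived in: descriptors are gone, but the revert has not yet completed.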

jeffswenson · 2025-10-28T17:44:02Z

pkg/backup/restore_job.go

 		return nil
 	}

+	jobInfo := jobs.InfoStorageForJob(txn, r.job.ID())


Instead of adding a checkpoint, how difficult would it be to make the cleanup logic idempotent so it is safe to run twice? Making the code naturally idempotent is my preferred solution for job retries.

kev-cao

My original implementation made the code naturally idempotent. It required adding `if pgerr.GetPGCode(err) == pgcode.UndefinedX` checks in 4 or 5 places.

Since it required more reasoning to determine where to place these checks, it felt more susceptible to bugs if we were to ever come back and add more cleanup logic, which is why I went with the checkpoint approach.

I'm open to being convinced otherwise though.

jeffswenson

Looking through the code, I think you are right. This would probably need to be rewritten to be cleanly idempotent.

jeffswenson

LGTM

kev-cao · 2025-10-28T18:51:15Z

TFTR!

bors r=jeffswenson

msbutler · 2025-10-28T19:09:21Z

Btw, I don't think we need to backport this change. It's not a trivial backport and it's not a severe bug.

craig · 2025-10-28T20:08:46Z

Build succeeded:

kev-cao requested review from a team as code owners October 25, 2025 01:38

kev-cao requested review from jeffswenson and removed request for a team October 25, 2025 01:38

msbutler self-requested a review October 25, 2025 02:31

kev-cao force-pushed the restore/retry-revert branch from a82786a to c2ecaee October 25, 2025 02:46

kev-cao requested a review from Copilot October 25, 2025 02:46

Copilot AI reviewed Oct 25, 2025

kev-cao force-pushed the restore/retry-revert branch 7 times, most recently from f09f957 to 5d27c3d October 28, 2025 14:30

kev-cao added the backport-24.3.x, backport-25.2.x, backport-25.3.x, and backport-25.4.x labels Oct 28, 2025

kev-cao force-pushed the restore/retry-revert branch from 5d27c3d to c763374 October 28, 2025 16:49

jeffswenson reviewed Oct 28, 2025

jeffswenson approved these changes Oct 28, 2025

msbutler removed the backport-24.3.x, backport-25.2.x, backport-25.3.x, and backport-25.4.x labels Oct 28, 2025

craig bot merged commit de81162 into cockroachdb:master Oct 28, 2025
24 checks passed

celeste-cockroachdb bot added the target-release-26.1.0 label Oct 28, 2025

kev-cao mentioned this pull request Oct 30, 2025

restore: do not fail on missing descriptors during restore cleanup #156019

Closed
