Correct mistakes in offloaded timeline retain_lsn management #9760
Signed-off-by: Alex Chi Z
This doesn't actually reproduce any of the issues I've just fixed. Probably the timeline is still referenced somewhere and thus not dropped?
LGTM; overall I feel we might want to revisit the retain_lsn mechanism; manipulating it in Drop doesn't seem like a good idea... As we get more and more timeline states, it would probably be a good idea in the future to re-compute it every time before gc/compaction?
I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.
But this is done in
yeah, that's what I wanted to say: I mistakenly assumed this when I wrote #9308. Now I know better :)
14e6cfb to 99a959a
5490 tests run: 5247 passed, 0 failed, 243 skipped (full report)
Code coverage (full report), collected from Rust tests only
cef74a5 at 2024-11-15T12:59:31.645Z
Not sure why we can read the basebackup, but shrug.
Documenting the test failures if I comment out the offloading-specific lines, as in this diff:
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 909f99ea9..27de8c2ba 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -542,7 +542,7 @@ fn from_timeline(timeline: &Timeline) -> Result<Self, UploadQueueNotReadyError>
let ancestor_lsn = timeline.get_ancestor_lsn();
let ancestor_timeline_id = ancestor_timeline.timeline_id;
let mut gc_info = ancestor_timeline.gc_info.write().unwrap();
- gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
+ //gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
(Some(ancestor_lsn), Some(ancestor_timeline_id))
} else {
(None, None)
@@ -1988,7 +1988,7 @@ async fn unoffload_timeline(
None => warn!("timeline already removed from offloaded timelines"),
}
- self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
+ //self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
Arc::clone(timeline)
};
@@ -3865,7 +3865,7 @@ fn initialize_gc_info(
return;
};
let ancestor_children = all_branchpoints.entry(*ancestor_timeline_id).or_default();
- ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
+ //ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
});
// The number of bytes we always keep, irrespective of PITR: this is a constant across timelines
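The `initialize_gc_info` hunk above groups branch points by ancestor, including those of offloaded children. A minimal sketch of that grouping follows. It assumes simplified stand-in types: `TimelineId` and `Lsn` are plain integers, and `ChildInfo` and `collect_branchpoints` are hypothetical names introduced only for illustration. This is not the pageserver's actual code.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins (assumption); the real pageserver types are richer.
type TimelineId = u32;
type Lsn = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaybeOffloaded {
    Yes,
    No,
}

/// Hypothetical record of a timeline: its ID, its ancestor (if any),
/// and the LSN at which it branched off that ancestor.
struct ChildInfo {
    timeline_id: TimelineId,
    ancestor_timeline_id: Option<TimelineId>,
    ancestor_lsn: Lsn,
}

/// Group every child's branch point under its ancestor, tagging each entry
/// with whether the child is currently offloaded.
fn collect_branchpoints(
    live: &[ChildInfo],
    offloaded: &[ChildInfo],
) -> BTreeMap<TimelineId, Vec<(Lsn, TimelineId, MaybeOffloaded)>> {
    let mut all_branchpoints: BTreeMap<TimelineId, Vec<(Lsn, TimelineId, MaybeOffloaded)>> =
        BTreeMap::new();
    let tagged = live
        .iter()
        .map(|c| (c, MaybeOffloaded::No))
        .chain(offloaded.iter().map(|c| (c, MaybeOffloaded::Yes)));
    for (child, is_offloaded) in tagged {
        // Root timelines have no ancestor and contribute no branch point.
        let Some(ancestor_id) = child.ancestor_timeline_id else {
            continue;
        };
        all_branchpoints
            .entry(ancestor_id)
            .or_default()
            .push((child.ancestor_lsn, child.timeline_id, is_offloaded));
    }
    all_branchpoints
}
```

If the offloaded children are left out of this pass, as in the third commented-out line of the diff, the ancestor never learns about their branch points.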
PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there are more places than just tenant activation where one needs to update the `retain_lsn`s.
This PR fixes some bugs of the current code that could lead to corruption in the worst case:
1. Deleting an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situation is a bit trickier than for non-offloaded timelines, which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where an `offloaded_timelines` lock is already held: so we don't want to acquire a lock in the drop code of `OffloadedTimeline`.
2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent.
3. Offloading a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object, and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID; now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set.
Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199
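The third point can be illustrated with a small sketch of the parent-side bookkeeping. It assumes a heavily simplified `GcInfo` that holds only a list of `(Lsn, TimelineId, MaybeOffloaded)` entries. `insert_child` follows the argument order seen in the diff, while `remove_child` is a hypothetical helper and not the pageserver's actual API.

```rust
// Simplified stand-ins (assumption); the real pageserver types are richer.
type TimelineId = u32;
type Lsn = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaybeOffloaded {
    Yes,
    No,
}

/// Simplified parent-side state: the branch points that GC must retain.
#[derive(Debug, Default)]
struct GcInfo {
    retain_lsns: Vec<(Lsn, TimelineId, MaybeOffloaded)>,
}

impl GcInfo {
    /// Register a child's branch point, keeping the list sorted by LSN.
    fn insert_child(&mut self, child_id: TimelineId, child_lsn: Lsn, is_offloaded: MaybeOffloaded) {
        self.retain_lsns.push((child_lsn, child_id, is_offloaded));
        self.retain_lsns.sort_by_key(|(lsn, _, _)| *lsn);
    }

    /// Hypothetical helper: remove only the entries of `child_id` whose
    /// offloaded flag matches, leaving the other registration untouched.
    fn remove_child(&mut self, child_id: TimelineId, is_offloaded: MaybeOffloaded) {
        self.retain_lsns
            .retain(|(_, id, flag)| !(*id == child_id && *flag == is_offloaded));
    }
}
```

In this sketch, offloading a child first calls `insert_child(.., MaybeOffloaded::Yes)`; dropping the `Timeline` object afterwards calls `remove_child(.., MaybeOffloaded::No)`, so the offloaded registration survives and the parent keeps retaining the child's branch point.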
The `test_timeline_retain_lsn` test is extended with `offload-parent`, which tests the second point, and `offload-no-restart`, which tests the third point.
It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()`.
Part of #8088