Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: Pageserver Timeline Archival #8088

Open
27 of 45 tasks
jcsp opened this issue Jun 18, 2024 · 8 comments
Open
27 of 45 tasks

Epic: Pageserver Timeline Archival #8088

jcsp opened this issue Jun 18, 2024 · 8 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic t/feature Issue type: feature, for new features or requests

Comments

@jcsp
Copy link
Collaborator

jcsp commented Jun 18, 2024

Purpose

Enable users to create branches fearlessly, without worrying about hitting branch count limits & without having to worry about cleaning up old branches unless they want to.

Background

Currently, all timelines have significant physical overhead on the pageserver, even if they haven't been used for days/weeks/months:

  • scanning timeline's remote storage path on tenant startup & load their index
  • pinning some of the timeline's layers into local storage for logical size calculations
  • running a wal receiver for the timeline

Changes

This section isn't an authoritative design, but calls out functional areas that will need work.

  • We'll need some manifest in remote storage that the tenant can read on startup to learn which timelines should be loaded in an active state, vs. which timelines are hibernated. Keeping this properly up to date with timeline create/delete operations will be a key correctness point.
  • Persist enough information about hibernated timelines that we can know their logical size (& any other key stats) without having to load them fully. It probably makes sense to inline this into the per-tenant object that lists the timelines.
  • Our runtime state in Tenant will need to only store active timelines in Tenant::timelines, and have some other map of hibernated timelines.
  • APIs that list timelines will need either to change their semantics to only report active timelines, to avoid unreasonably large responses when users have many thousands of branches -- or paginated/queryable. Bu
  • An external API to enable the control plane to tell us when a timeline should be hibernated or awoken. We could also choose to auto-hibernate after some period of inactivity, but that might be duplicative wrt the externally driven mechanism.`.
  • A cache-warming routine that loads enough layers to serve reads at the tip of the branch, so that when we activate a timeline, the user doesn't encounter a long slow period while data is promoted to local storage.

Tasks

  1. c/storage/pageserver t/feature
    jcsp
  2. a/tech_debt c/storage/pageserver
  3. arpad-m
  4. arpad-m
  5. arpad-m
  6. c/storage/pageserver
  7. c/storage/pageserver t/feature
    jcsp
@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic labels Jun 18, 2024
@jcsp jcsp changed the title Epic: Pageserver Timeline Hibernation Epic: Pageserver Timeline Archival Jul 1, 2024
jcsp added a commit that referenced this issue Jul 3, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
VladLazar pushed a commit that referenced this issue Jul 8, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
jcsp added a commit that referenced this issue Jul 11, 2024
A design for a cheap low-resource state for idle timelines:
- #8088
skyzh pushed a commit that referenced this issue Jul 15, 2024
A design for a cheap low-resource state for idle timelines:
- #8088
arpad-m added a commit that referenced this issue Jul 19, 2024
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of #8088
problame pushed a commit that referenced this issue Jul 22, 2024
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of #8088
@arpad-m
Copy link
Member

arpad-m commented Jul 22, 2024

arpad-m added a commit that referenced this issue Jul 22, 2024
arpad-m added a commit that referenced this issue Jul 27, 2024
Persists whether a timeline is archived or not in `index_part.json`. We
only return success if the upload has actually worked successfully.

Also introduces a new `index_part.json` version number.

Fixes #8459

Part of #8088
@arpad-m
Copy link
Member

arpad-m commented Aug 19, 2024

This week:

arpad-m added a commit that referenced this issue Oct 21, 2024
Add a way to list the offloaded timelines.

Before, one had to look at logs to figure out if a timeline has been
offloaded or not, or use the non-presence of a certain timeline in the
list of normal timelines. Now, one can list them directly.
 
Part of #8088
arpad-m added a commit that referenced this issue Oct 22, 2024
Persist timeline offloaded state to S3.

Right now, as of #8907, at each restart of the pageserver, all offloaded
state is lost, so we load the full timeline again. As it starts with an
empty local directory, we might potentially download some files again,
leading to downloads that are ultimately wasteful.

This patch adds support for persisting the offloaded state, allowing us
to never load offloaded timelines in the first place. The persistence
feature is facilitated via a new file in S3 that is tenant-global, which
contains a list of all offloaded timelines. It is updated each time we
offload or unoffload a timeline, and otherwise never touched.

This choice means that tenants where no offloading is happening will not
immediately get a manifest, keeping the change very minimal at the
start.

We leave generation support for future work. It is important to support
generations, as in the worst case, the manifest might be overwritten by
an older generation after a timeline has been unoffloaded (and
unarchived), so the next pageserver process instantiation might wrongly
believe that some timeline is still offloaded even though it should be
active.

Part of #9386, #8088
arpad-m added a commit that referenced this issue Oct 25, 2024
Before, we didn't copy over the `index-part.json` of offloaded timelines
to the new shard's location, resulting in the new shard not knowing the
timeline even exists.

In #9444, we copy over the manifest, but we also need to do this for
`index-part.json`.

As the operations to do are mostly the same between offloaded and
non-offloaded timelines, we can iterate over all of them in the same
loop, after the introduction of a `TimelineOrOffloadedArcRef` type to
generalize over the two cases. This is analogous to the deletion code
added in #8907.

The added test also ensures that the sharded archival config endpoint
works, something that has not yet been ensured by tests.

Part of #8088
arpad-m added a commit that referenced this issue Oct 25, 2024
This PR does two things:

1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This
prevents two unoffload tasks from racing with each other. While they
already obtain locks for `timelines` and `offloaded_timelines`, they
aren't sufficient, as we have already constructed an entire timeline at
that point. We shouldn't ever have two `Timeline` objects in the same
process at the same time.
2. don't allow timeline creations for timelines that have been
offloaded. Obviously they already exist, so we should not allow
creation. the previous logic only looked at the timelines list.

Part of #8088
arpad-m added a commit that referenced this issue Oct 26, 2024
Archived timelines should not count towards synthetic size.

Closes #9384.

Part of #8088.
@arpad-m
Copy link
Member

arpad-m commented Oct 28, 2024

Last week:

This week:

  • after 9519 merges, enable offloading on staging
  • test for retain_lsn functionality of offloaded branches
  • return error if trying to detach ancestor of archived or offloaded timeline, demanding all children to be unarchived first.
  • make deletion code check that the timeline also has no offloaded children in addition to no non-offloaded ones

arpad-m added a commit that referenced this issue Oct 29, 2024
Currently, all callers of `unoffload_timeline` ensure that the tenant
the unoffload operation is called on is active. We rely on it being
active as we activate the timeline below and don't want to race with the
activation code of the tenant (in the worst case, activating a timeline
twice).

Therefore, add this assertion.

Part of #8088
jcsp pushed a commit that referenced this issue Oct 29, 2024
As pointed out in
#9489 (comment) ,
we currently didn't support deletion for offloaded timelines after the
timeline has been loaded from the manifest instead of having been
offloaded.

This was because the upload queue hasn't been initialized yet. This PR
thus initializes the timeline and shuts it down immediately.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
Disallow a request for timeline ancestor detach if either the to be
detached timeline, or any of the to be reparented timelines are
offloaded or archived.

In theory we could support timelines that are archived but not
offloaded, but archived timelines are at the risk of being offloaded, so
we treat them like offloaded timelines. As for offloaded timelines, any
code to "support" them would amount to unoffloading them, at which point
we can just demand to have the timelines be unarchived.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
Constructing a remote client is no big deal. Yes, it means an extra
download from S3 but it's not that expensive. This simplifies code paths
and scenarios to test. This unifies timelines that have been recently
offloaded with timelines that have been offloaded in an earlier
invocation of the process.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
If we delete a timeline that has childen, those children will have their
data corrupted. Therefore, extend the already existing safety check to
offloaded timelines as well.

Part of #8088
@arpad-m
Copy link
Member

arpad-m commented Nov 4, 2024

Last week I made and merged a lot of pull requests. Although some of them are quite small, they fix a lot of possible misuses/edge cases that can lead to corruption:

There has also been work by John for #9386, to make manifests more robust/generation ready:

This week:

  • merge the two open PRs to add a test and make it possible to enable offloading for single tenants
  • scrubber changes for persistence for offloaded state #9386, also to delete old generations of manifest
  • monitor staging and see if there is corruptions

So it's going really well, and work is mostly complete. Now the main task is to ensure it rolls out safely, with us reducing the impact of any possible issue by doing a staged rollout.

arpad-m added a commit that referenced this issue Nov 4, 2024
Allow us to enable timeline offloading for single tenants without having
to enable it for the entire pageserver.

Part of #8088.
@arpad-m
Copy link
Member

arpad-m commented Nov 11, 2024

last week:

this week:

I'll focus on the scrubber side of #9386, and continuing to analyze tenants.

arpad-m added a commit that referenced this issue Nov 11, 2024
Add a test that ensures the `retain_lsn` functionality works. Right now,
there is not a single test that is broken if offloaded or non-offloaded
timelines don't get registered at their parents, preventing gc from
discarding the ancestor_lsns of the children. This PR fills that gap.

The test has four modes:

* `offloaded`: offload the child timeline, run compaction on the parent
timeline, unarchive the child timeline, then try reading from it.
hopefully the data is still there.
* `offloaded-corrupted`: offload the child timeline, corrupts the
manifest in a way that the pageserver believes the timeline was
flattened. This is the closest we can get to pretend the `retain_lsn`
mechanism doesn't exist for offloaded timelines, so we can avoid adding
endpoints to the pageserver that do this manually for tests. The test
then checks that indeed data is corrupted and the endpoint can't be
started. That way we know that the test is actually working, and
actually tests the `retain_lsn` mechanism, instead of say the lsn lease
mechanism, or one of the many other mechanisms that impede gc.
* `archived`: the child timeline gets archived but doesn't get
offloaded. this currently matches the `None` case but we might have
refactors in the future that make archived timelines sufficiently
different from non-archived ones.
* `None`: the child timeline doesn't even get archived. this tests that
normal timelines participate in `retain_lsn`. I've made them locally not
participate in `retain_lsn` (via commenting out the respective
`ancestor_children.push` statement in tenant.rs) and ran the testsuite,
and not a single test failed. So this test is first of its kind.

Part of #8088.
arpad-m added a commit that referenced this issue Nov 15, 2024
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there is more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to
corruption in the worst case:

1. Deleting of an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situatoin is a bit trickier than for
non-offloaded timelines which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where a `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc
neondatabase/cloud#20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Alex Chi Z <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic t/feature Issue type: feature, for new features or requests
Projects
None yet
Development

No branches or pull requests

2 participants