Epic: Pageserver internal catalog #4636

Closed
LizardWizzard opened this issue Jul 5, 2023 · 3 comments
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@LizardWizzard
Contributor

LizardWizzard commented Jul 5, 2023

Motivation

Currently the pageserver has to maintain a certain amount of metadata. This metadata is used to load tenants and timelines properly and to deal with non-atomic actions (such as bootstrapping a timeline from scratch using initdb).

There are two parts to the problem. The first is startup cost: on a pageserver with 100k tenants, the loading process has to open a ridiculous number of files. The pageserver is supposed to operate with a high number of tenants attached to it, most of them inactive (i.e. without a running compute).

It has to list the <data_dir>/tenants directory and load each tenant's config, which is a separate file. Then for each tenant we list its timelines directory to learn which timelines exist. Then for each timeline we load the metadata file and list the directory to learn which layer files are there.
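To make the cost concrete, here is a minimal sketch of that startup walk that only counts the filesystem operations it performs. The file names `config` and `metadata`, the `timelines` directory name, and the `LoadStats` and `walk_tenants` names are assumptions made for illustration, not the pageserver's actual layout or API.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts of filesystem operations performed by the startup walk.
#[derive(Debug, Default, PartialEq)]
pub struct LoadStats {
    pub dir_listings: usize,
    pub file_reads: usize,
}

/// Walks `<data_dir>/tenants` in the order described above.
/// All file and directory names below are assumed for this sketch.
pub fn walk_tenants(data_dir: &Path) -> io::Result<LoadStats> {
    let mut stats = LoadStats::default();

    // One listing of the tenants directory.
    stats.dir_listings += 1;
    for tenant in fs::read_dir(data_dir.join("tenants"))? {
        let tenant_path = tenant?.path();
        if !tenant_path.is_dir() {
            continue;
        }

        // The tenant config is a separate file per tenant.
        let _config = fs::read(tenant_path.join("config"))?;
        stats.file_reads += 1;

        // One listing per tenant to discover its timelines.
        stats.dir_listings += 1;
        for timeline in fs::read_dir(tenant_path.join("timelines"))? {
            let timeline_path = timeline?.path();
            if !timeline_path.is_dir() {
                continue;
            }

            // One metadata file per timeline.
            let _metadata = fs::read(timeline_path.join("metadata"))?;
            stats.file_reads += 1;

            // One listing per timeline to discover its layer files.
            stats.dir_listings += 1;
            let _entries = fs::read_dir(&timeline_path)?.count();
        }
    }
    Ok(stats)
}
```

With T tenants that each have one timeline, this walk performs 1 + 2T directory listings and 2T file reads, so the work grows linearly with the number of attached tenants even when almost all of them are idle.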

The second part of the problem is the so-called mark files. When we need to perform a non-atomic action, we create a specially named file that indicates the operation has started; if this file is still present later, the operation must have been interrupted. So if the operation was interrupted by a crash or restart, we can either clean up the traces of the unfinished operation or resume it, whichever is more appropriate.

Examples of mark files include TimelineUninitMark, which is used during timeline creation; the tenant ignore mark file, which is used to temporarily exclude a tenant from the working set; and the tenant attaching mark, which is used to continue interrupted attach operations. There are also temporary tenant directories, which are used during tenant initialization.
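The general mark file pattern can be sketched as follows. This is an illustration only: the mark name `uninit.mark`, its placement inside the timeline directory, and both function names are hypothetical and do not reflect how the pageserver actually names or places its marks.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical mark file name used only in this sketch.
const UNINIT_MARK: &str = "uninit.mark";

fn mark_path(timeline_dir: &Path) -> PathBuf {
    timeline_dir.join(UNINIT_MARK)
}

/// Runs the non-atomic `init` step between creating and removing the mark.
/// If `init` fails (or the process dies), the mark stays on disk.
pub fn create_with_mark<F>(timeline_dir: &Path, init: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    fs::create_dir_all(timeline_dir)?;
    // Create the mark and flush it before starting the real work.
    // A production version would also fsync the parent directory.
    File::create(mark_path(timeline_dir))?.sync_all()?;
    init(timeline_dir)?;
    // Only a fully successful run removes the mark.
    fs::remove_file(mark_path(timeline_dir))
}

/// Startup check: a leftover mark means the operation was interrupted,
/// so this sketch removes the partial directory. Returns true if it did.
pub fn cleanup_if_interrupted(timeline_dir: &Path) -> io::Result<bool> {
    if mark_path(timeline_dir).exists() {
        fs::remove_dir_all(timeline_dir)?;
        return Ok(true);
    }
    Ok(false)
}
```

Every operation that needs this protection requires its own mark name, its own creation and removal ordering, and its own startup check, which is where much of the bookkeeping burden comes from.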

Working with mark files is non-trivial and cumbersome. Tenant timeline deletion is another example of that.

DoD

The problems above are solved: there are no mark files, and metadata is stored in a way that allows it to be loaded quickly.

Implementation ideas

  • We can use the Timeline KV store to store metadata. This has the downside that currently this KV store is tied to Postgres semantics. GC and compaction don't work for anything except Postgres pages.
  • Store a log, much like the timeline manifest mentioned in the first link below. This needs some truncation strategy; we can groom it once in a while.
  • Use SQLite or some other embedded solution. This particular one allows for a cool introspection method: we can have an API method that allows querying this internal SQLite database. This can help with debugging and investigations.
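As a rough illustration of the log option and its truncation ("grooming") step, the sketch below keeps the log in memory and leaves persistence out entirely. The `Record` type and the `replay` and `groom` functions are hypothetical names, not an existing manifest format.

```rust
use std::collections::BTreeMap;

/// One entry in the hypothetical metadata log.
#[derive(Debug, Clone, PartialEq)]
pub enum Record {
    Put { key: String, value: String },
    Delete { key: String },
}

/// Replays the log in order to rebuild the current metadata state.
pub fn replay(log: &[Record]) -> BTreeMap<String, String> {
    let mut state = BTreeMap::new();
    for rec in log {
        match rec {
            Record::Put { key, value } => {
                state.insert(key.clone(), value.clone());
            }
            Record::Delete { key } => {
                state.remove(key);
            }
        }
    }
    state
}

/// Grooming: rewrite the log as exactly one `Put` per live key,
/// dropping overwritten values and deleted keys.
pub fn groom(log: &[Record]) -> Vec<Record> {
    replay(log)
        .into_iter()
        .map(|(key, value)| Record::Put { key, value })
        .collect()
}
```

Replaying the groomed log yields the same state as replaying the original, so grooming can run in the background and the new log can replace the old one once it is durably written.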

Note that there also needs to be a migration strategy. Whichever option we pick will need a gradual adoption plan.

One idea is to first concentrate the needed APIs in a Catalog struct, then modify the implementation of Catalog so that it writes metadata into two places, and finally switch it over completely so that it uses only the new solution.
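A minimal sketch of that staged switch might look like this. Only the `Catalog` name comes from the proposal above; the `MetadataStore` trait, the `MemStore` stand-in backend, and the `Phase` enum are assumptions introduced for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical storage interface that both backends implement.
pub trait MetadataStore {
    fn put(&mut self, key: &str, value: &str);
    fn get(&self, key: &str) -> Option<String>;
}

/// In-memory stand-in for either backend (files today, something else later).
#[derive(Default)]
pub struct MemStore(BTreeMap<String, String>);

impl MetadataStore for MemStore {
    fn put(&mut self, key: &str, value: &str) {
        self.0.insert(key.to_string(), value.to_string());
    }
    fn get(&self, key: &str) -> Option<String> {
        self.0.get(key).cloned()
    }
}

/// Migration phases, in the order they would be rolled out.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Phase {
    LegacyOnly,
    DualWrite,
    NewOnly,
}

/// Single entry point for metadata access during the migration.
pub struct Catalog<L: MetadataStore, N: MetadataStore> {
    pub phase: Phase,
    pub legacy: L,
    pub new: N,
}

impl<L: MetadataStore, N: MetadataStore> Catalog<L, N> {
    /// Writes go to whichever backends the current phase includes.
    pub fn put(&mut self, key: &str, value: &str) {
        if self.phase != Phase::NewOnly {
            self.legacy.put(key, value);
        }
        if self.phase != Phase::LegacyOnly {
            self.new.put(key, value);
        }
    }

    /// Reads stay on the legacy backend until the final switch.
    pub fn get(&self, key: &str) -> Option<String> {
        match self.phase {
            Phase::NewOnly => self.new.get(key),
            _ => self.legacy.get(key),
        }
    }
}
```

Note that dual writing alone does not copy entries written before the DualWrite phase began, so a real rollout would presumably also need a backfill step before moving to NewOnly.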

Tasks

No tasks being tracked yet.

Other related tasks and Epics

@LizardWizzard LizardWizzard added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic a/tech_debt Area: related to tech debt labels Jul 5, 2023
@knizhnik
Contributor

knizhnik commented Jul 6, 2023

> This has the downside that currently this kv store is tied to postgres semantics. GC and compaction doesn't work for anything except postgres pages.

KV storage consists of two parts: the first performs the mapping of Postgres keys (pgdatadir_mapping), and the other is a pure KV storage where key and value are opaque.
Also, GC and compaction have nothing to do with Postgres pages: they operate on layers, not on pages.

The main drawback of using KV storage for this data is that the data is more or less temporary, while KV storage is persistent, including eviction to S3. But maybe that is not such a big problem...

> Use SQLite or some other embedded solution.

There may be problems with concurrent access if hundreds of active tenants try to access the same embedded engine. SQLite has a very primitive concurrency control mechanism. But once again, it may not be a problem if operations requiring access to this engine are quite rare.

@LizardWizzard
Contributor Author

> and another one is pure KV storage where key and value are opaque.

They are sort of opaque. The storage operates with LSNs, and updates will keep history, which needs compaction that has a different meaning from our current compaction. Our compaction expects Postgres-specific keys to be present in the keyspace; I think GC does too.

We can make it work, but this will require changes to make our KV storage truly opaque. I wouldn't underestimate the amount of required changes. Imagine how many edge cases it will create because of two significantly different modes (catalog vs. usual tenant).

> SQLite is very primitive concurrency control mechanism

Yes, I don't think this will be a problem. I would implement this as a Catalog actor that processes messages one by one. This should be pretty simple to reason about, and the load shouldn't be that high. This is all metadata, after all.
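A minimal sketch of such an actor, using only the standard library, could look like the following. A `HashMap` stands in for the embedded database here, and the `Msg` enum along with the `spawn_catalog` and `get` functions are hypothetical names chosen for this example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Requests the catalog actor understands (hypothetical message set).
pub enum Msg {
    Put { key: String, value: String },
    Get { key: String, reply: Sender<Option<String>> },
}

/// Spawns the actor thread. All state lives on that one thread, so
/// requests are handled strictly one at a time without any locking.
pub fn spawn_catalog() -> (Sender<Msg>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Msg>();
    let handle = thread::spawn(move || {
        // Stand-in for the embedded database connection.
        let mut state: HashMap<String, String> = HashMap::new();
        // The loop ends once every sender has been dropped.
        for msg in rx {
            match msg {
                Msg::Put { key, value } => {
                    state.insert(key, value);
                }
                Msg::Get { key, reply } => {
                    // The requester may have gone away; ignore send errors.
                    let _ = reply.send(state.get(&key).cloned());
                }
            }
        }
    });
    (tx, handle)
}

/// Convenience wrapper: send a `Get` and wait for the actor's answer.
/// Returns `None` if the key is missing or the actor has shut down.
pub fn get(catalog: &Sender<Msg>, key: &str) -> Option<String> {
    let (reply, rx) = mpsc::channel();
    catalog
        .send(Msg::Get { key: key.to_string(), reply })
        .ok()?;
    rx.recv().ok()?
}
```

Any number of tenant tasks can clone the `Sender` and submit requests concurrently; serialization happens in the channel, so the storage engine behind the actor only ever sees one caller.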

@jcsp
Collaborator

jcsp commented Oct 28, 2024

We are adding a tenant-wide manifest as part of #8088 -- this can perhaps be extended in future to track non-archived timelines as well if we choose.

@jcsp jcsp closed this as completed Oct 28, 2024