Epic: Pageserver internal catalog #4636
Comments
KV storage consists of two parts: the first performs the mapping of Postgres keys (pgdatadir_mapping), and the second is pure KV storage where keys and values are opaque. The main drawback of using KV storage for this data is that the data is more or less temporary, while KV storage is persistent, including eviction to S3. But maybe that is not such a big problem...
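For illustration, a minimal sketch of that two-layer split; the trait and type names (OpaqueKv, DatadirMapping) are made up for this sketch and are not the actual pageserver API:

```rust
use std::collections::BTreeMap;

/// Lower layer: keys and values are opaque byte strings.
trait OpaqueKv {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>>;
}

/// Trivial in-memory engine standing in for the real store.
struct MemKv(BTreeMap<Vec<u8>, Vec<u8>>);

impl OpaqueKv for MemKv {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.0.insert(key, value);
    }
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.0.get(key)
    }
}

/// Upper layer: translates a Postgres-style address (relation + block
/// number) into an opaque key, in the spirit of pgdatadir_mapping.
struct DatadirMapping<K: OpaqueKv> {
    kv: K,
}

impl<K: OpaqueKv> DatadirMapping<K> {
    fn put_block(&mut self, rel_oid: u32, blkno: u32, page: Vec<u8>) {
        let mut key = Vec::with_capacity(8);
        key.extend_from_slice(&rel_oid.to_be_bytes());
        key.extend_from_slice(&blkno.to_be_bytes());
        self.kv.put(key, page);
    }
}
```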
There may be problems with concurrent access if hundreds of active tenants try to access the same embedded engine. SQLite has a very primitive concurrency control mechanism. But once again, it may not be a problem if operations requiring access to this engine are quite rare.
They are sort of opaque. The storage operates with LSNs, and updates keep history which needs compaction, which has a different meaning from our current compaction. Our compaction expects Postgres-specific keys to be present in the keyspace, and I think GC does too. We can make it work, but it will require changes to make our KV storage truly opaque. I wouldn't underestimate the amount of required changes; imagine how many edge cases it will create because of two significantly different modes (catalog vs. usual tenant).
Yes, I don't think this can be a problem. I would implement this as a Catalog actor which processes messages one by one. This should be pretty simple to reason about, and the load shouldn't be that high. This is all metadata, after all.
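A minimal sketch of that actor shape, assuming a tokio mpsc channel and made-up message names (CatalogMsg, RecordTimelineCreate); this is not the actual pageserver code:

```rust
use tokio::sync::{mpsc, oneshot};

enum CatalogMsg {
    RecordTimelineCreate {
        tenant: String,
        timeline: String,
        done: oneshot::Sender<()>,
    },
    RecordTimelineDelete {
        tenant: String,
        timeline: String,
        done: oneshot::Sender<()>,
    },
}

async fn catalog_actor(mut rx: mpsc::Receiver<CatalogMsg>) {
    // Messages are handled strictly one at a time, so there is no
    // concurrent access to the underlying storage engine.
    while let Some(msg) = rx.recv().await {
        match msg {
            CatalogMsg::RecordTimelineCreate { tenant, timeline, done } => {
                // ... persist the new timeline entry here ...
                println!("create {tenant}/{timeline}");
                let _ = done.send(());
            }
            CatalogMsg::RecordTimelineDelete { tenant, timeline, done } => {
                // ... persist the deletion here ...
                println!("delete {tenant}/{timeline}");
                let _ = done.send(());
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64);
    tokio::spawn(catalog_actor(rx));

    let (done_tx, done_rx) = oneshot::channel();
    tx.send(CatalogMsg::RecordTimelineCreate {
        tenant: "t1".into(),
        timeline: "tl1".into(),
        done: done_tx,
    })
    .await
    .unwrap();
    done_rx.await.unwrap();
}
```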
We are adding a tenant-wide manifest as part of #8088 -- this can perhaps be extended in the future to track non-archived timelines as well if we choose.
Motivation
Currently in the pageserver we have to maintain some amount of metadata. It is used to properly load tenants and timelines and to deal with non-atomic actions (such as bootstrapping a timeline from scratch using initdb).
There are two parts to the problem. If we take a look at a pageserver with 100k tenants on it, the loading process has to open a ridiculous number of files. The pageserver is supposed to operate with a high number of tenants attached to it, but with most of them not being active (i.e. without a running compute).
It has to list the <data_dir>/tenants directory and load the tenant config, which is a separate file. Then for each tenant we list its timelines directory to learn which timelines are there. Then for each timeline we load the metadata file and list the directory to learn which layer files are there; a rough sketch of this walk is shown below.
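For illustration, the walk described above using only std::fs; the directory and file names here are simplified stand-ins, not the exact pageserver layout. With 100k tenants this adds up to hundreds of thousands of filesystem operations at startup.

```rust
use std::fs;
use std::path::Path;

fn load_all_tenants(data_dir: &Path) -> std::io::Result<()> {
    // One directory listing just to discover tenants...
    for tenant in fs::read_dir(data_dir.join("tenants"))? {
        let tenant_dir = tenant?.path();
        // ...one read per tenant for its config file...
        let _config = fs::read(tenant_dir.join("config"))?;
        // ...one listing per tenant to discover timelines...
        for timeline in fs::read_dir(tenant_dir.join("timelines"))? {
            let timeline_dir = timeline?.path();
            // ...one read per timeline for its metadata file...
            let _metadata = fs::read(timeline_dir.join("metadata"))?;
            // ...and one more listing per timeline to discover layer files.
            for layer in fs::read_dir(&timeline_dir)? {
                let _layer_path = layer?.path();
            }
        }
    }
    Ok(())
}
```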
The second part of the problem is the so-called mark files. When we need to perform some non-atomic action we use specially named files which indicate that the operation has started, so if such a file is still present we know the operation must have been interrupted. Then, if the operation was interrupted by a crash restart, we can clean up the traces of the unfinished operation, or resume it, whichever is more appropriate.
Examples of mark files include TimelineUninitMark, which is used during timeline creation; the tenant ignore mark file, which is used to temporarily exclude a tenant from the working set; and the tenant attaching mark, which is used to continue interrupted attach operations. There are also temporary tenant directories which are used during tenant initialization.
Working with mark files is non-trivial and cumbersome; tenant and timeline deletion is another example of that. A minimal sketch of the mark-file pattern follows.
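The sketch below uses hypothetical helper names and a simplified file layout, just to show the shape of the pattern:

```rust
use std::fs;
use std::path::Path;

/// Create the mark before starting a non-atomic operation.
fn create_uninit_mark(timeline_dir: &Path) -> std::io::Result<()> {
    fs::File::create(timeline_dir.with_extension("uninit-mark"))?;
    Ok(())
}

/// Remove the mark only once the operation has fully completed.
fn remove_uninit_mark(timeline_dir: &Path) -> std::io::Result<()> {
    fs::remove_file(timeline_dir.with_extension("uninit-mark"))
}

/// On startup: a surviving mark means the operation was interrupted,
/// so the partially created timeline must be cleaned up (or resumed).
fn needs_cleanup(timeline_dir: &Path) -> bool {
    timeline_dir.with_extension("uninit-mark").exists()
}
```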
DoD
The problems above are solved: there are no mark files, and metadata is stored in a way that allows it to be loaded quickly.
Implementation ideas
Note that there also needs to be some migration strategy. Whichever option we pick will need to have some gradual adoption plan.
The idea might be to concentrate the needed APIs in some struct Catalog first, then modify the implementation of the Catalog struct so that it writes metadata into two places, and after that switch it completely so that it uses only the new solution; a rough sketch of this dual-write shape is shown below.
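A sketch of what that could look like, with made-up type names (LegacyFiles, NewCatalogStore) standing in for the real stores:

```rust
// Hypothetical migration shape: a Catalog facade writes every change to
// both the legacy file-based store and the new store, so the two stay in
// sync until the old one is dropped.
trait MetadataStore {
    fn record_timeline(&mut self, tenant_id: &str, timeline_id: &str);
}

struct LegacyFiles;      // today's per-tenant/per-timeline files and mark files
struct NewCatalogStore;  // the new single source of truth

impl MetadataStore for LegacyFiles {
    fn record_timeline(&mut self, tenant_id: &str, timeline_id: &str) {
        println!("legacy: write metadata file for {tenant_id}/{timeline_id}");
    }
}

impl MetadataStore for NewCatalogStore {
    fn record_timeline(&mut self, tenant_id: &str, timeline_id: &str) {
        println!("catalog: record {tenant_id}/{timeline_id}");
    }
}

struct Catalog {
    stores: Vec<Box<dyn MetadataStore>>,
}

impl Catalog {
    fn record_timeline(&mut self, tenant_id: &str, timeline_id: &str) {
        // During migration: write to both; afterwards the legacy store is removed.
        for store in &mut self.stores {
            store.record_timeline(tenant_id, timeline_id);
        }
    }
}
```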
Tasks
Other related tasks and Epics