-
All in all this looks great. A few small comments.
I would argue
Probably tbh. I'm not convinced they need separation.
A problem for another day :)
A slightly related issue is what Gabe just mentioned around warning chains about queued proofs.
I'm not sold on the name, but that's not the end of the world. In its current form per_epoch feels slightly confusing to me as you mean a given specific epoch rather than something occurring over multiple epochs.
I feel we don't want to do this immediately but after n epochs.
-
Following up on our discussion about evolving the database schema over time. Let me know what you think of the following design.

The idea is to have a storage initializer. The initializer is responsible for taking the storage from any previous version to the latest schema version. It consists of multiple "steps", each pushing the store state to the next version. When we need to change the schema, we add an initialization step, leaving the previous ones intact. The latest executed step is stored in the db, so we know to skip any steps that have already been executed. The initializer would be used roughly as follows:

```rust
let db = StoreInitializer::open("path/to/db")?
    .step(|dbtx| {
        // Here we can add CFs, remove CFs, copy data, change data format etc.,
        // through the provided handle to a db transaction.
    })?
    .step(|dbtx| {
        // Another step taking us to the next version up
    })?
    // More steps added here when the data format changes
    .finish(); // This gives us the final database handle initialized to the latest schema.
```

Rough implementation outline (making up the method names):

```rust
struct StoreInitializer {
    db: DB,
    cur_step: u32,
    init_db_step: u32,
}

impl StoreInitializer {
    fn open(db_path: Path) -> Result<Self> {
        let db = DB::open_with_cfs(db_path, ["StorageMeta"])?;
        let init_db_step = db.get("StorageMeta", "InitStep")?.unwrap_or(0);
        let cur_step = 0;
        Ok(Self { .. })
    }

    fn step(mut self, body: impl FnOnce(&DbTx) -> Result<()>) -> Result<Self> {
        self.cur_step += 1;
        if self.cur_step > self.init_db_step {
            // Everything in a transaction to make sure the migration changes and
            // step bump both happen atomically.
            let dbtx = self.db.transaction()?;
            body(&dbtx)?;
            dbtx.put("StorageMeta", "InitStep", self.cur_step)?;
            dbtx.commit()?;
        }
        Ok(self)
    }
}
```

It's basically equivalent to a series of `if` statements of the following form, but without the need to track the version / step number manually:

```rust
if init_db_step < 1 {
    let dbtx = db.transaction()?;
    do_some_stuff_with_dbtx(&dbtx)?;
    bump_initialization_step_to(&dbtx, 1);
    dbtx.commit();
}
/* ... */
if init_db_step < N {
    let dbtx = db.transaction()?;
    do_some_other_stuff_with_dbtx(&dbtx)?;
    bump_initialization_step_to(&dbtx, N);
    dbtx.commit();
}
```

The scheme could be improved / made more robust in a number of ways.
Let me know what you think.
-
This discussion is here to present a first iteration of the storage design for the AggLayer.
The goal is to define how we store the data and how the data is represented inside a node.
This discussion doesn't take into account some topics such as synchronization, Merkle tree storage performance, versioning and Store API design.
Introduction
To gather and store the data related to the AggLayer and the pessimistic proofs being managed, we'll rely on RocksDB.
The goal of this first iteration is to make the AggLayer fault-tolerant with regard to crashes, reboots and redeployments.
Another condition is that the storage must be embedded, so as not to rely on a managed or external service.
Access to the data should be as simple and as quick as possible.
The rest of the document will use this glossary, to which you can refer at any time:

- `ProvenCertificate`: a certificate that has an associated proof

Defining the needs
In order to store data efficiently, we need both quick access to important metadata and access to heavier data when needed. We also need the capability to get values and check the existence of keys quickly.
We also want to achieve these kinds of actions:

- For one `network` and for a particular `height`, being able to fetch the `certificate_id` and the epoch it's settled in
- For one `certificate_id`, being able to retrieve a `CertificateHeader` which consists of: `height`, `epoch_number`, `certificate_index`, `proof_index` and `network_id`
- Being able to get the latest settled epoch number
- Being able to queue `UnprovenCertificate` per network and height
- For one `certificate_id`, being able to fetch a generated proof
- Being able to store the SMT and `local_exit_tree`
The different logical stores
This discussion doesn't aim to define those stores in depth, but we need some components to check whether the storage can be used in a consistent way.
Based on the actions defined above, we can define multiple "stores". By `Store`, I mean a logical entity responsible for executing and managing resources to offer higher-level functions for accessing the data. However, the store doesn't own the data; it just facilitates access to it.

We can list those stores:

- `State`, which contains critical and mandatory data for the AggLayer to work
- `Pending`, for everything related to the pending queues
- `Metadata`, for persistent information related to the node itself and not the data it manages
- `Index`, for data that is not critical but that can facilitate interoperability
- `PerEpoch`, for information and data related to an epoch

All stores are instantiated once, except for `PerEpoch`, which is instantiated for each epoch.

If we take the list of actions, we can assign each action to one or multiple stores:
For one `network` and for a particular `height`, being able to fetch the `certificate_id` and the epoch it's settled in

Knowing if a `certificate` exists at a particular height for a network is critical in order to accept or deny an incoming certificate. This action could be fully handled by the `StateStore`.

For one `certificate_id`, being able to retrieve a `CertificateHeader` which consists of: `height`, `epoch_number`, `certificate_index`, `proof_index` and `network_id`

The `CertificateHeader` isn't critical for the AggLayer itself, as it doesn't really need this information, but it can be really useful for an external component to fetch. The `IndexStore` seems to be the best place for that; the certificate headers could also be moved to a completely different storage mechanism if needed.

Being able to get the latest settled epoch number

When the AggLayer reboots or starts, it needs to know quickly which epoch has been settled; the `MetadataStore` seems to be the one for that. (This information could maybe be fetched from L1, but for the sake of simplicity, we'll store it for now.)

Being able to queue `UnprovenCertificate` per network and height

`UnprovenCertificate`s are certificates that are verified but not yet proven; they are not yet part of the state, nor part of an epoch. This seems to be a `PendingStore` candidate.

For one `certificate_id`, being able to fetch a generated proof

This is a particular action: it implies that we have both a `certificate` and a `proof` that can be settled into an epoch.
There are two possible solutions:

- We keep generating `proofs` when receiving certificates, and it's not directly linked to an epoch: `PendingStore`
- We generate `proofs` for certificates only when having an epoch to put them into: `PerEpochStore`

Being able to store the SMT and `local_exit_tree`

As those trees are important, they fall into the `StateStore`.
The different physical storage
As defined above, the `Store` will not own the data; the `Storage` will. I define `Storage` as the physical storage that owns the data and that can be persisted. The `Storage` is a combination of one or multiple `database`s (RocksDB instances) and an abstraction layer.

This abstraction layer is of two kinds:

- The `DB` layer, which abstracts the API/interface of the RocksDB `database`
- The `Columns` layer, which defines the different CFs that are used by the store to interact with the DB

The `DB` layer will not be covered here, as it is purely a technical implementation detail.
For the `Columns`, it could be interesting to define a first iteration of how those CFs are distributed across the `database`s.
.Data structures involved
Before diving into the CF definitions, let's define what we'll have to store. (This information can change after this discussion, but I will use it for the rest of the doc.)
The `database` needs to hold:

CFs definition
In this section, I will explain the first design of the CFs that can be used by the stores:
For this section, some keys or values are defined using parentheses: it means that the key or value contains multiple "values", but the whole thing is serialized into one single byte array. Double-quoted values or keys mean that they are plain-text encoded. Multiple occurrences of `Key -> Value` define multiple possible key formats inside the same CF.

certificate_per_network

This CF stores for each network the settled `certificate_id`, with the associated epoch and epoch index.

certificate_header

This CF stores for each `certificate_id` the `CertificateHeader`.

latest_settled_certificate_per_network

This CF stores for each `network_id` the latest settled `certificate_id` with the `height` and `epoch` information.

metadata

This CF stores all the metadata of the AggLayer instance.
For now, I can think of one important entry:
`latest_settled_epoch => epoch_number`

proof_per_certificate

This CF stores for each `certificate_id` the associated proof bytes.

pending_queue

This CF stores the pending queue of `UnprovenCertificate`.
Currently, the value is an array of certificate bytes, as we could receive multiple concurrent certificates for the same height.
This could be optimized in a next iteration.

local_exit_tree_per_network

This CF stores for each network the local exit tree. This could be replaced by an SMT.

nullifier_tree_per_network

This CF stores for each network the nullifier tree with its associated `root` and `leaves`.

balance_tree_per_network

This CF stores for each network the balance tree with its associated `root` and `leaves`.

proofs_per_epoch

This CF stores for each epoch the list of proofs generated.

certificates_per_epoch

This CF stores for each epoch the list of certificates settled.

metadata_per_epoch

This CF stores for each epoch the different metadata related to that epoch.
Currently, I can think of these entries:
`epoch_number -> N`
`tx_hash -> transaction hash settled on L1`
`number_of_certificates -> N`
Assigning CFs to physical database
After defining the CFs, we can assign them to different `database`s.
At first, we can see that the `certificate bytes array` and `proof bytes array` can only be found in the `pending_queue` and `certificates_per_epoch` for certificates, and in `proof_per_certificate` and `proofs_per_epoch` for proofs. It means that those types can be in two states:

- `UnprovenCertificate`: a certificate without an associated proof, or not associated to an epoch
- `ProvenCertificate`: a certificate with an associated proof and associated to an epoch

`UnprovenCertificate` could be placed into a `database` named `pending`, while `ProvenCertificate` could be placed into a "per_epoch" database to clusterize the data. A `per_epoch` database will hold everything related to an epoch; after an epoch is closed, this database can be put in read-only mode, archived and even pruned. This could prevent the `State` of the AggLayer from aggregating too much data in one single database.

`Indexes` and `metadata` are not really mandatory, but come close; we have two choices:

- Two `database`s, one for `indexes` and the other for `metadata`
- A single `database` which contains both

In any case, this `database` will own: `metadata`, `latest_settled_certificate_per_network`, `certificate_header`.

The next big one contains everything needed to perform the AggLayer work; we can define a `state` database that contains all of that. The representation would be the following:

If we assign the CFs to those databases:
Conclusion
This document is way too long, I know. If you have any questions or remarks, feel free to raise them!
For me, there are still pending questions:

- Do we merge `metadata` and `indexes` into a single `database`?
- Does `latest_settled_certificate_per_network` belong in the `state` database or in the `metadata`/`indexes` one?