feat(state-sync): sync to the current epoch instead of the previous #12102

Open
wants to merge 11 commits into
base: master

marcelo-gonzalez
Contributor

When a node processes a block that’s the first block of epoch T, and it realizes that it will need to track shards that it doesn’t currently track in epoch T+1, it syncs state to the end of epoch T-1 and then applies chunks until it’s caught up. We want to change this so that it syncs state to epoch T instead, so that the integration of state sync/catchup and resharding will be simpler.

In this PR, this is done by keeping most of the state sync logic unchanged, but changing the “sync_hash” that’s used to identify what point in the chain we want to sync to. Before, “sync_hash” was set to the first block of an epoch, and the existing state sync logic would have us sync the state as of two chunks before this hash. So here we change the sync hash to be the hash of the first block for which at least two new chunks have been seen for each shard in its epoch. This allows us to sync state to epoch T with minimal modifications, because the old logic is still valid.
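As a rough illustration of the sync hash rule described above, here is a minimal sketch. `BlockInfo`, `new_chunk_mask`, and `find_sync_hash` are hypothetical names for this example and are not the types used in the PR. It also assumes that the chunks included in a block itself count toward that block's total, which the description does not spell out.

```rust
/// Hypothetical, simplified view of a block: only what this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockInfo {
    /// Stand-in for the block hash.
    pub hash: u64,
    /// `true` at index `i` if this block includes a new chunk for shard `i`.
    pub new_chunk_mask: Vec<bool>,
}

/// Walks the blocks of one epoch in order, starting from the first block of
/// the epoch, and returns the hash of the first block by which every shard
/// has had at least two new chunks. Returns `None` if no block in the slice
/// satisfies that yet.
///
/// Assumption: a block's own new chunks are included in its count.
pub fn find_sync_hash(epoch_blocks: &[BlockInfo], num_shards: usize) -> Option<u64> {
    let mut new_chunks = vec![0u32; num_shards];
    for block in epoch_blocks {
        for (shard, included) in block.new_chunk_mask.iter().enumerate().take(num_shards) {
            if *included {
                new_chunks[shard] += 1;
            }
        }
        if new_chunks.iter().all(|&n| n >= 2) {
            return Some(block.hash);
        }
    }
    None
}
```

Because the old logic syncs state as of two chunks before the sync hash, picking the hash this way keeps the synced state inside the current epoch for every shard, even when some blocks are missing chunks.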

Note that this PR does not implement support for this new way of syncing for nodes that have fallen several epochs behind the chain, only for nodes that need to catch up for an upcoming epoch. Support for the former can be added in a future PR.

@marcelo-gonzalez
Contributor Author

Btw, for reviewers, this PR is not quite ready to be submitted because it is only minimally tested on localnet, where I just checked that it syncs properly. You can try it with this:

diff --git a/core/primitives/src/epoch_manager.rs b/core/primitives/src/epoch_manager.rs
index 281be4399..b73b9348a 100644
--- a/core/primitives/src/epoch_manager.rs
+++ b/core/primitives/src/epoch_manager.rs
@@ -166,6 +166,8 @@ impl AllEpochConfig {
     pub fn generate_epoch_config(&self, protocol_version: ProtocolVersion) -> EpochConfig {
         let mut config = self.genesis_epoch_config.clone();
 
+        config.validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true;
+
         Self::config_mocknet(&mut config, &self.chain_id);
 
         if !self.use_production_config {

Then if you run the transactions.py pytest, it should be able to finish after nodes sync state in the new way. You might have to comment out some asserts in that test that fail sometimes; this looks unrelated, but I will check it.

So before submitting, I need to try this with more meaningful state and traffic/receipts, probably on forknet. It would also be good to add some integration tests, and to fix whichever integration tests or pytests might have been broken by this. The FIXME comment in this PR also needs to be fixed before I can submit this. But in any case, it should mostly be ready for review.

One thing to be decided in this PR review is whether the gating via current protocol version I put in there looks okay. It feels kind of ugly to me, but it might be the easiest way to go.
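For reviewers unfamiliar with this kind of gating, the general shape is roughly the following. This is only a sketch: the constant `CURRENT_EPOCH_STATE_SYNC_VERSION`, its value, the `SyncTarget` enum, and `sync_target` are all hypothetical and do not correspond to identifiers in the PR.

```rust
pub type ProtocolVersion = u32;

/// Hypothetical protocol version at which the new behavior turns on.
/// The value is a placeholder, not a real protocol version.
pub const CURRENT_EPOCH_STATE_SYNC_VERSION: ProtocolVersion = 100;

/// Which epoch's state a node syncs to when it needs new shards.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncTarget {
    /// Old behavior: sync to the end of epoch T-1, then apply chunks.
    PreviousEpoch,
    /// New behavior: sync to a point inside epoch T.
    CurrentEpoch,
}

/// Picks the sync behavior from the protocol version in effect.
pub fn sync_target(protocol_version: ProtocolVersion) -> SyncTarget {
    if protocol_version >= CURRENT_EPOCH_STATE_SYNC_VERSION {
        SyncTarget::CurrentEpoch
    } else {
        SyncTarget::PreviousEpoch
    }
}
```

Keeping the version check in one function like this would at least confine the gating to a single place, which may make it less ugly to remove later.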

/// is the first block of the epoch, these two meanings are the same. But if the sync_hash is moved forward
/// in order to sync the current epoch's state instead of last epoch's, this field being false no longer implies
/// that we want to apply this block during catchup, so some care is needed to ensure we start catchup at the right
/// point in Client::run_catchup()
pub(crate) is_caught_up: bool,
Contributor

Is it possible to split this field into multiple fields (or an enum) to differentiate these meanings? It feels like the field being false can mean either that we want to apply the chunks or that we don't, depending on other state such as sync_hash.
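One possible shape for such a split, purely as a sketch: the enum `CatchupState`, its variants, and its methods are hypothetical names, and the variants are inferred only from the doc comment above, not from the actual code.

```rust
/// Hypothetical replacement for the `is_caught_up` boolean that separates
/// the two meanings "false" can carry once the sync hash is moved forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CatchupState {
    /// State for the next epoch's shards is already up to date at this block.
    CaughtUp,
    /// Not caught up, and this block should be applied during catchup.
    ApplyDuringCatchup,
    /// Not caught up, but catchup should start at a later block, so this
    /// block is not applied during catchup.
    BeforeSyncPoint,
}

impl CatchupState {
    /// Equivalent of the existing boolean field.
    pub fn is_caught_up(self) -> bool {
        matches!(self, CatchupState::CaughtUp)
    }

    /// True only for blocks that catchup needs to apply.
    pub fn apply_during_catchup(self) -> bool {
        matches!(self, CatchupState::ApplyDuringCatchup)
    }
}
```

With the old sync hash (first block of the epoch), the `BeforeSyncPoint` variant would never occur, which matches the comment's note that the two meanings coincide in that case.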

Contributor Author

I think actually in both cases if this field is false, we don't want to apply the chunks for shards we don't currently track, and this logic should be the same:

fn get_should_apply_chunk(

I think we probably could split it, but it's a little bit tricky. Let me think about it actually... For now in this PR it is kept as is, to avoid touching too many things and possibly breaking something. The tricky part is that right now we add the first block of the epoch to the BlocksToCatchup column based on this field, which is then read to see if we'll need to catch up the next block after this one as well:

Ok((self.prev_block_is_caught_up(&prev_prev_hash, &prev_hash)?, None))

I guess where that is called, maybe we can just call get_state_sync_info() again and also check whether catchup is already done, but it requires some care.

@marcelo-gonzalez
Contributor Author

I actually just removed the test_mock_node_basic() test, since I think it's kind of outdated anyway. Hopefully nobody objects to that... I have some plans to delete the hacky part of the mock-node code that generates home dirs anyway, in favor of just providing the home dirs from mainnet or testnet up-front.
