Rewrite State Sync, from a giant state machine to proper async code. #12172
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12172      +/-   ##
==========================================
- Coverage   71.83%   71.66%    -0.18%
==========================================
  Files         827      834        +7
  Lines      166639   166332      -307
  Branches   166639   166332      -307
==========================================
- Hits       119713   119201      -512
- Misses      41709    41914      +205
  Partials     5217     5217

Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
Looks great!
store: Store,
epoch_manager: Arc<dyn EpochManagerAdapter>,
runtime: Arc<dyn RuntimeAdapter>,
network_adapter: AsyncSender<PeerManagerMessageRequest, PeerManagerMessageResponse>,
nit: Why not use type PeerManagerAdapter here, like everywhere else?
Changed to PeerManagerAdapter.
)
.await?;
let state_root = header.chunk_prev_state_root();
if runtime_adapter.validate_state_part(
We can remove validate_state_part from the runtime and make it part of some util; it doesn't have any need to be in the runtime.
That is out of scope for this refactoring, which is already large :)
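For the record, the suggestion is just to hoist the check into a free function whose inputs fully determine the result; a toy sketch, where all the types and the body are placeholders rather than nearcore's real API:

```rust
// Placeholder types; the real ones would come from near_primitives.
pub struct StateRoot(pub [u8; 32]);
pub struct PartId {
    pub idx: u64,
    pub total: u64,
}

/// A free-standing `validate_state_part`: everything it needs is in its
/// arguments, so nothing forces it to live on the RuntimeAdapter.
pub fn validate_state_part(state_root: &StateRoot, part_id: &PartId, data: &[u8]) -> bool {
    // Only basic sanity checks here; the real function would verify that the
    // trie nodes in `data` hash up to `state_root` for this part.
    let _ = state_root;
    part_id.idx < part_id.total && !data.is_empty()
}
```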
chain/client/src/client.rs
Outdated
@@ -2579,7 +2576,7 @@ impl Client {
    sync_hash: CryptoHash,
    state_sync_info: &StateSyncInfo,
    me: &Option<AccountId>,
-) -> Result<HashMap<u64, ShardSyncDownload>, Error> {
+) -> Result<HashMap<u64, ShardSyncStatus>, Error> {
ShardSyncDownload and ShardSyncDownloadView are no longer used anywhere and can be deleted.
Nit: update the comment above.
chain/client/src/client.rs
Outdated
@@ -2465,34 +2464,40 @@ impl Client {

for (sync_hash, state_sync_info) in self.chain.chain_store().iterate_state_sync_infos()? {
    assert_eq!(sync_hash, state_sync_info.epoch_tail_hash);
    let network_adapter = self.network_adapter.clone();

    let shards_to_split = self.get_shards_to_split(sync_hash, &state_sync_info, &me)?;
Ugh, I think Alex merged in some change here. Will need to check if we need shards_to_split at all.
network_adapter,
self.runtime_adapter.store().clone(),
self.epoch_manager.clone(),
self.runtime_adapter.clone(),
I'm not super happy about passing the runtime into state sync. At the top level, it doesn't seem like there should be any dependency between the two. Could we check whether it's possible to decouple them? Maybe as a follow-up to this PR.
Yeah, we can look into that afterwards.
return_if_cancelled!(cancel);

// Finalize; this needs to be done by the Chain.
*status.lock().unwrap() = ShardSyncStatus::StateApplyFinalizing;
Nice! With this setup, it should ideally be possible to convert ShardSyncStatus to a string instead, and then we could keep adding more statuses with more information?
I didn't want to touch that part because it was also used in a couple of other places. Maybe I can clean it up after.
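To make the string-status idea concrete, a self-contained toy version (the names here are placeholders, not the PR's types): the shared status becomes free-form text, so new, more detailed statuses can be reported without extending a shared enum.

```rust
use std::sync::{Arc, Mutex};

// Free-form status shared between the sync task and whoever reports progress.
type ShardStatus = Arc<Mutex<String>>;

fn set_status(status: &ShardStatus, text: &str) {
    *status.lock().unwrap() = text.to_string();
}

fn main() {
    let status: ShardStatus = Arc::new(Mutex::new("pending".to_string()));
    // Arbitrarily detailed messages can be set without touching any enum.
    set_status(&status, "apply finalizing (3/4 parts applied)");
    println!("{}", status.lock().unwrap());
}
```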
/// would be blocked by the computation, thereby not allowing computation of other
/// futures driven by the same driver to proceed. This function respawns the future
/// onto the FutureSpawner, so the driver of the returned future would not be blocked.
fn respawn_for_parallelism<T: Send + 'static>(
Instead of this, is it not possible to make the compute-intensive tasks small enough that they don't block the driver? This looks a bit odd overall, where we are effectively "transferring" the future to a different spawner; it feels awkward.
That's basically what it does; the thing is that tokio_stream::iter never spawns anything; it only awaits multiple futures on the same driver.
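To make the pattern concrete, here is a minimal sketch using plain tokio rather than nearcore's FutureSpawner, so the names and the spawning mechanism are stand-ins: the heavy future is moved onto its own task, and the caller only awaits a lightweight future for the result, so a driver awaiting many such futures (e.g. via tokio_stream::iter) is never blocked.

```rust
use std::future::Future;
use tokio::sync::oneshot;

// Respawn a compute-heavy future onto its own task; the returned future only
// waits for the result, so the caller's driver stays responsive.
fn respawn_for_parallelism<T: Send + 'static>(
    f: impl Future<Output = T> + Send + 'static,
) -> impl Future<Output = T> {
    let (sender, receiver) = oneshot::channel();
    tokio::spawn(async move {
        // Ignore the error case where the caller dropped the receiver.
        let _ = sender.send(f.await);
    });
    async move { receiver.await.expect("spawned task dropped the sender") }
}
```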
chain/client/src/sync/state/mod.rs
Outdated
/// and then in `run` we process them.
header_validation_queue: UnboundedReceiver<StateHeaderValidationRequest>,
chain_finalization_queue: UnboundedReceiver<ChainFinalizationRequest>,
chain_finalization_sender: UnboundedSender<ChainFinalizationRequest>,
Why do we need to store the sender as part of StateSync?
This is removed; replaced with AsyncSender.
}
Err(TryRecvError::Empty) => entry.get().status(),
},
Entry::Vacant(entry) => {
Could we encapsulate this into a function like self.start_state_sync_for_shard for better readability?
There is already the run_state_sync_for_shard function, and the code here isn't really any useful logic, just boilerplate.
chain_finalization_sender: UnboundedSender<ChainFinalizationRequest>,

/// There is one entry in this map for each shard that is being synced.
shard_syncs: HashMap<(CryptoHash, ShardId), StateSyncShardHandle>,
Wait, where and how are we adding new entries into this?
We do a self.shard_syncs.entry(key) at the top, and if it's vacant we insert something into the entry.
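For context, this is the standard HashMap entry pattern; a self-contained toy version (all names are placeholders, not the PR's actual types) showing how a vacant entry triggers starting a new per-shard sync, while an occupied one is only queried:

```rust
use std::collections::{hash_map::Entry, HashMap};

struct Handle {
    status: String,
}

// Look up the per-key handle; start and insert one if it does not exist yet.
fn status_for_key(syncs: &mut HashMap<u64, Handle>, key: u64) -> String {
    match syncs.entry(key) {
        // Already running: just report its current status.
        Entry::Occupied(entry) => entry.get().status.clone(),
        // Not running yet: start it (here: just build a handle) and insert it.
        Entry::Vacant(entry) => {
            let handle = Handle { status: "started".to_string() };
            entry.insert(handle).status.clone()
        }
    }
}

fn main() {
    let mut syncs = HashMap::new();
    println!("{}", status_for_key(&mut syncs, 1)); // "started"
    println!("{}", status_for_key(&mut syncs, 1)); // reuses the existing entry
}
```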
chain/client/src/sync/state/mod.rs
Outdated
}

/// Processes the requests that the state sync module needed the Chain for.
fn process_chain_requests(&mut self, chain: &mut Chain) {
I'm slightly wary about having this setup for the sync work... While this is objectively better for our use case here, we have a pattern where we send an actix message to the client to handle everything that should be synchronous in the client, and this sort of breaks that... I can't think of better ways or alternatives, though.
Changed to AsyncSender.
sync_status: shards_to_split.clone(),
download_tasks: Vec::new(),
computation_tasks: Vec::new(),
},
BlocksCatchUpState::new(sync_hash, *epoch_id),
)
});

// For colour decorators to work, they need to printed directly. Otherwise the decorators get escaped, garble output and don't add colours.
nit: This comment is no longer relevant
@@ -2079,13 +2044,21 @@ impl ClientActorInner {

if block_hash == sync_hash {
    // The first block of the new epoch.
    if let Err(err) = self.client.chain.validate_block(&block) {
Currently we see a lot of "Received an invalid block during state sync" spam during state sync, because the node doesn't know how to validate blocks at the head of the chain. I think it makes sense to only validate the blocks that state sync is specifically looking for.
part_id: *part_id,
},
);
let state_value = PendingPeerRequestValue { peer_id: None, sender };
In the handling for NetworkRequests::StateRequestPart we select a specific peer from which to request the part, so it should be possible to store that and verify that the response comes back from the expected peer. However, it might be a bit ugly to pass the selected peer id back here from the network side of things, and I expect to redo how the state headers work soon, so I am fine with just leaving this as-is for now.
Yeah.. this part is awkward indeed. I also don't know what to do about it right now.
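For reference, one way the check could look once the selected peer id is threaded through; the types and handler below are placeholders for illustration, not the actual nearcore code:

```rust
// Placeholder types standing in for the real nearcore ones.
#[derive(Clone, PartialEq, Eq, Debug)]
struct PeerId(String);

struct PendingPeerRequestValue<T> {
    // `Some(peer)` once the network side reports which peer was asked;
    // `None` keeps today's behavior of accepting a response from anyone.
    peer_id: Option<PeerId>,
    sender: tokio::sync::oneshot::Sender<T>,
}

// Only forward the part if it came from the peer we actually asked; otherwise
// keep the pending request around and wait for the real response.
fn on_state_part_response<T>(
    pending: PendingPeerRequestValue<T>,
    from: &PeerId,
    part: T,
) -> Option<PendingPeerRequestValue<T>> {
    let from_expected_peer =
        pending.peer_id.as_ref().map_or(true, |expected| expected == from);
    if from_expected_peer {
        let _ = pending.sender.send(part);
        None // request fulfilled
    } else {
        Some(pending) // unexpected sender: keep waiting
    }
}
```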
@@ -1100,7 +1100,14 @@ impl PeerActor {
.map(|response| PeerMessage::VersionedStateResponse(*response.0)),
PeerMessage::VersionedStateResponse(info) => {
    //TODO: Route to state sync actor.
seems outdated
let downloader = Arc::new(StateSyncDownloader {
    clock,
    store: store.clone(),
    preferred_source: peer_source,
At the moment we are in an ugly situation with state headers: a node needs to be tracking all shards to serve headers for any shard, but nodes no longer track all shards, and the strategy for obtaining headers from the network is to request them from direct peers of the node at random. It is a remnant of the times when every peer tracked every shard, and it needs to be rewritten entirely.
Before this PR, if an external source was available we would just directly get the headers from it without any attempts to get it from the network. It looks like we are changing that now and will try the network first for headers. I think it should be OK; just giving a heads up that there is a behavior change hidden here.
You mean that before the PR, even if we configured downloading from peers, we would always fetch the header from the external source? Hmm. OK.
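To make the new ordering concrete, a minimal sketch; the function shape and error type here are placeholders, not the StateSyncDownloader API:

```rust
// Header fetch ordering after this PR: ask the preferred (peer) source first,
// and only fall back to the configured external source if that fails, instead
// of going straight to the external source as before.
async fn fetch_header(
    try_network: impl std::future::Future<Output = Result<Vec<u8>, String>>,
    try_external: impl std::future::Future<Output = Result<Vec<u8>, String>>,
) -> Result<Vec<u8>, String> {
    match try_network.await {
        Ok(header) => Ok(header),
        Err(_) => try_external.await,
    }
}
```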
/// headers and parts in parallel for the requested shards, but externally, all that it exposes
/// is a single `run` method that should be called periodically, returning that we're either
/// done or still in progress, while updating the externally visible status.
pub struct StateSync {
Btw, is the async part compatible with testloop?
It is. There's nothing that is incompatible. It's removed anyway.
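As a rough illustration of the interface described in that doc comment (a simplified stand-in, not the real StateSync type): the owner keeps calling run periodically until it reports completion, and the internally updated status is what gets surfaced externally.

```rust
enum SyncProgress {
    InProgress,
    Done,
}

struct StateSync {
    remaining_parts: u32,
}

impl StateSync {
    fn run(&mut self) -> SyncProgress {
        // Internally this would poll the per-shard download/apply futures and
        // update the externally visible status; here we just count down.
        if self.remaining_parts == 0 {
            SyncProgress::Done
        } else {
            self.remaining_parts -= 1;
            SyncProgress::InProgress
        }
    }
}

fn main() {
    let mut sync = StateSync { remaining_parts: 3 };
    // The caller (e.g. the client) keeps calling `run` until it reports Done.
    while let SyncProgress::InProgress = sync.run() {
        // ...do other work between calls...
    }
}
```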
So @saketh-are @shreyan-gupta I addressed some of the issues and I'm just going to merge this PR as it is right now, because the longer it takes, the more merge conflicts I get. Then I'll follow up to clean up some of the things mentioned in the review comments.
Approving to unblock merging, assuming @saketh-are and @shreyan-gupta have already reviewed this.
This rewrites state sync. All functionality is expected to continue to work without any protocol or database changes.
See the top of state/mod.rs for an overview.
State sync status is now available on the debug page; an example: