From ebb4717b9ae93c4b48c879456e12ebfebe806ba6 Mon Sep 17 00:00:00 2001
From: robin-near <111538878+robin-near@users.noreply.github.com>
Date: Thu, 5 Dec 2024 21:04:59 -0800
Subject: [PATCH] [Epoch Sync] Make epoch sync happen before header sync on AwaitingPeers. (#12563)

If I understand correctly, the AwaitingPeers state is simply a marker for
the starting state. We transition away from it when header sync replaces it
with HeaderSync, which happens once there are enough peers to run the header
sync code at all.

So, before this PR, we would start in AwaitingPeers, and epoch sync would see
that and say "oh we don't have enough peers, so let's skip". Header sync would
then take the stage and start syncing headers. This ruins the header_head by
moving it away from genesis, making epoch sync no longer eligible.

In fact, this happens pretty reliably: at startup we would always perform
header sync before epoch sync, and since epoch sync is most likely slower than
the first header sync response, we continue epoch sync with an incorrect
header_head (causing either an almost-correct proof application, or a stall if
the epoch sync request fails).

There are a few more hardening fixes that we should consider, but for now this
should fix the root cause by no longer treating AwaitingPeers as special. We
also no longer treat StateSync as special, because that state is not possible
while the header_head is at genesis.
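
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the nearcore implementation: `Node`, `MIN_PEERS`, `epoch_sync_step`, `header_sync_step`, `tick`, and the payload-free `SyncStatus` variants are all hypothetical simplifications. It also assumes that header sync leaves the status alone while epoch sync is in progress. The `skip_on_awaiting_peers` flag switches between the behavior before and after this change.

```rust
// Toy model only: every name here is hypothetical and much simpler than
// the real types in chain/client.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncStatus {
    AwaitingPeers,
    EpochSync,
    HeaderSync,
}

struct Node {
    status: SyncStatus,
    header_head_height: u64, // 0 stands for genesis in this model
    peers: usize,
}

// Assumed threshold; the real peer requirement is not modeled.
const MIN_PEERS: usize = 1;

impl Node {
    fn new(peers: usize) -> Self {
        Node { status: SyncStatus::AwaitingPeers, header_head_height: 0, peers }
    }
}

/// One epoch sync step. `skip_on_awaiting_peers = true` models the old
/// behavior, where AwaitingPeers caused an early return.
fn epoch_sync_step(node: &mut Node, skip_on_awaiting_peers: bool) {
    // Epoch sync is only eligible while header_head is still at genesis.
    if node.header_head_height != 0 {
        return;
    }
    if skip_on_awaiting_peers && node.status == SyncStatus::AwaitingPeers {
        return;
    }
    if node.status == SyncStatus::AwaitingPeers && node.peers >= MIN_PEERS {
        node.status = SyncStatus::EpochSync;
    }
}

/// One header sync step. It replaces AwaitingPeers with HeaderSync once
/// there are enough peers, and advances header_head.
fn header_sync_step(node: &mut Node) {
    if node.peers < MIN_PEERS {
        return;
    }
    match node.status {
        SyncStatus::AwaitingPeers => {
            node.status = SyncStatus::HeaderSync;
            node.header_head_height += 1;
        }
        SyncStatus::HeaderSync => node.header_head_height += 1,
        // Assumption: header sync does not run while epoch sync is active.
        SyncStatus::EpochSync => {}
    }
}

/// One sync iteration: epoch sync gets the first chance, then header sync.
fn tick(node: &mut Node, skip_on_awaiting_peers: bool) {
    epoch_sync_step(node, skip_on_awaiting_peers);
    header_sync_step(node);
}
```

In this model, calling `tick(&mut node, true)` on a fresh `Node::new(3)` ends in `HeaderSync` with `header_head_height` moved off genesis, so epoch sync is never eligible again. Calling `tick(&mut node, false)` instead ends in `EpochSync` with `header_head_height` still at genesis.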
---
 chain/client/src/sync/epoch.rs | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/chain/client/src/sync/epoch.rs b/chain/client/src/sync/epoch.rs
index 632d3a29397..a08b5e3c433 100644
--- a/chain/client/src/sync/epoch.rs
+++ b/chain/client/src/sync/epoch.rs
@@ -604,9 +604,6 @@ impl EpochSync {
             return Ok(());
         }
         match status {
-            SyncStatus::AwaitingPeers | SyncStatus::StateSync(_) => {
-                return Ok(());
-            }
             SyncStatus::EpochSync(status) => {
                 if status.attempt_time + self.config.timeout_for_epoch_sync < self.clock.now_utc() {
                     tracing::warn!("Epoch sync from {} timed out; retrying", status.source_peer_id);