feat(l1): snap sync overhaul #1763

Merged
merged 208 commits into main from sync-overhaul on Jan 29, 2025

Conversation

fmoletta
Contributor

@fmoletta fmoletta commented Jan 21, 2025

Motivation
This PR introduces the following upgrades for snap-sync:

  • Use DB-persisted checkpoints so that sync progress survives restarts & cycles (see the sketch at the end of this section)
  • Stop ForkChoices & NewPayloads from being applied while syncing
  • Improved handling of stale pivot during sub-processes
  • Improved handling of pending requests when aborting due to stale pivot
  • Fetching of large storage tries (that don't fit in a single range request)
  • Safer (but a bit slower) healing that can be restarted
  • Faster storage fetching (multiple parallel fetches)

It also simplifies snap sync by removing the following logic:

  • No longer downloads bodies and receipts for blocks before the pivot during snap sync (WARNING: this goes against the spec but shouldn't be a problem for the time being)
  • Removes the restart from the latest block when `latest - 64` becomes stale (by that point it is more effective to wait for the next fork choice update)
  • Periodically shows state sync progress
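
A rough sketch of the checkpoint idea follows. The `SnapCheckpoint` fields and `Store` methods here are hypothetical stand-ins, not the actual ethrex API; the point is simply that an aborted cycle can resume from the persisted progress instead of starting over:

```rust
// Hypothetical sketch: these types and method names are illustrative,
// not the actual ethrex storage API.
use std::collections::HashMap;

type H256 = [u8; 32];

#[derive(Clone)]
struct SnapCheckpoint {
    /// Hash of the last block header downloaded before the sync was interrupted.
    last_downloaded_block: H256,
    /// State root the interrupted cycle was syncing towards.
    pivot_state_root: H256,
    /// Last account hash fetched during the state-range phase.
    last_fetched_key: H256,
}

/// Stand-in for the node's persistent store (a real implementation would
/// write these entries to the database rather than keep them in memory).
#[derive(Default)]
struct Store {
    checkpoints: HashMap<&'static str, SnapCheckpoint>,
}

impl Store {
    fn set_snap_checkpoint(&mut self, cp: SnapCheckpoint) {
        self.checkpoints.insert("snap_sync", cp);
    }
    fn snap_checkpoint(&self) -> Option<SnapCheckpoint> {
        self.checkpoints.get("snap_sync").cloned()
    }
    /// Called once snap sync completes successfully.
    fn clear_snap_checkpoint(&mut self) {
        self.checkpoints.remove("snap_sync");
    }
}

/// Resume the account-range fetch from the persisted key, or start from
/// the beginning of the keyspace on a fresh sync.
fn next_starting_key(store: &Store) -> H256 {
    store
        .snap_checkpoint()
        .map(|cp| cp.last_fetched_key)
        .unwrap_or([0u8; 32])
}
```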

Description

  • Stores the last downloaded block's hash in the DB during snap sync to serve as a checkpoint if the sync is aborted halfway (a common case when syncing from genesis). This checkpoint is cleared upon successful snap sync.

  • No longer fetches receipts or block bodies for blocks before the pivot block during snap sync

  • Adds a sync_status method which returns an enum with the current sync status (either Inactive, Active, or Pending) and uses it in the ForkChoiceUpdate & NewPayload engine RPC endpoints so that their logic isn't applied during an active or pending sync (a rough sketch follows this list).

  • Fetcher processes now identify stale pivots and remain passive until they receive the end signal

  • Fetcher processes now return their current queue when they exit so that it can be persisted into the next cycle

  • Stores the latest state root during state sync and healing as a checkpoint

  • Stores the last fetched key during state sync as a checkpoint

  • Healing no longer stores the nodes received via p2p; instead it inserts the leaf values and rebuilds the trie, avoiding trie corruption between restarts.

  • The current progress percentage and estimated time to finish are periodically reported during state sync (one possible way to compute these from the checkpoint is sketched after this list)

  • Disables the following Paris & Cancun engine hive tests that previously yielded false positives due to new payloads being accepted on top of a syncing chain:

    • Invalid NewPayload (family)
    • Re-Org Back to Canonical Chain From Syncing Chain
    • Unknown HeadBlockHash
    • In-Order Consecutive Payload Execution (Flaky)
    • Valid NewPayload->ForkchoiceUpdated on Syncing Client
    • Invalid Missing Ancestor ReOrg
      • Payload Build after New Invalid Payload (only Cancun)
  • It also disables the following tests, which fail with the flag Syncing=true for the same reason:

    • Bad Hash on NewPayload
    • ParentHash equals BlockHash on NewPayload (only for Paris)
    • Invalid PayloadAttributes (family)
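
The sync-status guard described in this list could look roughly like the sketch below. The Inactive/Active/Pending variants come from this PR's description; the surrounding types, method names, and handler shape are assumptions, not the actual implementation:

```rust
// Illustrative only: the enum variants come from the PR description,
// the surrounding types and method names are hypothetical.
#[derive(Clone, Copy, PartialEq, Eq)]
enum SyncStatus {
    /// No snap sync in progress.
    Inactive,
    /// A snap-sync cycle is currently running.
    Active,
    /// A sync has been requested but has not started yet.
    Pending,
}

struct SyncManager {
    status: SyncStatus,
}

impl SyncManager {
    fn sync_status(&self) -> SyncStatus {
        self.status
    }
}

/// Sketch of how an Engine API handler could short-circuit while syncing.
fn handle_new_payload(sync: &SyncManager /* , payload: NewPayload */) -> &'static str {
    match sync.sync_status() {
        // Don't apply payloads on top of a chain we are still syncing;
        // just report SYNCING back to the consensus client.
        SyncStatus::Active | SyncStatus::Pending => "SYNCING",
        SyncStatus::Inactive => {
            // ...validate and execute the payload as usual...
            "VALID"
        }
    }
}
```

Likewise, one possible way to derive the reported progress and ETA from the persisted last-fetched-key checkpoint (an assumption about the approach, not necessarily what this PR does) is to read that key as a fraction of the 2^256 account keyspace:

```rust
use std::time::{Duration, Instant};

/// Approximate completion of the account-range phase by interpreting the
/// last fetched hashed key as a position inside the 2^256 keyspace.
/// Using only the first 8 bytes keeps the arithmetic simple and is more
/// than precise enough for a progress log line.
fn state_sync_progress(last_fetched_key: [u8; 32]) -> f64 {
    let prefix = u64::from_be_bytes(last_fetched_key[..8].try_into().unwrap());
    prefix as f64 / u64::MAX as f64
}

/// Naive ETA: assume the remaining keyspace is fetched at the same average
/// rate as the portion completed so far.
fn estimate_remaining(start: Instant, progress: f64) -> Option<Duration> {
    if progress <= 0.0 {
        return None;
    }
    let elapsed = start.elapsed().as_secs_f64();
    Some(Duration::from_secs_f64(elapsed * (1.0 - progress) / progress))
}
```
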

Misc:

Closes None

@fmoletta fmoletta marked this pull request as ready for review January 22, 2025 21:53
@fmoletta fmoletta requested a review from a team as a code owner January 22, 2025 21:53
@fmoletta fmoletta added this pull request to the merge queue Jan 29, 2025
Merged via the queue into main with commit ea3ae65 Jan 29, 2025
20 checks passed
@fmoletta fmoletta deleted the sync-overhaul branch January 29, 2025 14:01
github-merge-queue bot pushed a commit that referenced this pull request Feb 3, 2025
**Motivation**
Reverts the "slow but safe" approach to healing taken by #1763 and
instead adds the intermediate nodes received from peers to the trie.

**Description**
* Write nodes fetched during healing to the state instead of writing only leaf values (see the sketch after this list)
* Track paths left in queue when a healing cycle ends due to pivot
staleness
* Condense snap checkpoint clearing into a single method
* (Misc) Move some noisy p2p info tracing to debug
* (Fix) When executing full blocks after the pivot block during snap sync, execute each block before setting it as canonical
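
A minimal sketch of the node-writing idea in the first bullet; the types and the hashing helper are hypothetical, not the client's real trie API:

```rust
// Hypothetical sketch of writing healed trie nodes straight to the state:
// nodes received from peers are keyed by their hash, so they can be
// committed as-is without re-deriving them from leaf values.
use std::collections::HashMap;

type NodeHash = [u8; 32];

#[derive(Default)]
struct TrieState {
    nodes: HashMap<NodeHash, Vec<u8>>,
}

impl TrieState {
    /// Commit a batch of RLP-encoded nodes received from a peer response.
    /// `hash_node` stands in for the client's actual node hashing routine.
    fn commit_healed_nodes(
        &mut self,
        encoded_nodes: Vec<Vec<u8>>,
        hash_node: impl Fn(&[u8]) -> NodeHash,
    ) {
        for encoded in encoded_nodes {
            let hash = hash_node(&encoded);
            self.nodes.insert(hash, encoded);
        }
    }
}
```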

**Comments**
A not-so-pretty solution (the startup flag) was implemented in the storage_healer to avoid blocking storage healing at the start. This can and will be improved, but doing so in this PR is not worth it, since follow-up PRs will already touch this logic when parallelizing storage node fetching (similar to storage ranges).

**Testing**
This proved to work fine on the `Mekong` testnet when introducing forced sleeps and delays to trigger healing. It is still too slow for the Holesky testnet; this should improve once we add more parallelization to state sync & healing.
