Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update syncing logic to fix duplicate block requests #3410

Open
wants to merge 5 commits into
base: staging
Choose a base branch
from

Conversation

kpandl
Copy link
Contributor

@kpandl kpandl commented Oct 3, 2024

Motivation

As issue #3404 described, validators send out duplicate block requests when syncing. This PR updates the syncing logic to make sure we only remove a block response after it was added to the ledger or if we encountered an error.

For background, currently, syncing works as follows: BlockSync requests the blocks, and stores it in the internal responses map. Sync calls block_sync.process_next_block(current_height) to get the next block to process. This removes it from BlockSync’s responses and returns it to Sync. Sync performs the checks, and if successful, adds the block to the ledger. This also updates BlockSync’s canon.

The issue is that BlockSync‘s view of the canon is only updated after Sync is done processing blocks (and adding to the ledger). There’s a window where BlockSync is unaware of blocks that are pending validation in Sync. Thus, it re-requests them.

This PR adjusts the block processing interface to the Sync module as follows:

  • process_next_block READS a block from a particular height
  • remove_block_response REMOVES a block from a particular height, which MUST be called after advancing completed or failed

Note: in the latest commit, process_next_block was renamed to peek_next_block, following the reviewer discussion.

Test Plan

Ran it locally and verified it removes the duplicate block request.
Also ran it with light load in a cloud devnet, and verified syncing works for validators and clients.

The following figure shows a syncing process before the fix, with a longer pause until a duplicate request is generated:
syncing_not_fixed

The following figure shows the syncing process with the fix (note there is no double request and no delay, as such, the syncing is much faster):
syncing_fixed

Closes #3404

@vicsn vicsn requested a review from zkxuerb October 3, 2024 13:24
@kpandl kpandl force-pushed the fix-sync-duplicate-block-request branch from 9f546d8 to 88d5574 Compare December 16, 2024 13:59
node/sync/src/block_sync.rs Outdated Show resolved Hide resolved
}
},
Err(err) => {
warn!("{err}");
Copy link
Collaborator

@ljedrz ljedrz Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is a change that might surprise users who relied on the format of the old log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good find! I've added more expressive warnings similar to the previous code here: d6e4f73

Copy link
Collaborator

@ljedrz ljedrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some nits

zkxuerb
zkxuerb previously approved these changes Dec 19, 2024
}

/// Removes the block request for the given peer IP, if it exists.
#[allow(dead_code)]
fn remove_block_request_to_peer(&self, peer_ip: &SocketAddr, height: u32) {
let mut can_revoke = self.responses.read().get(&height).is_none();
let mut can_revoke = self.process_next_block(height).is_none();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is remove_block_request_to_peer now calling process_next_block instead of just removing it from the responses?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I answered my own question. You are now nesting the response removal inside process_next_block and adding some additional checks.

Copy link
Collaborator

@ljedrz ljedrz Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I also believe it to be fine, an adjustment to the function name or extending the doc comment could be beneficial here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good discussion! I've renamed process_next_block to peek_next_block and adjusted the doc comments here: 778961a
Indeed, we are no longer removing the block from the responses in the peek_next_block/process_next_block function.

Copy link
Contributor

@raychu86 raychu86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think I'll feel a bit more comfortable with the change after we run our gambit of sync tests and malice tests, but otherwise logic looks good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Duplicate block requests when validators sync
4 participants