Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip slot fix #4843

Merged
merged 4 commits into from
Feb 10, 2025
Merged

Skip slot fix #4843

merged 4 commits into from
Feb 10, 2025

Conversation

bw-solana
Copy link

@bw-solana bw-solana commented Feb 7, 2025

Problem

We see a case where we skip 1 more slot than we have to after a partitioning scenario.

Sequence goes like this:

  1. Leader A is slow in producing its first block N. Could be for any reason, but generally replay has slowed down heading into leader slot
  2. Leader B gives up on Leader A and tries to build it's first block N+4
  3. Leader A finally delivers N
  4. Leader B replays N, resets onto it, and abandons its block N+4
  5. Leader B receives and replays blocks N+1, N+2, and N+3
  6. Leader B is unable to rebuild N+4 because it would be seen as a duplicate block. It properly identifies shreds for this slot in the blockstore and skips its leader slot
  7. Leader B also skips building slot N+5 because it concludes it needs to wait for grace ticks

Step 7 is the bug. We could build N+5 after Leader A is done with all of its blocks, but we don't.

Also note a few additional key points to help understand why this is broken:

  1. fn next_leader_slot also checks for existence of slots in blockstore to identify which slot to start on. In the example above, when resetting onto N+3, we will set our next leader slot range to be N+5 to N+7.
  2. The next leader slots dictate how we setup ticks to start on (e.g. leader_first_tick_height_including_grace_ticks)
  3. Normally, when resetting onto N+3, we have a condition in fn reached_leader_tick (see https://github.com/anza-xyz/agave/blob/master/poh/src/poh_recorder.rs#L499-L504)that lets us skip grace ticks because we detect PoH was reset to run immediately. But this conditional is broken when we skip our first leader slot.
  4. In our example above, if N is slot 0, we would get the following: start tick height = 257, grace ticks = 64, leader_first_tick_height_including_grace_ticks = 385. We don't reset immediately, we don't trigger any of the grace tick bypasses, so we have to wait until we tick to 385, which is our 3rd slot 💀

Summary of Changes

  1. Fix up some logic in fn can_skip_grace_ticks to handle this specific case. Essentially, if we see we're building off the previous leader's blocks, allow skipping grace ticks if we've reset onto their last block.
  2. Fix up unit test to cover all possible paths through fn reached_leader_tick, including the newly added path

Testing

The first commit in this PR includes some debug prints that were used to confirm every path is covered by unit test:

running 1 test
1. We have finished waiting for grace ticks.
2. Building off my own block. No need to wait.
6. We haven't ticked to our leader slot yet.
7. Building off same fork as previous leader but they're not done
8. PoH was reset to run immediately.
9. Building off the previous leader's last block
10. No pending blocks from previous leader have been observed. No need to wait.
11. Not configured to wait for pending blocks from previous leader.
12. Wait for grace ticks
test poh_recorder::tests::test_reached_leader_tick ... ok

When running the updated unit test against unfixed fn can_skip_grace_ticks, we can see it fails for the case where we expect to be able to skip grace ticks when building off the previous leader's last block

running 1 test
1. We have finished waiting for grace ticks.
2. Building off my own block. No need to wait.
3. We haven't ticked to our leader slot yet.
4. Building off same fork as previous leader but they're not done
5. PoH was reset to run immediately.
thread 'poh_recorder::tests::test_reached_leader_tick' panicked at poh/src/poh_recorder.rs:2038:9:
assertion failed: poh_recorder.reached_leader_tick(&leader_b_pubkey,
    leader_b_start_tick + grace_ticks)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test poh_recorder::tests::test_reached_leader_tick ... FAILED

@bw-solana bw-solana marked this pull request as ready for review February 7, 2025 17:43
jstarry
jstarry previously approved these changes Feb 10, 2025
Copy link

@jstarry jstarry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!

// previous leader's slots.
// If so, PoH is currently building on the previous leader's blocks
// If not, PoH is building on a different fork
slot == self.start_slot()
})
}

// Check if the last slot PoH reset onto was the previous leader's last slot.
fn building_off_previous_leader_last_block(&self, my_pubkey: &Pubkey, next_slot: Slot) -> bool {
// Walk backwards from the next slot we want to build.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be "from the slot before our next leader slot"

@bw-solana bw-solana merged commit 9ba0b93 into anza-xyz:master Feb 10, 2025
42 of 44 checks passed
@bw-solana bw-solana deleted the skip_slot_fix branch February 10, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants