[optqs] bug fixes and perf improvements #15452
base: main
Conversation
        Self::gc_previous_epoch_batches_from_db(db_clone, epoch);
    });
} else {
    Self::gc_expired_batches_from_db(
Should all gc be spawn_blocking? An epoch nearing completion could also have a lot of batches to gc? I see this call eventually calls spawn_blocking on delete_batches, so might as well make things consistent and guard against get_all_batches being expensive?
Maybe the name is misleading. This call also populates the batch store's in-memory cache, and until we populate this cache we cannot return, so this cannot be spawned async.
Yeah, it seems more consistent to spawn_blocking both. This branch is for restart, right?
Ah makes sense. Maybe some renaming/moving code around will make this more clear :) Not a blocker, more a nit
oh I didn't see the response lol
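For reference, a minimal sketch of the shape this thread converges on; the bindings around the quoted lines (db, is_new_epoch, certified_time) are assumptions, not the exact source:

// Sketch: epoch-rollover GC is pure DB cleanup, so it can go to the
// blocking pool; the restart path also warms the in-memory batch cache,
// so callers must wait for it and it stays inline.
if is_new_epoch {
    let db_clone = db.clone();
    tokio::task::spawn_blocking(move || {
        Self::gc_previous_epoch_batches_from_db(db_clone, epoch);
    });
} else {
    Self::gc_expired_batches_from_db(db, certified_time);
}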
We should split the optqs changes out from the PR; it's too big and probably shouldn't be rushed into a release.
@@ -286,6 +288,10 @@ impl PipelinedBlock {
            .take()
            .expect("pre_commit_result_rx missing.")
    }

    pub fn set_qc(&self, qc: Option<Arc<QuorumCert>>) {
Why does this take an Option?
Removed the option and made an assertion when calling set_qc
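A minimal sketch of the revised signature, assuming a Mutex-wrapped qc slot on PipelinedBlock (an infallible Mutex, as used elsewhere in this codebase):

// Hypothetical non-Option variant; the field layout is an assumption.
pub fn set_qc(&self, qc: Arc<QuorumCert>) {
    *self.qc.lock() = Some(qc);
}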
        self.proposal.epoch()

    /// Returns the epoch associated with the proposal after verifying that the
    /// payload data is also associated with the same epoch
    pub fn verified_epoch(&self) -> Result<u64> {
It seems better to verify the payload epoch in the verify_well_formed function and keep this one just returning the u64?
This is how we did it for BatchMsg, it seems; copied that. Moved it to verify_well_formed.
for block in &blocks_to_commit {
    // Given the block is ready to commit, then it certainly must have a QC.
    // However, we keep it an Option to be safe.
    block.set_qc(self.get_quorum_cert_for_block(block.id()));
Instead of this, we can just do a reverse iteration, since every block carries the parent's QC. We still need to get the QC for the last block, though.
Do you think the call is expensive? The change looks more verbose... I also want to Arc the QC so I can avoid cloning it.
let last = blocks_to_commit
    .last()
    .expect("at least one block is required");
last.set_qc(Arc::new(
    self.get_quorum_cert_for_block(last.id())
        .expect("QC must be present"),
));
// Every block carries its parent's QC, so propagate them in reverse.
for pair in blocks_to_commit.windows(2).rev() {
    let (parent, child) = (&pair[0], &pair[1]);
    parent.set_qc(Arc::new(child.block().quorum_cert().clone()));
}
Nah, I just didn't like the Option there.
Okay, I will just expect a QC then and still use the function call.
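The resolved call site would then look something like this sketch (the return type of get_quorum_cert_for_block is assumed to be an Option):

// Keep the lookup, but assert the QC's existence instead of threading
// an Option through set_qc.
block.set_qc(Arc::new(
    self.get_quorum_cert_for_block(block.id())
        .expect("block ready to commit must have a QC"),
));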
consensus/src/payload_manager.rs (Outdated)
let result = {
    // If the future is completed then use the result.
    if let Some(result) = fut.clone().now_or_never() {
Is it worth having this early check? Just appending the responders doesn't seem too bad?
Yeah, makes sense; we compute it anyhow. I will remove it.
consensus/src/payload_manager.rs (Outdated)
@@ -604,11 +643,13 @@ async fn process_payload_helper<T: TDataInfo>(
        .batch_summary
        .iter()
        .map(|proof| {
This is not a proof but a data ptr, right?
consensus/src/payload_manager.rs (Outdated)
@@ -604,11 +643,13 @@ async fn process_payload_helper<T: TDataInfo>(
        .batch_summary
        .iter()
        .map(|proof| {
-           let mut signers = proof.signers(ordered_authors);
+           let mut signers = signers.clone();
+           signers.append(&mut proof.signers(ordered_authors));
We add the QC signers to proof responders too?
It's only passed for opt batches, not proofs. proof here is a data ptr; will rename.
    }
},
Payload::QuorumStoreInlineHybrid(inline_batches, proof_with_data, _) => {
    if proof_with_data.proofs.iter().any(|p| p.epoch() != epoch) {
nit: I think it's slightly better to have this?
ensure!(iter.all(..), "...");
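Spelled out against the quoted branch, that might look like this sketch (the error message is an assumption):

// Rewrite of the `if ... any` check: bail with a descriptive error
// instead of branching on the negation.
ensure!(
    proof_with_data.proofs.iter().all(|p| p.epoch() == epoch),
    "QuorumStoreInlineHybrid payload contains proofs from a different epoch"
);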
@@ -77,6 +77,9 @@ impl ProposalMsg {
            "Proposal {} does not define an author",
            self.proposal
        );
        self.proposal
            .payload()
It feels like it should go inside proposal.verify_well_formed?
@@ -56,6 +56,7 @@ pub trait TPayloadManager: Send + Sync {
    async fn get_transactions(
        &self,
        block: &Block,
        block_signers: Option<BitVec>,
Hmm, this is not a vector of signers because the QC only has the bitvec?
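For context, a sketch of turning the QC's signer bitmap back into authors; the BitVec accessor and the ordered_authors slice mirror nearby code but are assumptions here:

// Hypothetical expansion of the bitvec into a Vec<Author> using the
// validator ordering the payload manager already holds.
let signers: Vec<Author> = block_signers
    .map(|bits| {
        ordered_authors
            .iter()
            .enumerate()
            .filter(|(idx, _)| bits.is_set(*idx as u16))
            .map(|(_, author)| *author)
            .collect()
    })
    .unwrap_or_default();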
@@ -453,13 +461,15 @@ impl TPayloadManager for QuorumStorePayloadManager {
        self.batch_reader.clone(),
        block,
        &self.ordered_authors,
        block_signers.as_ref(),
    )
    .await?;
    let proof_batch_txns = process_payload_helper(
We should rename process_payload_helper and process_payload; they're confusing lol. Maybe process_optqs_payload and process_qs_payload, etc.
    let _tracker = Tracker::new("prepare", &block);
    // the loop can only be aborted by the caller
    let input_txns = loop {
-       match preparer.prepare_block(&block).await {
+       match preparer.prepare_block(&block, qc.clone()).await {
This probably doesn't work: since the future is built when we receive the block, we need to pass the Arc of Mutex here and check whether the QC is set on every loop run.
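A sketch of the shared-slot approach being suggested, with hypothetical names (the infallible Mutex mirrors the codebase's usual wrapper):

// Share a slot the QC handler fills, and re-read it on every retry
// instead of capturing a snapshot when the future is built.
let qc_slot: Arc<Mutex<Option<Arc<QuorumCert>>>> = Arc::new(Mutex::new(None));

let input_txns = loop {
    // Pick up a QC that may have arrived since the last attempt.
    let qc = qc_slot.lock().clone();
    match preparer.prepare_block(&block, qc).await {
        Ok(txns) => break txns,
        Err(_e) => {
            // Retry; per the quoted comment, only the caller can
            // abort this loop.
        },
    }
};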
@@ -1135,6 +1135,20 @@ impl RoundManager {

    pub async fn process_verified_proposal(&mut self, proposal: Block) -> anyhow::Result<()> {
        let proposal_round = proposal.round();
        let sync_info = self.block_store.sync_info();

        if proposal_round <= sync_info.highest_round() {
I thought inserting an existing block is a no-op (and will not create a vote)?
Description
This PR includes the following fixes:
- In optQS, batches can only be fetched from consensus and the batch proposer until the block gets a QC. After the QC forms, batches should be fetched from QC signers as well to guarantee progress.
- Builds on top of the previous commit to update responders in flight without waiting for the previous fetch to complete.
- If a validator fetches in the critical path of a proposal and then votes, it is possible that it already has a SyncInfo for that round because other validators moved on. In such a case, the validator should ignore the proposal instead of voting. This issue came up in logs because some validators couldn't verify the Vote message, as the vote round and the SyncInfo QC round were the same.
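A sketch of the mechanics behind the first two fixes, with assumed names (the real logic lives in the batch fetcher / payload manager):

// The fetcher re-reads this shared responder list on each retry, so
// appending the QC signers takes effect on the next request without
// cancelling the in-flight fetch.
fn on_quorum_cert_received(
    responders: &Arc<Mutex<Vec<Author>>>,
    qc_signers: impl IntoIterator<Item = Author>,
) {
    responders.lock().extend(qc_signers);
}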