[optqs] bug fixes and perf improvements #15452

Open
wants to merge 7 commits into
base: main

Contributor

@ibalajiarun ibalajiarun commented Dec 2, 2024

Description

This PR includes the following fixes:

  • [optqs] Support fetching batches from QC signers
    In optQS, batches can only be fetched from consensus and the batch proposer until the block gets a QC. After the QC forms, batches should be fetched from the QC signers as well to guarantee progress.
  • [optqs] Ability to update responders on an in-flight fetch
    Builds on the previous commit to update responders in flight, without waiting for the previous fetch to complete.
  • [optqs] Ignore stale proposals due to fetch lag
    If a validator fetches in the critical path of a proposal and then votes, it might already have a SyncInfo for that round because other validators have moved on. In that case, the validator should ignore the proposal instead of voting. This issue showed up in logs because some validators couldn't verify the Vote message: the vote round and the SyncInfo QC round were the same.
  • Enables OptQS on forge
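
For readers skimming the description, the first fix can be pictured with a minimal sketch. It is not the code in this PR: `PeerId`, `batch_responders`, and both parameters are hypothetical names, and peers are plain integers. It only illustrates the idea that the responder set starts with whoever is known to hold the batch before a QC exists and grows by the QC signers afterwards.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a validator identity.
type PeerId = u32;

/// Returns the peers that may be asked for an optimistic batch.
/// `pre_qc_responders` are the peers known to hold the batch before the
/// block has a QC; once a QC exists, its signers are added to the set.
fn batch_responders(
    pre_qc_responders: &[PeerId],
    qc_signers: Option<&[PeerId]>,
) -> BTreeSet<PeerId> {
    let mut responders: BTreeSet<PeerId> = pre_qc_responders.iter().copied().collect();
    if let Some(signers) = qc_signers {
        // Duplicates collapse because the result is a set.
        responders.extend(signers.iter().copied());
    }
    responders
}
```

A set is used here so that a peer that is both an early responder and a QC signer is only queried once; whether the real code deduplicates is not stated in this thread.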

trunk-io bot commented Dec 2, 2024

⏱️ 9h 40m total CI duration on this PR
| Slowest 15 Jobs | Cumulative Duration | Recent Runs |
|---|---|---|
| test-target-determinator | 1h 29m | 🟩🟩🟩🟩🟩 (+16 more) |
| execution-performance / single-node-performance | 1h 8m | 🟩🟩🟩 |
| forge-realistic-env-graceful-overload / forge | 56m | 🟥🟥 |
| rust-cargo-deny | 38m | 🟩🟩🟩🟩🟩 (+17 more) |
| check-dynamic-deps | 24m | 🟩🟩🟩🟩🟩 (+18 more) |
| rust-move-tests | 14m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |

@ibalajiarun ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 4f486f1 to b55b5f0 Compare December 2, 2024 23:22
@ibalajiarun ibalajiarun added the CICD:run-forge-e2e-perf Run the e2e perf forge only label Dec 2, 2024

@ibalajiarun ibalajiarun changed the title from "Balaji/optqs batch" to "[optqs] support fetching batches from QC signers" Dec 2, 2024

@ibalajiarun ibalajiarun marked this pull request as ready for review December 3, 2024 01:45

@ibalajiarun ibalajiarun force-pushed the balaji/optqs-batch branch 2 times, most recently from 3dab484 to be1a1b2 Compare December 3, 2024 19:12

Self::gc_previous_epoch_batches_from_db(db_clone, epoch);
});
} else {
Self::gc_expired_batches_from_db(

Contributor

Should all GC run under spawn_blocking? An epoch nearing completion could also have a lot of batches to GC. This call eventually invokes spawn_blocking on delete_batches anyway, so we might as well make things consistent and guard against get_all_batches being expensive.

Contributor Author

Maybe the name is misleading. This call also populates the batch store's in-memory cache, and we cannot return until that cache is populated, so it cannot be spawned asynchronously.

Contributor

yeah, it seems more consistent to both spawn_blocking, this branch is for restart right?

Contributor

Ah makes sense. Maybe some renaming/moving code around will make this more clear :) Not a blocker, more a nit

Contributor

oh I didn't see the response lol
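
The constraint described in this thread (the cache must be filled before the call returns, while deletion of expired entries can happen elsewhere) might look roughly like the sketch below. Everything in it is an assumption for illustration: `StoredBatch`, `load_and_gc`, and the use of `std::thread::spawn` as a stand-in for a runtime's blocking pool are not taken from the PR.

```rust
use std::collections::HashMap;
use std::thread::{self, JoinHandle};

/// Hypothetical persisted batch record: (batch id, expiration time).
type StoredBatch = (u64, u64);

/// Populates the in-memory cache synchronously, then hands the expired ids
/// to a background thread. `std::thread::spawn` stands in for an async
/// runtime's blocking pool in this sketch.
fn load_and_gc(
    stored: Vec<StoredBatch>,
    now: u64,
) -> (HashMap<u64, u64>, JoinHandle<Vec<u64>>) {
    let mut cache = HashMap::new();
    let mut expired = Vec::new();
    for (id, expiration) in stored {
        if expiration < now {
            expired.push(id);
        } else {
            cache.insert(id, expiration);
        }
    }
    // The caller receives a fully populated cache before any deletion runs.
    let deleter = thread::spawn(move || {
        // A real store would issue the database deletes here.
        expired
    });
    (cache, deleter)
}
```

Splitting the scan from the delete this way keeps the part the caller depends on synchronous; the reviewers' suggestion to rename or move code would make that split visible in the function names.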

@ibalajiarun ibalajiarun added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) and removed the CICD:run-forge-e2e-perf label (run the e2e perf forge only) Dec 4, 2024

Contributor

@zekun000 zekun000 left a comment

We should split the optqs changes out of this PR. It's too big and probably shouldn't be rushed into the release.

@@ -286,6 +288,10 @@ impl PipelinedBlock {
.take()
.expect("pre_commit_result_rx missing.")
}

pub fn set_qc(&self, qc: Option<Arc<QuorumCert>>) {

Contributor

Why does this take an Option?

Contributor Author

Removed the option and made an assertion when calling set_qc
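
One way to picture the resolution is the sketch below. It is hypothetical: the `QuorumCert` and `PipelinedBlock` types here are minimal stand-ins, and `mark_committed` is an invented caller. The point is only that `set_qc` takes the QC by value while the caller asserts that one exists.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a quorum certificate.
#[derive(Debug, PartialEq)]
struct QuorumCert {
    round: u64,
}

/// Hypothetical block wrapper; only the QC slot is modelled.
#[derive(Default)]
struct PipelinedBlock {
    qc: Mutex<Option<Arc<QuorumCert>>>,
}

impl PipelinedBlock {
    /// Takes the QC by value, so "no QC" cannot be expressed at the call site.
    fn set_qc(&self, qc: Arc<QuorumCert>) {
        *self.qc.lock().unwrap() = Some(qc);
    }

    /// Returns the QC if one has been set.
    fn qc(&self) -> Option<Arc<QuorumCert>> {
        self.qc.lock().unwrap().clone()
    }
}

/// The caller owns the assertion that a block being committed has a QC.
fn mark_committed(block: &PipelinedBlock, lookup: Option<Arc<QuorumCert>>) {
    block.set_qc(lookup.expect("QC must be present for a block being committed"));
}
```

The slot itself stays an `Option` internally because a block exists before its QC does; only the setter's signature changes.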

self.proposal.epoch()
/// Returns the epoch associated with the proposal after verifying that the
/// payload data is also associated with the same epoch
pub fn verified_epoch(&self) -> Result<u64> {

Contributor

It seems better to verify the payload epoch in the verify_well_formed function and keep this one returning just a u64?

Contributor Author

This seems to be how we did it for BatchMsg, so I copied that. Moved it to verify_well_formed.

for block in &blocks_to_commit {
// Given the block is ready to commit, then it certainly must have a QC.
// However, we keep it an Option to be safe.
block.set_qc(self.get_quorum_cert_for_block(block.id()));

Contributor

Instead of this, we can just do a reverse iteration, since every block carries its parent's QC. We still need to look up the QC for the last block, though.

Contributor Author

Do you think the call is expensive? The change looks more verbose... I also want to wrap the QC in an Arc so I can avoid cloning it.

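        // The last block's QC is not carried by any child, so look it up.
        let last = blocks_to_commit
            .last()
            .expect("at least one block is required");
        last.set_qc(
            self.get_quorum_cert_for_block(last.id())
                .expect("QC must be present"),
        );

        // Every other block's QC is carried by its child.
        for pair in blocks_to_commit.windows(2).rev() {
            let (parent, child) = (&pair[0], &pair[1]);
            parent.set_qc(Arc::new(child.block().quorum_cert().clone()));
        }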

Contributor

nah I just didn't like the Option there

Contributor Author

Okay, I will just expect a QC then and still use the function call.


let result = {
// If the future is completed then use the result.
if let Some(result) = fut.clone().now_or_never() {

Contributor

is it worth having this early check? just appending the responders doesn't seem too bad?

Contributor Author

yeah, makes sense. we compute it anyhow. I will remove it.
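
A rough sketch of "just append the responders" follows. It assumes the in-flight fetch reads its responder list from shared state; `InflightFetch` and its methods are hypothetical names and not the types in this PR.

```rust
use std::collections::BTreeSet;
use std::sync::{Arc, Mutex};

/// Hypothetical handle to a fetch that may still be running.
/// Clones share the same responder set.
#[derive(Clone, Default)]
struct InflightFetch {
    responders: Arc<Mutex<BTreeSet<u32>>>,
}

impl InflightFetch {
    /// Always appends; there is no early check for whether the fetch
    /// has already completed.
    fn add_responders(&self, extra: &[u32]) {
        self.responders.lock().unwrap().extend(extra.iter().copied());
    }

    /// Snapshot that a fetch loop could take before each request round.
    fn current_responders(&self) -> Vec<u32> {
        self.responders.lock().unwrap().iter().copied().collect()
    }
}
```

Appending to a finished fetch is harmless in this model: the extra responders are simply never read.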

@@ -604,11 +643,13 @@ async fn process_payload_helper<T: TDataInfo>(
.batch_summary
.iter()
.map(|proof| {

Contributor

this is not a proof but a data ptr right?

@@ -604,11 +643,13 @@ async fn process_payload_helper<T: TDataInfo>(
.batch_summary
.iter()
.map(|proof| {
- let mut signers = proof.signers(ordered_authors);
+ let mut signers = signers.clone();
+ signers.append(&mut proof.signers(ordered_authors));

Contributor

we add the qc signers to proof responders too?

Contributor Author

it's only passed for opt batches not proofs. proof is a data ptr. will rename.

@ibalajiarun ibalajiarun changed the title from "[optqs][qs] bug fixes and perf improvements" to "[optqs] bug fixes and perf improvements" Dec 5, 2024

@ibalajiarun ibalajiarun requested a review from zekun000 December 5, 2024 00:56

Contributor

github-actions bot commented Dec 5, 2024

✅ Forge suite realistic_env_max_load success on 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a

two traffics test: inner traffic : committed: 14890.32 txn/s, latency: 2664.43 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5661660
two traffics test : committed: 99.94 txn/s, latency: 1318.32 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 1600 ms), latency samples: 1760
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.501, avg: 1.444", "ConsensusProposalToOrdered: max: 0.330, avg: 0.291", "ConsensusOrderedToCommit: max: 0.378, avg: 0.369", "ConsensusProposalToCommit: max: 0.667, avg: 0.660"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.80s no progress at version 25178 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.74s no progress at version 2777402 (avg 0.74s) [limit 16].
Test Ok

Contributor

github-actions bot commented Dec 5, 2024

✅ Forge suite compat success on 3527aa2e299553b759c515d9843586bad48c802c ==> 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a

Compatibility test results for 3527aa2e299553b759c515d9843586bad48c802c ==> 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a (PR)
1. Check liveness of validators at old version: 3527aa2e299553b759c515d9843586bad48c802c
compatibility::simple-validator-upgrade::liveness-check : committed: 13828.85 txn/s, latency: 2443.98 ms, (p50: 1800 ms, p70: 2000, p90: 4500 ms, p99: 12700 ms), latency samples: 451020
2. Upgrading first Validator to new version: 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6437.33 txn/s, latency: 4473.62 ms, (p50: 5100 ms, p70: 5400, p90: 5600 ms, p99: 5700 ms), latency samples: 116740
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6299.62 txn/s, latency: 5255.41 ms, (p50: 5500 ms, p70: 5600, p90: 6000 ms, p99: 7300 ms), latency samples: 209760
3. Upgrading rest of first batch to new version: 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7106.64 txn/s, latency: 3977.71 ms, (p50: 4400 ms, p70: 4500, p90: 4800 ms, p99: 5100 ms), latency samples: 143620
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7268.22 txn/s, latency: 4495.55 ms, (p50: 4700 ms, p70: 4800, p90: 5800 ms, p99: 6400 ms), latency samples: 240720
4. upgrading second batch to new version: 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11594.44 txn/s, latency: 2407.06 ms, (p50: 2600 ms, p70: 2800, p90: 3100 ms, p99: 3400 ms), latency samples: 200120
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12001.63 txn/s, latency: 2667.59 ms, (p50: 2800 ms, p70: 2900, p90: 3100 ms, p99: 3300 ms), latency samples: 387740
5. check swarm health
Compatibility test for 3527aa2e299553b759c515d9843586bad48c802c ==> 1f2e6581370ea6ef5e96ffbf290b35fd97eddd0a passed
Test Ok

}
},
Payload::QuorumStoreInlineHybrid(inline_batches, proof_with_data, _) => {
if proof_with_data.proofs.iter().any(|p| p.epoch() != epoch) {

Contributor

nit: I think it's slightly better to have this?

  ensure!(iter.all(..), "...");
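
To make the nit concrete, here is a small sketch of the positive `all` form. It uses a plain `Result<(), String>` instead of the `ensure!` macro so that it needs no external crate; `Proof` and `verify_epoch` are hypothetical names.

```rust
/// Hypothetical stand-in for a proof that carries an epoch.
struct Proof {
    epoch: u64,
}

/// Checks that every proof belongs to `epoch`, phrased as a positive `all`
/// rather than a negated `any`.
fn verify_epoch(proofs: &[Proof], epoch: u64) -> Result<(), String> {
    if proofs.iter().all(|p| p.epoch == epoch) {
        Ok(())
    } else {
        Err(format!("payload contains a proof outside epoch {epoch}"))
    }
}
```

Both forms accept an empty list of proofs, so switching from `any` to `all` should not change behavior on empty payloads.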

@@ -77,6 +77,9 @@ impl ProposalMsg {
"Proposal {} does not define an author",
self.proposal
);
self.proposal
.payload()

Contributor

It feels like this should go inside proposal.verify_well_formed?

@@ -56,6 +56,7 @@ pub trait TPayloadManager: Send + Sync {
async fn get_transactions(
&self,
block: &Block,
block_signers: Option<BitVec>,

Contributor

hmm, this is not vector of signers because qc only has the bitvec?
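
As background for the question, a signer bitmap can be turned into a list of signers once the ordered validator list is available. The sketch below assumes that bit i refers to the i-th ordered author and uses `&[bool]` as a stand-in for a compact bit vector; `signers_from_bits` is a hypothetical helper.

```rust
/// Maps a signer bitmap onto the ordered validator list.
/// Position i of `bits` refers to `ordered_authors[i]`; positions beyond the
/// shorter of the two inputs are ignored.
fn signers_from_bits<'a>(bits: &[bool], ordered_authors: &'a [&'a str]) -> Vec<&'a str> {
    bits.iter()
        .zip(ordered_authors)
        .filter(|(set, _)| **set)
        .map(|(_, author)| *author)
        .collect()
}
```

Passing the bitmap through and resolving it next to `ordered_authors` avoids building the signer list in places that never fetch.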

@@ -453,13 +461,15 @@ impl TPayloadManager for QuorumStorePayloadManager {
self.batch_reader.clone(),
block,
&self.ordered_authors,
block_signers.as_ref(),
)
.await?;
let proof_batch_txns = process_payload_helper(

Contributor

We should rename "process_payload_helper" and "process_payload"; they're confusing lol. Maybe "process_optqs_payload" and "process_qs_payload", etc.

let _tracker = Tracker::new("prepare", &block);
// the loop can only be abort by the caller
let input_txns = loop {
- match preparer.prepare_block(&block).await {
+ match preparer.prepare_block(&block, qc.clone()).await {

Contributor

This probably doesn't work: the future is built when we receive the block, so a QC cloned at that point is never updated. We need to pass the Arc of a Mutex here and check whether the QC is set on every loop iteration.
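
The suggestion can be sketched as follows. This is a synchronous simplification with hypothetical names (`QcSlot`, `prepare_with_retries`), a bounded retry count, and a QC reduced to a round number; the real loop is async and can only be aborted by the caller.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shared slot that is filled in once the block gets a QC.
type QcSlot = Arc<Mutex<Option<u64>>>;

/// Retries `prepare` until it succeeds, re-reading the slot on every attempt
/// so that a QC arriving after the loop started is still observed.
fn prepare_with_retries<F>(qc_slot: &QcSlot, max_attempts: usize, mut prepare: F) -> Option<String>
where
    F: FnMut(Option<u64>) -> Result<String, String>,
{
    for _ in 0..max_attempts {
        // Read the current value; the lock is released before `prepare` runs.
        let qc_now = *qc_slot.lock().unwrap();
        if let Ok(txns) = prepare(qc_now) {
            return Some(txns);
        }
    }
    None
}
```

Capturing a clone of the QC when the future is built would freeze it at `None`; reading the shared slot inside the loop is what lets a later QC change the outcome of the next attempt.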

@@ -1135,6 +1135,20 @@ impl RoundManager {

pub async fn process_verified_proposal(&mut self, proposal: Block) -> anyhow::Result<()> {
let proposal_round = proposal.round();
let sync_info = self.block_store.sync_info();

if proposal_round <= sync_info.highest_round() {

Contributor

I thought inserting existing block is no-op (and will not create a vote)
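
For reference, the guard under discussion reduces to a round comparison. The sketch below mirrors the condition in the hunk above with plain integers; `ProposalAction` and `check_proposal` are hypothetical names, and whether the guard is needed at all is exactly what the reviewer is asking.

```rust
/// Outcome of the staleness check for an incoming proposal.
#[derive(Debug, PartialEq)]
enum ProposalAction {
    Process,
    IgnoreStale,
}

/// Ignores a proposal whose round is not ahead of the highest round already
/// recorded in the local sync info.
fn check_proposal(proposal_round: u64, highest_synced_round: u64) -> ProposalAction {
    if proposal_round <= highest_synced_round {
        ProposalAction::IgnoreStale
    } else {
        ProposalAction::Process
    }
}
```

The equal-round case is the one named in the PR description, where the vote round and the SyncInfo QC round were the same.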

Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
3 participants