slot-based-collator: Allow multiple blocks per slot #7569

Open · wants to merge 17 commits into master
Conversation

@skunert (Contributor) commented Feb 13, 2025

Summary: This PR enables authoring of multiple blocks in one AURA slot in the slot-based collator and stabilizes the slot-based collator.

CLI Changes

The flag --experimental-use-slot-based is now marked as deprecated. Instead of just removing the experimental prefix, I opted to introduce --authoring slot-based. The --authoring flag gives us some future-proofing in case we want to add further variants later.

Change Description

With elastic-scaling, we are able to author multiple blocks with a single relay-chain parent. In the initial iteration, the interval between two blocks was determined by the slot_duration of the parachain. This PR introduces a more flexible model, where we try to author multiple blocks in a single slot if the runtime allows it.

The block authoring loop is largely the same. The SlotTimer now lives in a separate module and is updated with the last seen core count. It will then trigger rounds in the block-building loop based on the core count.

This allows some flexibility where elastic-scaling chains can run on a single core in quiet times. Previously, running on 1 core with a 3-core elastic-scaling chain would result in authors getting skipped because the slot_duration was too low for the single scheduled core.
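As a rough illustration of the model, here is a simplified sketch (assumed names and logic, not the PR's actual SlotTimer code): the timer keeps the last seen core count and derives the interval between authoring rounds from it, triggering at least once per relay-chain block and at most once per parachain slot.

use std::time::Duration;

struct SlotTimer {
    para_slot_duration: Duration,
    relay_slot_duration: Duration,
    last_seen_cores: u32,
}

impl SlotTimer {
    /// Record the core count observed for the latest relay-chain block.
    fn update_scheduling(&mut self, cores: u32) {
        self.last_seen_cores = cores.max(1);
    }

    /// Interval between two authoring rounds: split the relay-chain block among
    /// the scheduled cores, but never trigger more often than once per parachain slot.
    fn production_interval(&self) -> Duration {
        (self.relay_slot_duration / self.last_seen_cores).min(self.para_slot_duration)
    }
}

fn main() {
    let mut timer = SlotTimer {
        para_slot_duration: Duration::from_secs(6),
        relay_slot_duration: Duration::from_secs(6),
        last_seen_cores: 1,
    };
    // Three cores seen on the last relay-chain block -> author every 2 seconds.
    timer.update_scheduling(3);
    assert_eq!(timer.production_interval(), Duration::from_secs(2));
    // Back to a single core in quiet times -> one block per relay-chain block.
    timer.update_scheduling(1);
    assert_eq!(timer.production_interval(), Duration::from_secs(6));
}

With such a scheme, an elastic-scaling chain that temporarily sees only one scheduled core falls back to one block per relay-chain block instead of skipping authors.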

Parameter Considerations

The core logic does not change, so there are a few things to consider:

  • The ConsensusHook implementation still determines how many blocks are allowed per relay-chain block. So if you add arbitrary cores to an async-backing, 6-second parachain, can_build_upon in the runtime will deny block-building of the additional blocks.
  • The MINIMUM_PERIOD in the runtime needs to be configured to allow enough blocks per slot. A "classic" configuration of SLOT_DURATION/2 will lead to slot mismatches when running with 3 cores (see the configuration sketch after this list).
  • We fetch the available cores at least once per relay-chain block. So if a parachain runs with a 12-second slot duration and 1 fixed core, we would still author 2 blocks per slot if the parachain runtime allows it.
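As a point of reference for the two runtime knobs above, here is a hedged configuration sketch for a chain targeting 3 blocks per 6-second AURA slot. Constant names follow the usual parachain-template style, Runtime is assumed to exist, and the concrete values are illustrative rather than recommended:

use frame_support::parameter_types;

// 2-second blocks, 6-second AURA slot, 3 cores -> up to 3 blocks per slot.
pub const SLOT_DURATION: u64 = 6_000;
pub const RELAY_CHAIN_SLOT_DURATION_MILLIS: u32 = 6_000;
pub const BLOCK_PROCESSING_VELOCITY: u32 = 3;
pub const UNINCLUDED_SEGMENT_CAPACITY: u32 = 6;

parameter_types! {
    // Must leave room for 3 blocks inside one slot; the "classic"
    // SLOT_DURATION / 2 would push the third block into the next slot.
    // The test runtime in this PR uses SLOT_DURATION / 6.
    pub const MinimumPeriod: u64 = SLOT_DURATION / 3;
}

// The ConsensusHook (via can_build_upon) has the final say on how many blocks
// may be built per relay-chain block, regardless of how many cores are assigned.
pub type ConsensusHook = cumulus_pallet_aura_ext::FixedVelocityConsensusHook<
    Runtime,
    RELAY_CHAIN_SLOT_DURATION_MILLIS,
    BLOCK_PROCESSING_VELOCITY,
    UNINCLUDED_SEGMENT_CAPACITY,
>;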

skunert added the T9-cumulus (This PR/Issue is related to cumulus.) and T0-node (This PR/Issue is related to the topic “node”.) labels on Feb 13, 2025
@skunert (Contributor Author) commented Feb 14, 2025

/cmd prdoc --audience node_operator --bump major

skunert (Contributor Author):

This test is heavily inspired by the tests introduced for polkadot. However, I wanted to go with a version that is simpler to run by using the dynamic subxt feature. It is a bit more prone to breaking, but these tests should clearly fail if something changes, and it gets rid of the build.rs, env variables and zombie-metadata feature.
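For context, a minimal sketch of the kind of dynamic subxt query this enables; this is illustrative only, not the actual test code from this PR, and the exact API shape depends on the subxt version:

use subxt::{dynamic::Value, OnlineClient, PolkadotConfig};

// Query a storage item by pallet/entry name, without statically generated
// metadata types, which is what removes the need for build.rs, env variables
// and a zombie-metadata feature.
async fn best_block_number(url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let client = OnlineClient::<PolkadotConfig>::from_url(url).await?;
    let addr = subxt::dynamic::storage("System", "Number", Vec::<Value>::new());
    let number = client.storage().at_latest().await?.fetch(&addr).await?;
    println!("best block number: {:?}", number.map(|v| v.to_value()));
    Ok(())
}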

Contributor:

@skunert does it make sense to apply the same changes to the helper in the polkadot dir?
Thx!

skunert (Contributor Author):

IMO yes, we could unify all of this and get rid of the overhead. But it is a bit opinionated, and typically each team has maintained their own tests, so I did not want to make that decision alone. If @alindima also finds it useful, we can unify in a follow-up (I would like to keep the scope small here).

@michalkucharczyk (Contributor) left a comment:

1st round, will get back to it.

cumulus/zombienet/zombienet-sdk/Cargo.toml (resolved)
cumulus/zombienet/zombienet-sdk/README.md (resolved)

#[cfg(feature = "elastic-scaling-multi-block-slot")]
parameter_types! {
    pub const MinimumPeriod: u64 = SLOT_DURATION / 6;
Contributor:

dq: why 6? This gives us support for max 6 cores?

skunert (Contributor Author):

This means that the time between blocks will be at least MINIMUM_PERIOD. If the inherent gives a smaller time than PREVIOUS_TIME + MINIMUM_PERIOD, then the time is set to PREVIOUS_TIME + MINIMUM_PERIOD. In order to produce that many blocks in a single slot, you need to make sure that the minimum period does not push the timestamp into the next slot, otherwise you will be greeted with a slot mismatch in the runtime.
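A small illustrative sketch of that clamping (assumed constant values, not the pallet's actual code), showing how a "classic" MINIMUM_PERIOD of SLOT_DURATION / 2 pushes the third block of a slot into the next slot:

const SLOT_DURATION: u64 = 6_000;
const MINIMUM_PERIOD: u64 = SLOT_DURATION / 2;

/// Timestamp actually used by a block: never less than the previous timestamp
/// plus MINIMUM_PERIOD, even if the inherent reports an earlier time.
fn next_timestamp(inherent: u64, previous: u64) -> u64 {
    inherent.max(previous + MINIMUM_PERIOD)
}

fn main() {
    // Three blocks authored within the same 6-second slot (e.g. 3 cores), all
    // seeing roughly the same wall-clock time in the timestamp inherent.
    let slot_start = 12_000;
    let b1 = slot_start;
    let b2 = next_timestamp(slot_start, b1); // 15_000, still slot 2
    let b3 = next_timestamp(slot_start, b2); // 18_000, already slot 3
    assert_eq!(b3 / SLOT_DURATION, slot_start / SLOT_DURATION + 1);
    // With MINIMUM_PERIOD = SLOT_DURATION / 6 (as in the snippet above),
    // b2 = 13_000 and b3 = 14_000 both stay within slot 2.
}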

cumulus/test/runtime/Cargo.toml (resolved)
@@ -91,6 +90,7 @@ pub struct BuilderTaskParams<
 	pub authoring_duration: Duration,
 	/// Channel to send built blocks to the collation task.
 	pub collator_sender: sc_utils::mpsc::TracingUnboundedSender<CollatorMessage<Block>>,
+	pub relay_chain_slot_duration: Duration,
Contributor:

doc is coming soon, right?

Co-authored-by: Michal Kucharczyk <[email protected]>
skunert requested review from alindima and a team on February 14, 2025 17:27
skunert marked this pull request as ready for review on February 14, 2025 17:28
@skunert (Contributor Author) commented Feb 14, 2025

/cmd fmt

Comment on lines +77 to +78
let para_slots_per_relay_block =
    (relay_slot_duration.as_millis() / para_slot_duration.as_millis() as u128) as u32;
Member:

This will return 0 for a para_slot_duration > relay_slot_duration, which is not good.
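One possible guard, sketched here purely for illustration (not necessarily how the PR resolves it):

use std::time::Duration;

// Never report fewer than one parachain slot per relay-chain block, so a
// para_slot_duration longer than the relay slot no longer yields 0.
fn para_slots_per_relay_block(relay_slot_duration: Duration, para_slot_duration: Duration) -> u32 {
    ((relay_slot_duration.as_millis() / para_slot_duration.as_millis()) as u32).max(1)
}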


// Trigger at least once per relay block, if we have for example 12 second slot duration,
// we should still produce two blocks if we are scheduled on every relay block.
let mut block_production_interval = min(para_slot_duration.as_duration(), relay_slot_duration);
Member:

This doesn't really make sense. This is pure guessing, and then we could also just get rid of para_slot_duration entirely and only do it based on the scheduled blocks.

Member:

Which you are kind of doing here already anyway.

skunert (Contributor Author):

is pure guessing

The introduction of these "subslots" is of course pure guessing. There is no correct way per se; we just try to find a point in time at which we want to author, which can be anytime.

get rid of para_slot_duration

Some things to consider:

  • para_slot_duration determines which AURA slot should be outputted; this still needs to be correct.
  • I still wanted to support the case where we have a lower slot duration. For example, it would be strange if the slot duration is 1000, so 6 authors per relay block, but we only see two cores: purely using the relay slot duration and core count would lead to the first and third author authoring. In these cases I want to respect the slot duration and make the first two authors author (see the small numeric check below). Not that it really matters for the fixed-scaling use case, but I find it less surprising this way.
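A tiny numeric check of the example in the second bullet above (purely illustrative):

fn main() {
    let para_slot_ms = 1_000u64;
    // Spacing rounds by the para slot duration: rounds at 0 ms and 1000 ms hit
    // consecutive AURA slots 0 and 1.
    assert_eq!([0u64, 1_000].map(|t| t / para_slot_ms), [0, 1]);
    // Spacing purely by relay_slot_duration / cores = 3000 ms: rounds at 0 ms
    // and 3000 ms hit AURA slots 0 and 3, skipping the authors in between.
    assert_eq!([0u64, 3_000].map(|t| t / para_slot_ms), [0, 3]);
}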

pub async fn wait_until_next_slot(&self) -> Result<SlotInfo, ()> {
    let Ok(slot_duration) = crate::slot_duration(&*self.client) else {
        tracing::error!(target: LOG_TARGET, "Failed to fetch slot duration from runtime.");
        return Err(())
Member:

The function could just return an Option?

skunert (Contributor Author):

It could, but IMO this is an error where we failed to fetch the slot duration.

/// Use with care, this flag is unstable and subject to change.
#[arg(long)]
pub experimental_use_slot_based: bool,

/// Authoring style to use.
#[arg(long, default_value_t = AuthoringStyle::Lookahead)]
Member:

Why not make slot-based the default then?

skunert (Contributor Author):

Lookahead still builds on forks, and I don't want to touch the normal async-backing chains here. This is about stabilizing the elastic-scaling use case; I don't want to mess with too many things at once.

IMO we can make slot-based the default in the release after next.


/// Collator implementation to use.
#[derive(PartialEq, Debug, ValueEnum, Clone, Copy)]
pub enum AuthoringStyle {
Contributor:

nit: Would it be worth avoiding duplication of this enum? super-nit: And maybe "policy" would be better phrasing?

@@ -139,7 +141,7 @@ fn main() -> Result<(), sc_cli::Error> {
 	consensus,
 	collator_options,
 	true,
-	cli.experimental_use_slot_based,
+	use_slot_based_collator,
Contributor:

Would it make sense to pass AuthoringStyle?

name = "collator-elastic"
image = "{{COL_IMAGE}}"
command = "test-parachain"
args = ["-laura=trace,runtime=info,cumulus-consensus=trace,consensus::common=trace,parachain::collation-generation=trace,parachain::collator-protocol=trace,parachain=debug", "--force-authoring", "--authoring", "slot-based"]
Contributor:

nit: line breaks would improve readability here.

@michalkucharczyk (Contributor):

nit: maybe mentioning some recommended values in the PR description for the following items would make it easier to integrate?

The ConsensusHook implementation still determines how many blocks are allowed per relay-chain block. So if you add arbitrary cores to an async-backing, 6-second parachain, can_build_upon in the runtime will deny block-building of additional blocks.
The MINIMUM_PERIOD in the runtime needs to be configured to allow enough blocks in the slot. A "classic" configuration of SLOT_DURATION/2 will lead to slot mismatches when running with 3 cores.

pepoviola requested review from a team as code owners on February 18, 2025 11:34
@paritytech-workflow-stopper commented:
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/13389644567
Failed job name: fmt
