slot-based-collator: Allow multiple blocks per slot #7569
Conversation
/cmd prdoc --audience node_operator --bump major
This test is heavily inspired by the tests introduced for polkadot. However, I wanted to go with a version that is simpler to run by using the dynamic subxt feature. It is a bit more prone to breaking, but these tests should clearly fail if something changes, and it gets rid of the build.rs, env variables, and zombie-metadata feature.
@skunert does it make sense to apply the same changes to the helper in the polkadot dir?
Thx!
IMO yes, we could unify all of this and get rid of the overhead. But it is a bit opinionated, and typically we have had each team maintain their own tests, so I did not want to make that decision alone. If @alindima also finds it useful, we can unify in a follow-up (I would like to keep the scope small here).
1st round, will get back to it.
#[cfg(feature = "elastic-scaling-multi-block-slot")]
parameter_types! {
	pub const MinimumPeriod: u64 = SLOT_DURATION / 6;
dq: why 6? This gives us support for max 6 cores?
This means that the time between blocks will be at least MINIMUM_PERIOD. If the inherent gives a smaller time than PREVIOUS_TIME + MINIMUM_PERIOD, then the time is set to PREVIOUS_TIME + MINIMUM_PERIOD. In order to produce that many blocks in a single slot, you need to make sure that the minimum period does not push the timestamp into the next slot; otherwise you will be greeted with a slot mismatch in the runtime.
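To make the constraint concrete, here is a tiny self-contained sketch of that arithmetic (illustrative only, not the actual pallet code; the 6-second slot, the starting timestamp, and the barely-advancing wall clock are assumed values):

```rust
// Illustrative arithmetic only, not the pallet-timestamp implementation.
const SLOT_DURATION: u64 = 6000; // assumed 6-second AURA slot

fn slot_of(timestamp: u64) -> u64 {
    timestamp / SLOT_DURATION
}

fn main() {
    for minimum_period in [SLOT_DURATION / 2, SLOT_DURATION / 6] {
        let mut timestamp = 12_000; // assumed first block of the slot
        let first_slot = slot_of(timestamp); // slot 2
        for block in 2..=3u64 {
            // Clamp: the timestamp is at least PREVIOUS_TIME + MINIMUM_PERIOD.
            let wall_clock = timestamp + 1; // wall clock barely advanced
            timestamp = wall_clock.max(timestamp + minimum_period);
            println!(
                "minimum_period={minimum_period}: block {block} at {timestamp} -> slot {} (expected {first_slot})",
                slot_of(timestamp)
            );
        }
    }
    // With SLOT_DURATION / 2 the third block's timestamp is clamped to 18_000,
    // which is already the next slot: a slot mismatch. With SLOT_DURATION / 6
    // all three blocks stay within slot 2.
}
```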
@@ -91,6 +90,7 @@ pub struct BuilderTaskParams<
 	pub authoring_duration: Duration,
 	/// Channel to send built blocks to the collation task.
 	pub collator_sender: sc_utils::mpsc::TracingUnboundedSender<CollatorMessage<Block>>,
+	pub relay_chain_slot_duration: Duration,
doc is coming soon, right?
/cmd fmt
let para_slots_per_relay_block =
	(relay_slot_duration.as_millis() / para_slot_duration.as_millis() as u128) as u32;
This will return 0 for a para_slot_duration > relay_slot_duration, which is not good.
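One possible guard (a sketch, not necessarily the fix that landed in this PR) would be to clamp the result to at least one subslot:

```rust
use std::time::Duration;

// Hypothetical standalone version of the computation above; assumes a
// non-zero para slot duration. Clamping to 1 keeps a parachain whose slot
// duration exceeds the relay slot duration on at least one round per block.
fn para_slots_per_relay_block(
    relay_slot_duration: Duration,
    para_slot_duration: Duration,
) -> u32 {
    (relay_slot_duration.as_millis() / para_slot_duration.as_millis()).max(1) as u32
}
```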
// Trigger at least once per relay block, if we have for example 12 second slot duration,
// we should still produce two blocks if we are scheduled on every relay block.
let mut block_production_interval = min(para_slot_duration.as_duration(), relay_slot_duration);
This doesn't really make sense. This is pure guessing, and then we could also just get rid of para_slot_duration entirely and only do it based on the scheduled blocks.
Which you are kind of doing here already anyway.
> is pure guessing

The introduction of these "subslots" is of course pure guessing. There is no correct way per se; we just try to find a point in time when we want to author, which can be anytime.

> get rid of para_slot_duration

Some things to consider:
- para_slot_duration determines which AURA slot should be output, so this still needs to be correct.
- I still wanted to support the case where we have a lower slot duration. For example, it would be strange if the slot duration is 1000, giving 6 authors per relay block, but we only see two cores: purely using relay slot duration and core count would lead to the first and third author authoring. In these cases I want to respect the slot duration and make the first two authors author (see the sketch below). Not that it really matters for the fixed-scaling use case, but I find it less surprising this way.
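For illustration, here is how the claimed AURA slot (and thus the author index) follows from the para slot duration in that 1000 ms example; the round-robin author selection and the concrete numbers are assumptions for the sketch:

```rust
// Illustrative only: author = slot % authorities, slot = timestamp / slot duration.
fn main() {
    let para_slot_duration_ms: u64 = 1000; // 6 para slots per 6 s relay block
    let num_authorities: u64 = 6;
    let relay_block_start_ms: u64 = 60_000;

    // Two cores scheduled => two authoring rounds in this relay block.
    // Spacing the rounds by the para slot duration lets the first two
    // authors in the round-robin author:
    for round in 0..2u64 {
        let timestamp = relay_block_start_ms + round * para_slot_duration_ms;
        let slot = timestamp / para_slot_duration_ms;
        println!("round {round}: slot {slot}, author index {}", slot % num_authorities);
    }
    // Spacing by relay_slot_duration / cores (3000 ms) instead would claim
    // slots 60 and 63, skipping the authors in between.
}
```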
pub async fn wait_until_next_slot(&self) -> Result<SlotInfo, ()> {
	let Ok(slot_duration) = crate::slot_duration(&*self.client) else {
		tracing::error!(target: LOG_TARGET, "Failed to fetch slot duration from runtime.");
		return Err(())
The function could just return an Option?
It could, but IMO this is an error where we failed to fetch the slot duration.
/// Use with care, this flag is unstable and subject to change.
#[arg(long)]
pub experimental_use_slot_based: bool,

/// Authoring style to use.
#[arg(long, default_value_t = AuthoringStyle::Lookahead)]
Why not make slot-based the default then?
Lookahead still builds on forks; I don't want to touch the normal async-backing chains here. This is for stabilizing the elastic-scaling use case, and I don't want to mess with too many things at once.
IMO we can make slot-based the default in the release after next.
/// Collator implementation to use.
#[derive(PartialEq, Debug, ValueEnum, Clone, Copy)]
pub enum AuthoringStyle {
nit: Would it be worth avoiding duplication of this enum?
super-nit: And maybe "policy" would be better phrasing?
@@ -139,7 +141,7 @@ fn main() -> Result<(), sc_cli::Error> {
 	consensus,
 	collator_options,
 	true,
-	cli.experimental_use_slot_based,
+	use_slot_based_collator,
Would it make sense to pass AuthoringStyle?
name = "collator-elastic" | ||
image = "{{COL_IMAGE}}" | ||
command = "test-parachain" | ||
args = ["-laura=trace,runtime=info,cumulus-consensus=trace,consensus::common=trace,parachain::collation-generation=trace,parachain::collator-protocol=trace,parachain=debug", "--force-authoring", "--authoring", "slot-based"] |
nit: line breaks would improve readability here.
nit: maybe mentioning some recommended values in the PR description for the following items would make it easier to integrate?
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Summary: This PR enables authoring of multiple blocks in one AURA slot in the slot-based collator and stabilizes the slot-based collator.

CLI Changes

The flag `--experimental-use-slot-based` is now marked as deprecated. I opted to introduce `--authoring slot-based` instead of just removing the `experimental` prefix. By introducing the `authoring` variant, we get some future-proofing in case we want to introduce further options.
Change Description

With elastic-scaling, we are able to author multiple blocks with a single relay-chain parent. In the initial iteration, the interval between two blocks was determined by the `slot_duration` of the parachain. This PR introduces a more flexible model, where we try to author multiple blocks in a single slot if the runtime allows it.

The block authoring loop is largely the same. The `SlotTimer` now lives in a separate module and is updated with the last seen core count. It will then trigger rounds in the block-building loop based on the core count.

This allows some flexibility where elastic-scaling chains can run on a single core in quiet times. Previously, running on 1 core with a 3-core elastic-scaling chain would result in authors getting skipped because the `slot_duration` was too low.
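A rough sketch of the loop shape described above (a hypothetical simplification for illustration, not the actual SlotTimer code; the names and the exact interval rule are assumptions):

```rust
use std::time::Duration;

// Hypothetical simplification: the timer is updated with the last seen core
// count and derives the interval between block-building rounds from it.
struct SlotTimer {
    relay_slot_duration: Duration,
    last_seen_core_count: u32,
}

impl SlotTimer {
    fn update_core_count(&mut self, core_count: u32) {
        self.last_seen_core_count = core_count;
    }

    // Trigger at least once per relay block even for longer para slots, and
    // more often when multiple cores are scheduled.
    fn block_production_interval(&self, para_slot_duration: Duration) -> Duration {
        let base = para_slot_duration.min(self.relay_slot_duration);
        base / self.last_seen_core_count.max(1)
    }
}
```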
Parameter Considerations

The core logic does not change, so there are a few things to consider:
- The `ConsensusHook` implementation still determines how many blocks are allowed per relay-chain block. So if you add arbitrary cores to an async-backing, 6-second parachain, `can_build_upon` in the runtime will deny block-building of additional blocks.
- `MINIMUM_PERIOD` in the runtime needs to be configured to allow enough blocks in the slot. A "classic" configuration of `SLOT_DURATION/2` will lead to slot mismatches when running with 3 cores (see the configuration sketch below).
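For orientation, a sketch of what these two knobs might look like for a chain targeting 3 blocks per 6-second relay block (the concrete velocity/capacity values and constant names are assumptions for the example, not taken from this PR):

```rust
use frame_support::parameter_types;

// Assumed example values for a 6-second parachain targeting 3 cores.
pub const SLOT_DURATION: u64 = 6000;
pub const RELAY_CHAIN_SLOT_DURATION_MILLIS: u32 = 6000;
pub const BLOCK_PROCESSING_VELOCITY: u32 = 3; // blocks allowed per relay-chain block
pub const UNINCLUDED_SEGMENT_CAPACITY: u32 = 9;

parameter_types! {
    // Leave headroom: 3 blocks per slot must not push the timestamp into the
    // next slot (SLOT_DURATION / 2 would, as noted above).
    pub const MinimumPeriod: u64 = SLOT_DURATION / 6;
}

// The ConsensusHook gates how many blocks may be built per relay-chain block
// (this is what `can_build_upon` checks); shown as a comment since `Runtime`
// is not defined in this sketch:
// type ConsensusHook = cumulus_pallet_aura_ext::FixedVelocityConsensusHook<
//     Runtime,
//     RELAY_CHAIN_SLOT_DURATION_MILLIS,
//     BLOCK_PROCESSING_VELOCITY,
//     UNINCLUDED_SEGMENT_CAPACITY,
// >;
```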