Update Scheduler to Support Relay Chain Block Number Provider #6362
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
return
}

let mut incomplete_since = now + One::one();
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
Can you explain why it would not work with `IncompleteSince`, without the block `Queue`?
How do we determine the `MaxScheduledBlocks` bound?
With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?
Both solutions need a strategy for the situation where there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
> With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?

Yes, but then this becomes unbounded if too many blocks are skipped. The idea behind using the `Queue` is to bound this to a sufficient number.

> How do we determine the `MaxScheduledBlocks` bound?

This should be determined similarly to the existing `MaxScheduledPerBlock`?

> Both solutions need a strategy for the situation where there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.

There is already a retry mechanism, and the task is purged once the retry count is exceeded (even if it failed).
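
The trade-off discussed above can be sketched as follows. This is a minimal, std-only illustration and not the pallet's code: `Agenda`, `service_by_scanning`, and `service_by_queue` are hypothetical names, a `BTreeMap` stands in for on-chain storage, and the returned lookup count is only a rough proxy for storage reads.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for agenda storage: block number -> task names.
type Agenda = BTreeMap<u32, Vec<&'static str>>;

/// `IncompleteSince`-style catch-up: visit every block number in `since..=now`,
/// whether or not anything is scheduled there. Returns (lookups, tasks run).
fn service_by_scanning(agenda: &mut Agenda, since: u32, now: u32) -> (u32, Vec<&'static str>) {
    let mut lookups = 0;
    let mut ran = Vec::new();
    for when in since..=now {
        lookups += 1; // one lookup per block number, even for empty blocks
        if let Some(tasks) = agenda.remove(&when) {
            ran.extend(tasks);
        }
    }
    (lookups, ran)
}

/// `Queue`-style catch-up: only visit block numbers known to hold an agenda.
/// `queue` is assumed to be sorted, with its length capped elsewhere.
fn service_by_queue(
    agenda: &mut Agenda,
    queue: &mut Vec<u32>,
    now: u32,
) -> (u32, Vec<&'static str>) {
    let mut lookups = 0;
    let mut ran = Vec::new();
    // Count the leading queue entries that are already due.
    let due = queue.iter().take_while(|&&when| when <= now).count();
    for when in queue.drain(..due) {
        lookups += 1;
        if let Some(tasks) = agenda.remove(&when) {
            ran.extend(tasks);
        }
    }
    (lookups, ran)
}
```

With agendas at blocks 10 and 1200 and `now = 1200`, the scanning variant performs 1200 lookups while the queue variant performs 2. The cost moves to keeping the queue itself in storage.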
The `Queue` not only bounds how many past blocks are going to be processed. It also bounds how many blocks we can schedule for. If the number is `50`, we can schedule only `50` jobs with distinct schedule times.
`MaxScheduledPerBlock` seems simpler to define to me, because the block size is an existing constraint the system has. But how many distinct schedule time points you can have is something new.
Retries work when a certain task fails while its function call is being executed (not when the scheduler fails). I meant a case where there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time to do so) and never exits this overdue state to start processing tasks on time. Do we handle such a case?
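
The first point above, that the bound also caps the number of distinct schedule times, can be illustrated with a small sketch. `BlockQueue` and `try_schedule` are hypothetical names used only for this example; the real pallet uses a `BoundedVec` in storage, which is not modelled here.

```rust
/// Hypothetical bounded, sorted queue of distinct block numbers that hold an agenda.
struct BlockQueue {
    blocks: Vec<u32>,
    max_blocks: usize, // plays the role of `MaxScheduledBlocks`
}

impl BlockQueue {
    fn new(max_blocks: usize) -> Self {
        Self { blocks: Vec::new(), max_blocks }
    }

    /// Register `when` as a block with scheduled work.
    /// Succeeds if `when` is already queued or there is room for a new entry;
    /// fails once `max_blocks` distinct block numbers are queued.
    fn try_schedule(&mut self, when: u32) -> Result<(), &'static str> {
        match self.blocks.binary_search(&when) {
            Ok(_) => Ok(()), // block already queued, no new slot needed
            Err(_) if self.blocks.len() >= self.max_blocks => {
                Err("too many distinct scheduled blocks")
            }
            Err(pos) => {
                self.blocks.insert(pos, when); // keep the queue sorted
                Ok(())
            }
        }
    }
}
```

With a bound of 50, scheduling into a 51st distinct block number fails even if every queued block holds a single small task, whereas adding more tasks to an already queued block still succeeds.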
> The `Queue` not only bounds how many past blocks are going to be processed. It also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

Indeed, I do not find it quite comfortable to run a `for` loop with `IncompleteSince` when an unknown number of blocks could have passed between successive runs. You could always keep `MaxScheduledBlocks` on the higher side, which would give you a similar experience?

> I meant a case where there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time to do so) and never exits this overdue state to start processing tasks on time. Do we handle such a case?

But this remains an issue even in the current implementation? The change here just makes it bounded, so that the scheduling itself is blocked in such a case.
Maybe we can put quite a big bound on `MaxScheduledBlocks`; it is just a vec of block numbers.
I see, indeed it is bad for PoV, as it is read every block.
The situation we want to fix is when the scheduler is using the relay chain block, and the parachain doesn't execute often.
(1) Maybe in this case the scheduler should use a different block provider with less granularity, like `relay chain block / 100`, so that when processing `IncompleteSince` it increments in steps of 100 relay chain blocks until it arrives at now.
(2) Or otherwise we can have a more complex structure for the queue. We cut the vector into chunks of 100 blocks, so we have a double map where the first key is `block number / 100` and the second key is `block number % 100`; the value is a vector of length at most 100.
But still, if the parachain wakes up only once a month, this may not work well. At that point they should use (1).
EDIT: I agree we can also just ignore this situation with a `MaxStaleTaskAge` parameter. IMO it is fine. And people can do (1) if their parachain executes too rarely.
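
One possible reading of option (2) is sketched below, using only the standard library. `ChunkedQueue`, `CHUNK`, `insert`, and `pop_due` are hypothetical names, and the in-memory `BTreeMap` merely stands in for a storage map keyed by `block number / 100`; this is an assumption about the intended layout, not the design adopted in the PR.

```rust
use std::collections::BTreeMap;

/// Chunk width from the comment above.
const CHUNK: u32 = 100;

/// Hypothetical chunked queue: chunk index (block / 100) -> sorted offsets (block % 100).
#[derive(Default)]
struct ChunkedQueue {
    chunks: BTreeMap<u32, Vec<u32>>,
}

impl ChunkedQueue {
    /// Record that `block` has an agenda. Each chunk holds at most `CHUNK` offsets.
    fn insert(&mut self, block: u32) {
        let offsets = self.chunks.entry(block / CHUNK).or_default();
        let offset = block % CHUNK;
        if let Err(pos) = offsets.binary_search(&offset) {
            offsets.insert(pos, offset);
        }
    }

    /// Pop the earliest queued block that is `<= now`, touching a single chunk.
    fn pop_due(&mut self, now: u32) -> Option<u32> {
        let chunk = *self.chunks.keys().next()?;
        let offsets = self.chunks.get_mut(&chunk)?;
        let block = chunk * CHUNK + *offsets.first()?;
        if block > now {
            return None;
        }
        offsets.remove(0);
        if offsets.is_empty() {
            // Drop empty chunks so the first key always points at pending work.
            self.chunks.remove(&chunk);
        }
        Some(block)
    }
}
```

The point of the layout is that servicing a due block only needs the one chunk that contains it, not the whole list of scheduled block numbers.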
Added `MaxStaleTaskAge` as suggested. Thanks both.
> I see, indeed it is bad for PoV, as it is read every block.

The task scheduling is also affected.

> The situation we want to fix is when the scheduler is using the relay chain block, and the parachain doesn't execute often.

I think we have the following cases today or planned soon:
1. Relay Chain with the scheduler working with a local block provider. No concerns. The new `Queue` is even redundant;
2. Parachain with the scheduler working with a local block provider. Same as (1);
3. Parachain with the scheduler working with the Relay Chain block provider;
   3.1 runs the scheduler on every second RC block, same as (1);
   3.2 the RC or the Parachain for some reason is not producing blocks for 2 hours, so we have `1200` blocks to iterate through.

We have a problem with case (3.2) only. On the current version (without `Queue`) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, let's say if there are no tasks scheduled in that period). With the `Queue`, a situation such as (3.2) is going to be handled well, but at a cost.
I would look into the numbers: if with the current version we can handle 2 hours of overdue blocks in some reasonable time (let's say 10 blocks), then I think we are fine even with the current solution; we just need tests for it. If not, maybe we can introduce the `Queue` in a way that it can be disabled for cases (1) and (2).
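
The "look into the numbers" step can be approximated with a simple linear model. Everything here is an assumption made for illustration: the 6-second block time is inferred from the 1200-block figure above, `scanned_per_block` is a hypothetical per-block scan budget that would have to come from benchmarks, and the function names are invented for this sketch.

```rust
/// Number of block numbers missed during an outage.
fn backlog_blocks(outage_secs: u32, block_time_secs: u32) -> u32 {
    outage_secs / block_time_secs
}

/// Parachain blocks needed to clear a backlog of unscanned block numbers.
///
/// * `backlog`: block numbers between `IncompleteSince` and now.
/// * `scanned_per_block`: HYPOTHETICAL count of block numbers the scheduler
///   can step through within one parachain block.
/// * `advance_per_block`: how far the provider's block number moves per parachain block.
///
/// Returns `None` when the scheduler cannot outpace the clock and never catches up.
fn blocks_to_catch_up(
    backlog: u32,
    scanned_per_block: u32,
    advance_per_block: u32,
) -> Option<u32> {
    if backlog == 0 {
        return Some(0);
    }
    // Net progress per parachain block; `None` if the clock moves faster than we scan.
    let net = scanned_per_block.checked_sub(advance_per_block)?;
    if net == 0 {
        return None;
    }
    // Ceiling division without overflow.
    Some(backlog / net + u32::from(backlog % net != 0))
}
```

Under this model, a 2-hour outage gives `backlog_blocks(7200, 6) == 1200`, and clearing it within 10 parachain blocks in case (3.1), where the provider advances 2 blocks per parachain block, would need a scan budget of at least 122 block numbers per block. If the provider advances as fast as or faster than the scan budget, the function returns `None`.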
I just checked that we currently use `Scheduler` only for the governance-related pallets. I think the related tasks are better processed eventually than dropped for being too old. So `MaxStaleTaskAge` should be at least optional.
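
A minimal sketch of what "optional" could mean here, assuming the limit is modelled as an `Option`. The function `should_drop` and its signature are hypothetical and do not reflect how the pallet actually exposes `MaxStaleTaskAge`.

```rust
/// Decide whether an overdue task should be dropped instead of executed.
///
/// * `when`: block number the task was scheduled for.
/// * `now`: current block number from the block number provider.
/// * `max_stale_age`: `None` disables dropping, so tasks are always processed eventually.
fn should_drop(when: u32, now: u32, max_stale_age: Option<u32>) -> bool {
    match max_stale_age {
        // Drop only if the task is overdue by more than the configured age.
        Some(max_age) => now.saturating_sub(when) > max_age,
        // No limit configured: never drop, however old the task is.
        None => false,
    }
}
```

With `None`, governance tasks such as referenda alarms would be kept regardless of how late they run; with `Some(age)`, anything older than `age` blocks is discarded.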
Yes, the referenda pallet creates an alarm for every ref to check the voting turnout.

> We have a problem with case (3.2) only. On the current version (without `Queue`) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, let's say if there are no tasks scheduled in that period).

Depends on how many blocks are produced. I guess if we assume that the parachain will produce blocks at least as fast as it can advance the scheduler, then yes.
Playing devil's advocate here, since there could be parachains that only produce one block every two hours, which would get stuck without ever catching up with `IncompleteSince`.
Conceptually, I believe that a priority queue is the right data structure. We try to evaluate an ordered list of tasks in order, which is exactly what a priority queue is good at. The issue with implementing this as a vector is obviously the PoV.
Maybe we can implement the `Queue` as a B-tree? Then we can get the next task in a logarithmic number of reads and insert in a logarithmic number of writes. And it allows us to do exactly what we want: get the next pending task. It could be PoV-optimized by using chunks as well.
To me it just seems that most of the pain here is that we are using the wrong data structure for the job.
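
The priority-queue idea can be sketched with the standard library's in-memory `BTreeMap`. This is only an illustration of the access pattern ("get the next pending task"): `TaskQueue`, `schedule`, and `pop_due` are hypothetical names, and an on-chain B-tree would need its own node layout in storage, which is not shown.

```rust
use std::collections::BTreeMap;

/// Hypothetical ordered task queue: block number -> tasks scheduled for that block.
#[derive(Default)]
struct TaskQueue {
    by_block: BTreeMap<u32, Vec<&'static str>>,
}

impl TaskQueue {
    /// Insert a task; the map keeps block numbers ordered.
    fn schedule(&mut self, when: u32, task: &'static str) {
        self.by_block.entry(when).or_default().push(task);
    }

    /// Return the earliest agenda if it is due, without scanning empty block numbers.
    fn pop_due(&mut self, now: u32) -> Option<(u32, Vec<&'static str>)> {
        let (&when, _) = self.by_block.first_key_value()?;
        if when > now {
            return None; // nothing is due yet
        }
        self.by_block.pop_first()
    }
}
```

However many block numbers were skipped since the last run, each call to `pop_due` goes straight to the earliest pending agenda, which is the property argued for above.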
#[pallet::storage]
pub type IncompleteSince<T: Config> = StorageValue<_, BlockNumberFor<T>>;
/// Provider for the block number. Normally this is the `frame_system` pallet.
Normally in what case? Parachain or relay/solo?
/// Provider for the block number. Normally this is the `frame_system` pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled.
Any hints on how to configure this? Parachain teams will read this and not know what number to put.
#[pallet::constant]
type MaxScheduledBlocks: Get<u32>;

/// The maximum number of blocks that a task can be stale for.
Also maybe a hint for a sane default value.
/// The queue of block numbers that have scheduled agendas.
#[pallet::storage]
pub(crate) type Queue<T: Config> =
StorageValue<_, BoundedVec<BlockNumberFor<T>, T::MaxScheduledBlocks>, ValueQuery>;
Do we know if one vector is enough? I think the referenda pallet creates an alarm for each ref...
Step in #6297
This PR adds the ability for the Scheduler pallet to specify its source of the block number. This is needed for the scheduler pallet to work on a parachain that does not produce blocks on a regular schedule and can therefore use the relay chain as a block provider. Because blocks are not produced regularly, we cannot assume that the block number increases monotonically, and thus add new logic via a `Queue` to handle multiple blocks with a valid agenda passing between successive runs.

This change only needs a migration for the `Queue`:
- `BlockNumberProvider` continues to use the system pallet's block number
- relay chain's block number

However, we would need separate migrations if the deployed pallets are upgraded on an existing parachain and the `BlockNumberProvider` uses the relay chain block number.

Todo