Update Scheduler to Support Relay Chain Block Number Provider #6362

Open
wants to merge 19 commits into
base: master

@gupnik gupnik commented Nov 5, 2024

Step in #6297

This PR lets the Scheduler pallet specify the source of its block number. This is needed for the Scheduler pallet to work on a parachain that does not produce blocks on a regular schedule and therefore uses the relay chain as its block number provider. Because blocks are not produced regularly, we cannot assume that the block number advances by exactly one between consecutive runs, so this PR adds new logic based on a Queue to handle multiple blocks with a valid agenda that passed between runs.

This change only needs a migration for the Queue in the following cases:

  1. If the BlockNumberProvider continues to use the system pallet's block number
  2. When a pallet deployed on the relay chain is moved to a parachain, but still uses the relay chain's block number

However, we would need separate migrations if the deployed pallets are upgraded on an existing parachain, and the BlockNumberProvider uses the relay chain block number.
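
To illustrate why the source of the block number matters, here is a minimal, self-contained sketch. The trait `BlockNumberSource` and the types `LocalBlocks` and `RelayBlocks` are hypothetical stand-ins invented for this illustration; they are simplified and are not the FRAME `BlockNumberProvider` trait or anything from this PR.

```rust
/// Hypothetical, simplified stand-in for a block number source.
/// This is NOT the FRAME trait; it only illustrates the idea.
pub trait BlockNumberSource {
    fn current_block_number(&self) -> u32;
}

/// Local chain height: advances by one on every block the chain produces.
pub struct LocalBlocks(pub u32);

/// Relay chain height as seen from a parachain: may advance by several
/// blocks between two consecutive parachain blocks.
pub struct RelayBlocks(pub u32);

impl BlockNumberSource for LocalBlocks {
    fn current_block_number(&self) -> u32 {
        self.0
    }
}

impl BlockNumberSource for RelayBlocks {
    fn current_block_number(&self) -> u32 {
        self.0
    }
}

/// Number of provider blocks that elapsed since the scheduler last ran.
/// With a local source this is 1 on every run; with a relay chain source
/// it can be larger, so agendas in the skipped blocks must not be missed.
pub fn blocks_elapsed(last_serviced: u32, source: &impl BlockNumberSource) -> u32 {
    source.current_block_number().saturating_sub(last_serviced)
}
```

With a local source the scheduler sees every block number exactly once; with a relay chain source it has to account for the gap returned by `blocks_elapsed`.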

Todo

  • Update Benchmarks
  • Migration

@gupnik gupnik added the T1-FRAME This PR/Issue is related to core FRAME, the framework. label Nov 5, 2024
@gupnik gupnik requested a review from a team as a code owner November 5, 2024 09:02
@muharem muharem self-requested a review November 5, 2024 09:49
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
return
}

let mut incomplete_since = now + One::one();
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
Contributor

Can you explain why it would not work with IncompleteSince, without the block Queue?
How do we determine the MaxScheduledBlocks bound?
With IncompleteSince we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?
Both solutions need a strategy for the situation where there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.

Contributor Author

With IncompleteSince we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by much? One more read?

Yes, but then this becomes unbounded in case too many blocks are skipped. The idea behind using the Queue is to bound this to a sufficient number.

How do we determine the MaxScheduledBlocks bound?

This should be determined similarly to the existing MaxScheduledPerBlock?

Both solutions need a strategy for the situation where there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.

There is already a retry mechanism and the task is purged if the retry count is exceeded (even if failed).
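
The difference between the two approaches discussed here can be sketched in plain Rust. This is only an illustration under simplifying assumptions: the `Agenda` alias and the functions `service_by_range` and `service_by_queue` are hypothetical, use an in-memory map instead of pallet storage, and ignore weight limits.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory agenda: block number -> task names.
pub type Agenda = BTreeMap<u32, Vec<&'static str>>;

/// Walks every block number from `incomplete_since` to `now`, the way a
/// loop driven by IncompleteSince would. Returns (due tasks, lookups).
pub fn service_by_range(
    agenda: &Agenda,
    incomplete_since: u32,
    now: u32,
) -> (Vec<&'static str>, u32) {
    let mut tasks = Vec::new();
    let mut lookups = 0;
    for when in incomplete_since..=now {
        lookups += 1; // one agenda read per block, even if it is empty
        if let Some(list) = agenda.get(&when) {
            tasks.extend(list.iter().copied());
        }
    }
    (tasks, lookups)
}

/// Visits only the block numbers recorded in a sorted `queue`.
/// Returns (due tasks, lookups).
pub fn service_by_queue(
    agenda: &Agenda,
    queue: &[u32],
    now: u32,
) -> (Vec<&'static str>, u32) {
    let mut tasks = Vec::new();
    let mut lookups = 0;
    for when in queue.iter().take_while(|w| **w <= now) {
        lookups += 1; // one agenda read per block that has an agenda
        if let Some(list) = agenda.get(when) {
            tasks.extend(list.iter().copied());
        }
    }
    (tasks, lookups)
}
```

In this sketch the range walk costs one lookup per skipped block, while the queue walk costs one lookup per block that actually has an agenda, at the price of reading the queue itself.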

Contributor

The Queue not only bounds how many past blocks are going to be processed. It also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

MaxScheduledPerBlock seems simpler to define to me, because the block size is an existing constraint the system has. But how many distinct schedule time points you can have is something new.

Retries work when a certain task fails while its function call is being executed (not when the scheduler itself fails). I meant a case where there are many (or few but too heavy) overdue tasks (task_block < now), so that the scheduler never manages (or needs too much time) to complete them and exit this overdue state to start processing tasks on time. Do we handle such a case?
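
The point about the bound on distinct schedule times can be shown with a small sketch. The `Queue` struct below is a hypothetical in-memory model (a sorted `Vec` with a capacity), not the pallet's `BoundedVec` storage item, and the error string is made up for the example.

```rust
/// Hypothetical model of a sorted, bounded list of block numbers that
/// have a non-empty agenda.
pub struct Queue {
    blocks: Vec<u32>,
    max_scheduled_blocks: usize,
}

impl Queue {
    pub fn new(max_scheduled_blocks: usize) -> Self {
        Queue { blocks: Vec::new(), max_scheduled_blocks }
    }

    /// Registers `when` as a block with an agenda. A block that is already
    /// queued costs nothing; a new distinct block fails once the bound
    /// is reached.
    pub fn try_insert(&mut self, when: u32) -> Result<(), &'static str> {
        match self.blocks.binary_search(&when) {
            Ok(_) => Ok(()), // this block already has an agenda
            Err(pos) => {
                if self.blocks.len() >= self.max_scheduled_blocks {
                    return Err("too many distinct scheduled blocks");
                }
                self.blocks.insert(pos, when); // keep the list sorted
                Ok(())
            }
        }
    }

    /// Removes and returns all queued blocks that are due (`<= now`),
    /// in ascending order.
    pub fn take_due(&mut self, now: u32) -> Vec<u32> {
        let split = self.blocks.partition_point(|b| *b <= now);
        self.blocks.drain(..split).collect()
    }
}
```

With a bound of 2, a third distinct block number is rejected until one of the queued blocks has been serviced, which is the scheduling limit described in the comment.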

Contributor Author

The Queue not only bounds how many past blocks are going to be processed. It also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

Indeed, I am not quite comfortable running a for loop with IncompleteSince when an unknown number of blocks could have passed between successive runs. You could always keep MaxScheduledBlocks on the higher side, which would give you a similar experience?

I meant a case where there are many (or few but too heavy) overdue tasks (task_block < now), so that the scheduler never manages (or needs too much time) to complete them and exit this overdue state to start processing tasks on time. Do we handle such a case?

But this stays as an issue even in the current implementation? The change here just makes it bounded, so that the scheduling itself is blocked in such a case.

Contributor

Maybe we can put quite a big bound on MaxScheduledBlocks; it is just a vec of block numbers.

Contributor

@gui1117 gui1117 Nov 15, 2024

I see, indeed it is bad for PoV, as it is read every block.

The situation we want to fix is when the scheduler is using the relay chain block, and the parachain doesn't execute often.

(1) Maybe in this case the scheduler should use a different block provider with less granularity, like relay chain block / 100, so that when processing IncompleteSince it increments in steps of 100 relay chain blocks until it arrives at now.

(2) Or otherwise we can have a more complex structure for the queue. We cut the vector into chunks of 100 blocks.
So we have a double map where the first key is block number / 100 and the second key is block number % 100, and the value is a vector of length at most 100.

But if the parachain wakes up only every month, this still may not be good. At that point they should use (1).

EDIT: I agree we can also just ignore this situation with a MaxStaleTaskAge parameter. IMO it is fine. And people can do (1) if their parachain executes too rarely.
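
The arithmetic behind options (1) and (2) can be sketched as follows. The constant `STEP` and the function names are hypothetical and only mirror the "/ 100" and "% 100" example from the comment; nothing here is taken from the PR's code.

```rust
/// Hypothetical granularity, taken from the "/ 100" example above.
pub const STEP: u32 = 100;

/// Option (1): a coarser block number derived from the relay chain block.
pub fn coarse_block_number(relay_block: u32) -> u32 {
    relay_block / STEP
}

/// Option (2): split a block number into the two keys of a double map,
/// (chunk index, offset within the chunk).
pub fn chunk_key(block: u32) -> (u32, u32) {
    (block / STEP, block % STEP)
}

/// Number of coarse steps an IncompleteSince-style loop would need to
/// walk from `incomplete_since` to `now` under option (1).
pub fn steps_to_catch_up(incomplete_since: u32, now: u32) -> u32 {
    coarse_block_number(now).saturating_sub(coarse_block_number(incomplete_since))
}
```

Under these assumptions a gap of 1200 relay chain blocks shrinks to 12 coarse steps, at the cost of tasks only being schedulable with a resolution of 100 relay chain blocks.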

Contributor Author

Added MaxStaleTaskAge as suggested. Thanks both.

Contributor

I see, indeed it is bad for PoV, as it is read every block.

Also the task scheduling is affected.

The situation we want to fix is when the scheduler is using the relay chain block, and the parachain doesn't execute often.

I think we have the following cases today or planned soon:

  1. Relay Chain with the scheduler working with the local block provider. No concerns. The new Queue is even redundant;
  2. Parachain with the scheduler working with the local block provider. Same as (1);
  3. Parachain with the scheduler working with the Relay Chain block provider;
    3.1 runs the scheduler on every second RC block, same as (1);
    3.2 the RC or the Parachain for some reason does not produce blocks for 2 hours, so we have 1200 blocks to iterate through.

We have a problem with case (3.2) only. On the current version (without the Queue) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, let's say if there are no tasks scheduled in that period). With the Queue, a situation such as (3.2) is going to be handled well, but at a cost.

I would look into the numbers: if with the current version we can handle 2 hours of overdue blocks in some reasonable time (let's say 10 blocks), then I think we are fine even with the current solution, and we just need tests for it. If not, maybe we can introduce the Queue in a way that it can be disabled for cases (1) and (2).
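
The "how many blocks will it take" calculation can be sketched with a simple model. Everything here is an assumption made for illustration: the function names, the idea that each scheduler run can advance IncompleteSince by a fixed number of empty block numbers, and the 6-second block time implied by "2 hours, 1200 blocks" (7200 s / 6 s).

```rust
/// Backlog, in provider blocks, accumulated during an outage.
pub fn backlog_blocks(outage_secs: u32, block_time_secs: u32) -> u32 {
    outage_secs / block_time_secs
}

/// Hypothetical model: each scheduler run advances IncompleteSince by
/// `advance_per_run` block numbers, while `elapsed_per_run` new provider
/// blocks arrive between runs. Returns the number of runs needed to clear
/// `backlog`, or `None` if the scheduler never catches up.
pub fn runs_to_catch_up(
    backlog: u32,
    advance_per_run: u32,
    elapsed_per_run: u32,
) -> Option<u32> {
    if advance_per_run <= elapsed_per_run {
        return None; // the backlog never shrinks
    }
    let net = advance_per_run - elapsed_per_run;
    Some((backlog + net - 1) / net) // ceiling division
}
```

Under this model, clearing 1200 blocks in 10 runs while 2 new relay chain blocks arrive per run requires advancing 122 block numbers per run; whether that fits the weight budget is exactly the number that would need to be measured.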

Contributor

I just checked that we currently use the Scheduler only for governance-related pallets. I think it is better for such tasks to be processed eventually than dropped for being too old. So MaxStaleTaskAge should at least be optional.

Member

@ggwpez ggwpez Nov 27, 2024

Yes, the referenda pallet creates an alarm for every ref to check the voting turnout.

We have a problem with case (3.2) only. On the current version (without the Queue) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, let's say if there are no tasks scheduled in that period).

Depends on how many blocks are produced. I guess if we assume that the parachain will produce blocks at least as fast as it can advance the scheduler, then yes.
Playing devil's advocate here, since there could be parachains that only produce one block every two hours, which would get stuck without IncompleteSince ever catching up.

Conceptually, I believe that a priority queue is the right data structure. We try to evaluate an ordered list of tasks in order. That is exactly what a priority queue is good at. The issue with implementing this as a vector is obviously the PoV.

Maybe we can implement the Queue as a B-tree? Then we can get the next task in a logarithmic number of reads and insert in a logarithmic number of writes. And it allows us to do exactly what we want: get the next pending task. It could be PoV-optimized by using chunks as well.
To me it just seems that most of the pain here is that we are using the wrong data structure for the job.
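
A minimal sketch of the "get the next pending task" idea, using the standard library's in-memory `BTreeMap` as a stand-in. The `TaskQueue` type and its methods are hypothetical; a storage-backed B-tree in a runtime would look different, and this sketch does not model PoV cost at all.

```rust
use std::collections::BTreeMap;

/// Hypothetical ordered task queue keyed by the block number at which
/// the tasks are due.
pub struct TaskQueue {
    tasks: BTreeMap<u32, Vec<String>>,
}

impl TaskQueue {
    pub fn new() -> Self {
        TaskQueue { tasks: BTreeMap::new() }
    }

    /// Schedules `task` for block `when` (logarithmic insert).
    pub fn schedule(&mut self, when: u32, task: &str) {
        self.tasks.entry(when).or_default().push(task.to_string());
    }

    /// Pops the earliest scheduled block and its tasks if it is due
    /// (`<= now`). Skipped block numbers are never visited.
    pub fn pop_due(&mut self, now: u32) -> Option<(u32, Vec<String>)> {
        let when = *self.tasks.keys().next()?; // smallest key
        if when > now {
            return None; // nothing is due yet
        }
        self.tasks.remove(&when).map(|list| (when, list))
    }
}
```

Calling `pop_due(now)` in a loop until it returns `None` processes overdue agendas in order without iterating over empty block numbers, regardless of how large the gap since the last run was.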

@paritytech-review-bot paritytech-review-bot bot requested a review from a team November 11, 2024 10:51
@gupnik gupnik changed the title [WIP]: Update Scheduler to Support Relay Chain Block Number Provider #3970 Update Scheduler to Support Relay Chain Block Number Provider Nov 14, 2024

#[pallet::storage]
pub type IncompleteSince<T: Config> = StorageValue<_, BlockNumberFor<T>>;
/// Provider for the block number. Normally this is the `frame_system` pallet.
Member

Normally in what case? Parachain or relay/solo?

/// Provider for the block number. Normally this is the `frame_system` pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled.
Member

Any hints on how to configure this? Parachain teams will read this and not know what number to put.

#[pallet::constant]
type MaxScheduledBlocks: Get<u32>;

/// The maximum number of blocks that a task can be stale for.
Member

Also maybe a hint for a sane default value.

/// The queue of block numbers that have scheduled agendas.
#[pallet::storage]
pub(crate) type Queue<T: Config> =
StorageValue<_, BoundedVec<BlockNumberFor<T>, T::MaxScheduledBlocks>, ValueQuery>;
Member

Do we know if one vector is enough? I think the referenda pallet creates an alarm for each ref...

