Add PVF execution priority #4837
Conversation
Co-authored-by: Andrei Sandu <[email protected]>
Updated the logic according to #4632 (comment), which looks reasonable.
First pass: good job, I like the simplicity. I left you some comments, nothing major.
As we discussed online, it would be nice to run some integration tests to understand how the system behaves under limit conditions. I think we can simulate that by making the PVF execution longer, so that we fall behind on work.
Testing results added to the description.
Thank you @AndreiEres. This is very good work and the test results show it. Just have a few more nits here and there, but otherwise good to go from my side.
@burdges any concerns with the percentages here:
// If we used parts of the original 100%, approvals can't take more than 24%,
fn is_fulfilled(&self, priority: &PvfExecPriority) -> bool {
This name is a bit confusing in the absence of comments. Why not actually return whether we can pick up a work item in the specified priority queue?
Changed the name: we have to check whether we've reached a threshold for the specified priority.
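For readers following the thread, here is a minimal, simplified sketch of what such a threshold check could look like. The `Unscheduled` struct, its counter bookkeeping, and the integer rounding are assumptions for illustration, not the PR's literal code:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum PvfExecPriority {
    Dispute,
    Approval,
    BackingSystemParas,
    Backing,
}

/// Queue state with per-priority execution counters (illustrative).
struct Unscheduled {
    /// How many executions each priority consumed in the current window.
    counter: HashMap<PvfExecPriority, usize>,
}

impl Unscheduled {
    /// Configured shares in percent. Note: this simplified sketch applies
    /// each share to the total count, while the PR's design applies it to
    /// what is left over from higher priorities.
    const THRESHOLDS: &'static [(PvfExecPriority, usize)] = &[
        (PvfExecPriority::Dispute, 70),
        (PvfExecPriority::Approval, 80),
        (PvfExecPriority::BackingSystemParas, 100),
        (PvfExecPriority::Backing, 100),
    ];

    /// True once `priority` has consumed its allowed share of the
    /// executions counted so far, meaning lower priorities get a turn.
    fn has_reached_threshold(&self, priority: PvfExecPriority) -> bool {
        let total: usize = self.counter.values().sum();
        if total == 0 {
            return false;
        }
        let used = self.counter.get(&priority).copied().unwrap_or(0);
        let limit = Self::THRESHOLDS
            .iter()
            .find(|(p, _)| *p == priority)
            .map(|(_, percent)| total * *percent / 100)
            .unwrap_or(total);
        used >= limit
    }
}
```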
// A threshold in percentages: the portion a current priority can "steal" from lower ones.
// For example:
// Disputes take 70%, leaving 30% for approvals and all backings.
// 80% of the remaining goes to approvals, which is 30% * 80% = 24% of the original 100%.
This way of expressing the percentages is not easy. You have to do this math every time. Can it be made simpler, so the reader doesn't have to do much math?
I tried to update the docs to make it clearer. Unfortunately I didn't come up with a better solution.
///
/// This system might seem complex, but we operate with the remaining percentages because:
/// - Not all job types are present in each block. If we used parts of the original 100%,
///   approvals could not exceed 24%, even if there are no disputes.
I have trouble parsing "even if there are no disputes" here. If there are no backing jobs, then approvals take like 80% of the total, right?
Do you mean that we will skip jobs if we only have approvals in the queue and they have reached 80%? If every current priority has reached its limit, we will select the highest one to execute. In that case, we will continue picking approvals in the absence of other jobs.
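To make the arithmetic in this exchange concrete, a tiny illustrative snippet (the numbers come straight from the thresholds quoted above):

```rust
fn main() {
    // With disputes present, approvals get 80% of the 30% left over:
    let approvals_with_disputes = 30 * 80 / 100; // 24% of the total
    // With no disputes in the queue, approvals get 80% of everything:
    let approvals_without_disputes = 100 * 80 / 100; // 80% of the total
    println!("{approvals_with_disputes}% vs {approvals_without_disputes}%");
}
```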
/// - Not all job types are present in each block. If we used parts of the original 100%,
///   approvals could not exceed 24%, even if there are no disputes.
/// - We cannot fully prioritize backing system parachains over backing other parachains based
///   on the distribution of the original 100%.
Yeah, we might have some gradations among the different system parachains, so do whatever you think is best for now. AssetHub could maybe halt without killing anything, although this depends upon how it's used. Among the chains that must run "enough" during an epoch:
- Sassafras tickets collection, if we ever moved it onto a parachain (maybe never).
- Validator elections should maintain a fresh validator set, even if we've lost many for some reason.
- If validator elections runs then we must run bridgehub enough, and any future DKG chains.
- Some governance chain without smart contracts (collectives?).
Yeah, this is just a first step. I agree it needs to be reconsidered once we start to move governance and staking out of the relay chain.
However, I think a better algorithm would actually target specific block times for the system parachains in times of overload.
Afaik, critical system parachains should typically not require numerous blocks, specific block times, etc., but some require like one block from each validator during the epoch, including enough slack so spam cannot censor them.
This assumes the critical system parachains cannot be tricked into wasting their own blocks, which likely becomes a risk only if the d-day governance supports smart contracts.
Looks very good, thank you! 👍
const PRIORITY_ALLOCATION_THRESHOLDS: &'static [(PvfExecPriority, usize)] = &[
    (PvfExecPriority::Dispute, 70),
    (PvfExecPriority::Approval, 80),
    (PvfExecPriority::BackingSystemParas, 100),
    (PvfExecPriority::Backing, 100),
];
How cool would it be to put that into the parachain host config? (Not in this PR ofc, just an idea)
That's a good question. I believe we'll have more information to make a decision after using it in production for a while. @sandreim WDYT?
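For context, a rough sketch of how these constants could drive queue selection, reusing the `Unscheduled` sketch from earlier in this conversation plus a hypothetical `has_pending` helper. It mirrors the behaviour described above, where the highest priority wins once every queue has hit its limit:

```rust
/// Priorities from highest to lowest, matching the threshold order.
const PRIORITIES: [PvfExecPriority; 4] = [
    PvfExecPriority::Dispute,
    PvfExecPriority::Approval,
    PvfExecPriority::BackingSystemParas,
    PvfExecPriority::Backing,
];

/// Returns `None` only when no queue has pending work at all.
fn select_next_priority(unscheduled: &Unscheduled) -> Option<PvfExecPriority> {
    // Prefer the highest priority that still has pending work and has
    // not yet consumed its share of recent executions...
    PRIORITIES
        .iter()
        .copied()
        .find(|p| unscheduled.has_pending(*p) && !unscheduled.has_reached_threshold(*p))
        // ...otherwise fall back to the highest priority with any work,
        // so nothing is starved once every queue is over its limit.
        .or_else(|| PRIORITIES.iter().copied().find(|p| unscheduled.has_pending(*p)))
}
```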
/// to separate and prioritize execution jobs by request type.
/// The order is important, because we iterate through the values and assume it goes from
/// highest to lowest priority.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, EnumIter)]
Given that we are already leaking to validation what we want to have validated here, we should likely keep the old name PvfExecKind and infer "priority" from it, together with other information, e.g. whether a validation is still worthwhile (the TTL PR based on this one).
* master: (28 commits)
  - `substrate-node`: removed excessive polkadot-sdk features (#5925)
  - Rename QueueEvent::StartWork (#6015)
  - [ci] Remove quick-benchmarks-omni from GitLab (#6014)
  - Set larger timeout for cmd.yml (#6006)
  - Fix `0003-beefy-and-mmr` test (#6003)
  - Remove redundant XCMs from dry run's forwarded xcms (#5913)
  - Add RadiumBlock bootnodes to Coretime Polkadot Chain spec (#5967)
  - Bump strum from 0.26.2 to 0.26.3 (#5943)
  - Add PVF execution priority (#4837)
  - Snowbridge V2 docs (#5902)
  - Fix u256 conversion in BABE (#5994)
  - [ci] Move test-linux-stable-no-try-runtime to GHA (#5979)
  - Bump PoV request timeout (#5924)
  - [Release/CI] Github flow to build `polkadot`/`polkadot-parachain` rc binaries and deb package (#5963)
  - [ci] Remove short-benchmarks from Gitlab (#5988)
  - Disable flaky tests reported in 5972/5973/5974 (#5976)
  - Bump some dependencies (#5886)
  - bump zombienet version and set request for k8s (#5968)
  - [omni-bencher] Make all runtimes work (#5872)
  - Omni-Node renamings (#5915)
  - ...
Resolves #4632
The new logic optimizes the distribution of execution jobs across disputes, approvals, and backings. Testing shows reduced finality lag and faster candidate checking, especially under heavy network load.
Approach
This update adds prioritization to the PVF execution queue. The logic partially implements the suggestions from #4632 (comment).
We use thresholds to determine how much a current priority can "steal" from lower ones:
a threshold indicates the portion of executions that the current priority may claim from the priorities below it.
For example, assuming a maximum of 12 executions per block, with a 6-second window, 2 CPU cores, and a 2-second run time, we get distributions roughly like those sketched below.
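As a rough illustration of those distributions when every queue has pending work (the integer rounding here is an assumption of the sketch):

```rust
fn main() {
    let total = 12_usize; // executions available per block, as above
    let disputes = total * 70 / 100; // up to 8 executions for disputes
    let approvals = (total - disputes) * 80 / 100; // up to 3 for approvals
    let backing = total - disputes - approvals; // 1 left for backing jobs
    println!("disputes={disputes}, approvals={approvals}, backing={backing}");
}
```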
It's worth noting that when there are no disputes and only one backing job, we continue processing approvals regardless of their fulfillment status.
Versi Testing 40/20
Testing showed a slight difference in finality lag and candidate checking time between this pull request and its base on the master branch. The more loaded the network, the greater the observed difference.
Testing Parameters:
Versi Testing 80/40
For this test, we compared the master branch with the branch from #5616. The second branch is based on the current one but removes backing jobs that have exceeded their time limits. We excluded malicious nodes to reduce noise from disputes and validator bans. The results show that, under the same load, nodes experience less finality lag and reduced recovery and check times. Parachain block times also improved, although they remain over 6 seconds.
Testing Parameters: