Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assignment of availability-chunk indices to validators #47
Assignment of availability-chunk indices to validators #47
Changes from 7 commits
e00b475
52903f3
a3ccf05
a196ddc
c57a8b6
b3a4868
28511f6
303368b
2395237
90b0a50
4ae7529
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should obviously not rely in consensus on something that is not specified to be deterministic, like https://polkadot.network/blog/a-polkadot-postmortem-24-05-2021#the-good. And try to reduce third-party dependencies in general.
I would assume that a library that implements
ChaCha8Rng
adheres to the spec, otherwise it's a bug.To mitigate supply-chain issues including bugs, we should probably use
cargo-vet
or a similar tool, but again, this is out of scope of and not limited to this RFC.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. What we would want is some version/spec identifier. That is only allowed to change at session boundaries. Then we can put a requirements on nodes to implement that spec and once enough clients did, we can do the switch.
While we are at it, this should probably take into account the used erasure coding itself as well. We should be able to swap it out for a better implementation if the need arises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is off-topic for this RFC, but as a heads up we already use
ChaCha20
andshuffle
for the validators gossip topology: https://github.com/paritytech/polkadot-sdk/blob/2d09e83d0703ca6bf6aba773e80ea14576887ac7/polkadot/node/network/gossip-support/src/lib.rs#L601-L610There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We'll use the new
NodeFeatures
runtime API for that: paritytech/polkadot-sdk#2177If we'll make changes in the future to either the shuffling algorithm or the underlying reed-solomon algorithm, we can add a new feature bit there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? Can we not deterministically arrive at an assignment based on block number and para ids, in a way that perfectly evens out load?
For example: We use
block number % n_validators
as a starting index. Then we takethreshold
validators starting from that position to be systemic chunk indices for the first para in that block. The next para starts at the additional offsetthreshold
, soblock_number + threshold % n_validators
and so on.We could shuffle validator indices before doing that, but I don't see how this gains us anything.
Now, using information that is not available as part of the candidate receipt was part of the problem we wanted to avoid. What is the "next" para, this information is not necessarily available in disputes. Essentially this means we are using the core number.
But:
There are other cases, e.g. collators recovering availability because the block author is censoring. But also those should be exceptional. If systemic chunk recovery would not be possible here, it would not be a huge problem either. On top of the fact that this should also not be a common case, the recovery here is also not done by validators - so worse recovery performance would not be an issue to the network.
Summary:
Systemic chunk recovery should only be important/relevant in approval voting, where we have to recover a whole lot of data every single relay chain block.
Therefore it would be good, if systemic chunks worked always, but I would consider it totally acceptable if it were an optimization that would only be supported in approval voting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid paraid here imho, not overly keen on relay parent either.
Instead use era/session, slot, and chain spec to define the validator sequence, and then core index to define the start position in the validator sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validators are already randomized at the beginning of the session. Core index for start position makes sense. What is wrong with using the block number in addition?
Reason, I would like to have it dependent on the block (could also be slot, I just don't see the benefit) is that by having a start position by core index, we ensure equal distribution of systemic chunks across a block, but paras are not all equal. Some could be heavier than others, hence it would be beneficial to change the mapping each block.
In my opinion we could also use the hash instead of the block number, here I think anything is likely better than static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want slot or block I think. We're not overly adversarial here, but you can manipulate block numbers easily, while slot numbers represent an opportunity for doing something, so you always pay 19 DOTs or more to chose another one. I'd say slot not block.
We randomize the validator list based upon the randomness two epochs/sessions ago? Cool. Any idea if we similarly randomize the map from paraids to cores per era/session too? If yes, then maybe we could just progress sequentially through them?
We'd prefer the randomization of core_id for this because otherwise you could still make hot spots. We could've hot spots even with this scheme, but not so easy to make them. We'd avoid those if we randomize per slot.
Also this exactly computation does not work due to signed vs unsigned, and the fact that it suggests things progress backwards as time progresses time, which again tried to avoid hot spots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT from the scheduler code, we don't (at least for the parachain auction model; for on-demand, I see more complicated logic which takes into account core affinities).
I'm having a hard time understanding the advantage of slot number vs block number. This may be simply because I don't know that much about slots. AFAICT, slots are equally useful as block number for the mapping function (monotonically increasing by 1), except that there may be slots that go unoccupied (if the chain is stalled for example) and are therefore skipped. Is that correct?
what is this fee? where can I read more about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@burdges I don't get your arguments. How can block numbers be manipulated? They are always increasing by 1, you clearly must be talking about forks. So an adversary could create multiple forks, with all the same load distribution and tries to overload validators this way? If we create forks or reversions we already have a performance problem anyway.
Really not getting how slot numbers are better here. I also don't get your argument about hot spots, my proposal above was precisely to avoid hot spots (by not using randomness). What do you mean by hot spots and how would randomness help here?
A single validator not providing its systemic chunk would be enough to break systemic recovery. I don't see how randomization schemes help here. If we were smart we could somehow track the validators withholding systemic chunks and then make an assignment where they are all pooled together into one candidate. This way, at least only one systemic recovery fails. (We could equally well just remove them entirely.)
But honestly, I would not worry too much about this here. If we ever found that validators try to mess with this on purpose, the threat is low enough that social (calling them out) and governance would be adequate measures to deal with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the ideal would be to use the relay parent hash directly. Then we don't even need to be able to lookup the block number to determine systemic chunk indices. We will eventually have the core index in the candidate receipt - it would be really good to have this self contained at least once we have that.
Obviously hashes can be influenced trivially by block authors ... but is this a real issue? Assuming we have enough cores, load will be pretty much evenly distributed among validators no matter the hash. The only thing that changes is to which core one gets assigned. I don't mind too much if this can be influenced ... Are there any real concerns here?*)
*) As the system matures, I would assume (especially with CoreJam or similar developments) that candidates will be pretty evened out in load (maxed out). So a validator should not gain much by picking which parachain it wants to have a systemic chunk for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly offtopic: In my WIP PR I added functionality to request up to 5 systematic chunks from the backing group as a backup solution, so that a couple of validators not returning their systematic chunks would not invalidate the entire procedure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could permit requesting them all from the backing group, but make the availability providers the first choice, meaning maybe: 1st, try all systemic chunk providers. 2nd, try remaining systemic chunks from backers. 3rd, fetch random non-systemic chunks. The concern is just that we overload the backers.
It'll also make rewards somewhat more fragile, but likely worth the difficulty for the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as recovery is possible in disputes, it should be fine if it can not always use systemic recovery. What is missing for me, is to understand why this needs to be an incompatible change. Shouldn't recovery be possible always, even if you don't know what the systemic chunks are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, when you request a chunk, you need to know which chunk to request. If you specify a wrong chunk, you'll get an empty response. So everyone should be onboard how chunks are shuffled in order for recovery to work, not just from systematic chunks. Unless we rely on backers to have all the chunks, which we can't in case of a dispute.
We could add an additional request type to request any chunk you have. That could probably be a reasonable fallback in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this achieves the purpose of availability-recovery working with an arbitrary mix of upgraded & non-upgraded validators. Adding a new request type (or even changing the meaning of the existing one) would also mean a node upgrade (because responding to the new type of request is only useful for the nodes that haven't yet upgraded to use the new mapping). So validators might as well just upgrade to the version that uses the new shuffle. I'm not sure this has any benefit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant as a fallback in case we want to use
core_id
s which might be not readily available in some cases of disputes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In this case, it would be a reasonable fallback. But this would mean giving up systematic recovery in those cases. I'm wondering if it's worth doing all of this just to replace the para-id thingy with the core_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh fine. We'll loose the mappings sometime before the data expires? We should probably avoid doing that. We can obviously ask everyone and account for the DLs differently, but I'd prefer to be more like bittorrent trackers here, so that parachains can fetch data this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can happen if a validator imports disputes statements for a dispute ongoing on some fork that the node has not imported yet.
Or even if a validator was just launched and has no data recorded about what core the block was occupying while pending availability.
@ordian or @eskimor correct me if I'm wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. All validators will know the correct mapping, if they have seen (and still know) the relay parent. There is a good chance that this is the case even in disputes, but it is not guaranteed and recovery must work even in that case (but can be slower).
Using the core index should be fine, what makes that even more acceptable to me is that we actually want (and need to) include the core index in the candidate receipt at some point. So the moment we have this change, then the block availability will cease to matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's needed for CoreJam I would assume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elastic scaling also.