Assignment of availability-chunk indices to validators #47
Does this remove the end zero padding if odd length?
Anyway, we should be careful about the boundary here; maybe this should live in the erasure coding crate.
yes, scale decoding ignores the trailing zeros
In order to know the exact number of zeroed bytes added as padding, we need to know the size of the input data that was encoded.
Unfortunately, we don't have easy access to that in polkadot, unless we add it to the CandidateReceipt. But SCALE decoding works regardless.
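To illustrate (a minimal sketch only; it assumes parity-scale-codec's usual `Decode` behavior, and `AvailableData` here is just a stand-in struct, not the real polkadot type):

```rust
// Sketch: SCALE decoding only consumes the bytes the encoded value needs, so any
// trailing zero padding left over from erasure reconstruction is simply ignored.
// Requires the `derive` feature of `parity-scale-codec`.
use parity_scale_codec::{Decode, Encode};

#[derive(Encode, Decode, PartialEq, Debug)]
struct AvailableData {
    // Stand-in field; the real type carries the PoV and persisted validation data.
    pov: Vec<u8>,
}

fn main() {
    let original = AvailableData { pov: vec![1, 2, 3] };
    let mut bytes = original.encode();

    // Simulate the zeroed padding that reconstruction may append.
    bytes.extend_from_slice(&[0u8; 5]);

    // `decode` reads only the encoded prefix; the padding is left unread.
    let decoded = AvailableData::decode(&mut &bytes[..]).expect("decodes despite padding");
    assert_eq!(decoded, original);
    // `decode_all` would fail here instead, since it requires the whole buffer to be consumed.
}
```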
I ran some roundtrip quickcheck tests on regular chunk reconstruction. The regular reed-solomon code can also leave extra zeroed padding when reconstructing, so truncation to the expected size was already needed.
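For reference, a minimal sketch of that truncation step (the reconstruction call itself is elided; only the padding handling is shown):

```rust
// Sketch: the reconstructed blob may be the original data followed by extra
// zeroed padding, so the caller truncates to the length it already knows.
fn truncate_to_expected(mut reconstructed: Vec<u8>, expected_len: usize) -> Vec<u8> {
    debug_assert!(reconstructed.len() >= expected_len);
    debug_assert!(reconstructed[expected_len..].iter().all(|b| *b == 0));
    reconstructed.truncate(expected_len);
    reconstructed
}

#[test]
fn truncation_recovers_original() {
    // Roundtrip-style check in the spirit of the quickcheck tests mentioned above.
    let original = b"some available data".to_vec();
    let mut padded = original.clone();
    padded.extend_from_slice(&[0u8; 7]); // padding a reconstruction might add
    assert_eq!(truncate_to_expected(padded, original.len()), original);
}
```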
If we used block hash (relay parent), then by putting the core index into the candidate receipt we would have everything needed for the lookup to be self contained.
Ok, using the relay parent of the candidate is not ideal: with async backing, the relay parent used could vary from candidate to candidate in the same block, which means that the supposedly equal load distribution might end up not being that equal after all.
In theory, we could avoid using block hash or number entirely and just use the core index. Assuming all cores are occupied, it would just mean a fixed mapping within a session from validators to bulk parachains' chunks (on-demand would still rotate their mapping to cores, I assume). That way it might be easier to screw (intentionally or not) the systematic recovery for a particular parachain for an entire session. OTOH, we need to handle one (or two) missing systematic recovery chunks in practice and fall back to using backers to some extent. Maybe worth mentioning the fallback bit in the RFC.
Yeah, as you said, I think it would be too easy to screw up. It could also result in a not-so-even load distribution, because sessions can be quite long and some validators would be unlucky enough to be assigned to a high-throughput para for several hours. We also don't know which validators are lazier or ill-intentioned, and switching only once per session would make this visible for some parachains more than others.
Yes, I added this bit to the RFC. I suggest we request at most one chunk from each validator in the backing group as a fallback.
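For illustration, a rough sketch of that fallback (hypothetical names, not the actual subsystem API): for each systematic chunk we failed to fetch, ask a distinct backer, at most one chunk per backer.

```rust
use std::collections::BTreeMap;

type ValidatorIndex = u32;
type ChunkIndex = u32;

// Hypothetical helper: pair each missing systematic chunk with a distinct backer,
// so no backer is asked for more than one chunk.
fn plan_backer_fallback(
    missing_systematic: &[ChunkIndex],
    backing_group: &[ValidatorIndex],
) -> BTreeMap<ValidatorIndex, ChunkIndex> {
    backing_group
        .iter()
        .copied()
        .zip(missing_systematic.iter().copied())
        .collect()
}
```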
The question is whether this matters? Yes indeed, validators could then consistently mess with the systematic recovery of one particular parachain, instead of messing with the systematic recovery of multiple parachains... so what?
Now the only real argument for rotating every block, in my opinion, is again to even out load. In this particular case it would make a difference if some paras always fully fill their blocks, while others keep them mostly empty. But I would argue that we should solve this problem by incentivizing full block utilization and not worry about it here too much, at least not until it manifests as a real problem. In fact, we also have another way of solving this if it ever proves beneficial: we could rotate para id to core id assignments instead.
TL;DR: I like @ordian's idea to just not rotate. There are downsides to this, but having the recovery be self-contained is quite valuable. Let's start simple and only go more complex if it proves necessary?
@burdges am I missing something?
Yes, and we'd like availability-recovery to be possible when having nothing more than the candidate receipt. The problem is that the receipt does not contain any info about the slot or block number.
For the moment, the core index isn't there either, but the plan is to add it.
I see, thank you. We could only use some historical slot, not the slot where the candidate receipt gets backed.
All these options give some control of course, even the slot where it gets backed, so the relay parent hash works better than any slot choice, due to simplicity.
As I understand it, @eskimor suggests our map depend upon only num_cores, num_validators, and core index, so each core has its systematic validators fixed for the session. It avoids bias except through selecting your core. I'd wager it worsens user experience, but only slightly.
We do however rotate backing groups, and backers provide systematic chunks too, hence the "slightly" above. How do we determine the backers from the candidate receipt? Just because they signed the candidate receipt?
It's likely fine either way. We'll have systematic reconstruction anyway if each systematic chunk can be fetched from some backer or from its one availability provider.
Actually, who determines the core index? We had ideas where this occurs after the candidate receipt. We'd enable these if the map depends upon only num_cores, num_validators, relay parent, and paraid.
As @eskimor said, "using the relay parent of the candidate is at the very least not ideal because with async backing, the used relay parent could vary from candidate to candidate in the same block, which means that the equal load distribution might end up not being that equal after all."
yes, that's the suggestion AFAIU.
No, we only use the backers currently during approval-voting, when we have access to the relay block and we can query the session's validator groups from the runtime. For complexity reasons, we don't even verify that validators in the group all backed the candidate. We just assume that to be true (in practice, it's mostly true).
Currently, for the existing lease parachains, there's a fixed assignment between the paraid and the core index. @eskimor you know more here
Let's just go with only the core index, from the discussion I don't see any real problems with that and once we have the core index in the candidate receipt we are golden from the simplicity/robustness perspective.
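For illustration only, one possible shape of a mapping that depends solely on the core index (this is not necessarily the exact formula the RFC will end up with):

```rust
type ValidatorIndex = u32;
type ChunkIndex = u32;
type CoreIndex = u32;

// Shift the identity mapping by the core index, so that different cores get
// different sets of systematic-chunk holders, fixed for the whole session.
fn chunk_index(core: CoreIndex, validator: ValidatorIndex, n_validators: u32) -> ChunkIndex {
    (core + validator) % n_validators
}
```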
sounds good, I'll update the document
this will stay the same in v2 IIUC, but the meaning (and the doc comment) will be different
this will stay the same, indeed. The meaning will change, in the sense that we can no longer expect the ValidatorIndex to be equal to the returned ChunkIndex. I hope I describe this in sufficient detail in the following paragraph
The responder does not even need to do that, if we keep storing chunks per validator index in the av-store.
I think it does. It needs to supply the chunk index to the requester (for verification purposes and because the reconstruction algorithm seems to need it)
I could have sworn that I wrote somewhere that we would need to store the chunk index with the chunk in the av-store for this, of course.
right. yeah, that makes sense now 👍🏻
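For illustration, a hypothetical shape of what the av-store could persist so the responder can return the chunk index alongside the chunk (field and type names are made up, not the actual polkadot types):

```rust
type ChunkIndex = u32;

struct StoredChunk {
    /// The erasure-coded chunk of the available data.
    chunk: Vec<u8>,
    /// The index of this chunk in the erasure coding; the requester needs it
    /// for proof verification and for reconstruction.
    chunk_index: ChunkIndex,
    /// Merkle proof of the chunk against the erasure root (kept opaque here).
    proof: Vec<Vec<u8>>,
}
```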
I thought about this a bit more.
The backing subsystem will issue a request to the av-store to store all chunks. For that, we need to know the core index in order to store ValidatorIndex -> Chunk as you suggest (so that we can compute the mapping).
Since we don't yet have the core index in the receipt, the backing subsystem needs to know the per-relay-parent core_index assignment of the local validator.
From my knowledge, this would be just fine. When doing attestation, the availability_cores runtime API already gets the core_index for us (but we don't store it yet).
The slight caveat is that, when importing a statement, we may also have to call the availability_cores runtime API to see which core our para has been scheduled on. But it's no big deal; we need to have access to the relay parent anyway when doing candidate validation.
@eskimor please weigh in on my analysis. Until elastic scaling, a para id can't be scheduled on multiple cores and the core assignment can only change on a relay-block boundary. And when we get elastic scaling, we'll have the core_index in the receipt anyway. So all good.
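A rough sketch of that flow, with hypothetical helper names (the real code would go through the runtime API subsystem):

```rust
type Hash = [u8; 32];
type ParaId = u32;
type CoreIndex = u32;
type ValidatorIndex = u32;
type ChunkIndex = u32;

// Stand-in for querying `availability_cores` at `relay_parent` and finding the
// core our para is scheduled on (hypothetical helper, not a real API).
fn core_index_for_para(_relay_parent: Hash, _para: ParaId) -> Option<CoreIndex> {
    unimplemented!()
}

// At backing time: look up the core index, then record which chunk index each
// validator is responsible for, using the core-index-based mapping sketched earlier.
fn chunk_assignment_for_backing(
    relay_parent: Hash,
    para: ParaId,
    n_validators: u32,
) -> Option<Vec<(ValidatorIndex, ChunkIndex)>> {
    let core = core_index_for_para(relay_parent, para)?;
    Some((0..n_validators).map(|v| (v, (core + v) % n_validators)).collect())
}
```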
just a heads up that parachains may take a very long time to upgrade their collators' polkadot-sdk branch; many of them are still based on 0.9.x versions
I think collators have to upgrade as well.
Why?
I think they just need to upgrade to the v2 networking protocol. Once that's done, they don't really need to upgrade to use systematic recovery. They'll request regular chunks as before.
That's why I added this bit only to step 2 (which enables systematic recovery). The upgrade for step 1 will still be needed.
Slightly unrelated, but collators don't use recovery on the happy path. It can be needed in case there are malicious collators withholding the data, so it doesn't need to be optimized.
Ok, to me that sounded like an upgrade. In other words: if you already have the requirement that they upgrade their networking protocol, the rest does not sound like an issue anymore.