Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assignment of availability-chunk indices to validators #47

Merged
merged 11 commits into from
Jan 25, 2024
283 changes: 283 additions & 0 deletions text/0047-random-assignment-of-availability-chunks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,283 @@
# RFC-0047: Random assignment of availability chunks to validators

| | |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Start Date** | 03 November 2023 |
| **Description** | An evenly-distributing indirection layer between availability chunks and validators. |
| **Authors** | Alin Dima |

## Summary

Propose a way of randomly permuting the availability chunk indices assigned to validators for a given core and relay
chain block, in the context of
[recovering available data from systematic chunks](https://github.com/paritytech/polkadot-sdk/issues/598), with the
purpose of fairly distributing network bandwidth usage.

## Motivation

Currently, the ValidatorIndex is always identical to the ChunkIndex. Since the validator array is only shuffled once
per session, naively using the ValidatorIndex as the ChunkIndex would pose an unreasonable stress on the first N/3
validators during an entire session, when favouring availability recovery from systematic chunks.

Therefore, the relay chain node needs a deterministic way of evenly distributing the first ~(N_VALIDATORS / 3)
systematic availability chunks to different validators, based on the session, relay chain block and core.
The main purpose is to ensure fair distribution of network bandwidth usage for availability recovery in general and in
particular for systematic chunk holders.

## Stakeholders

Relay chain node core developers.

## Explanation

### Systematic erasure codes

An erasure coding algorithm is considered systematic if it preserves the original unencoded data as part of the
resulting code.
[The implementation of the erasure coding algorithm used for polkadot's availability data](https://github.com/paritytech/reed-solomon-novelpoly) is systematic.
Roughly speaking, the first N_VALIDATORS/3 chunks of data can be cheaply concatenated to retrieve the original data,
without running the resource-intensive and time-consuming reconstruction algorithm.

Here's the concatenation procedure of systematic chunks for polkadot's erasure coding algorithm
(minus error handling, for briefness):
```rust
pub fn reconstruct_from_systematic<T: Decode>(
n_validators: usize,
chunks: Vec<&[u8]>,
) -> T {
let mut threshold = (n_validators - 1) / 3;
if !is_power_of_two(threshold) {
threshold = next_lower_power_of_2(threshold);
}

let shard_len = chunks.iter().next().unwrap().len();

let mut systematic_bytes = Vec::with_capacity(shard_len * kpow2);
alindima marked this conversation as resolved.
Show resolved Hide resolved

for i in (0..shard_len).step_by(2) {
for chunk in chunks.iter().take(kpow2) {
alindima marked this conversation as resolved.
Show resolved Hide resolved
systematic_bytes.push(chunk[i]);
systematic_bytes.push(chunk[i + 1]);
}
}

Decode::decode(&mut &systematic_bytes[..]).map_err(|err| Error::Decode(err))
}
```
alindima marked this conversation as resolved.
Show resolved Hide resolved

In a nutshell, it performs a column-wise concatenation with 2-bit chunks.
alindima marked this conversation as resolved.
Show resolved Hide resolved

### Availability recovery now

According to the [polkadot protocol spec](https://spec.polkadot.network/chapter-anv#sect-candidate-recovery):

> A validator should request chunks by picking peers randomly and must recover at least `f+1` chunks, where
`n=3f+k` and `k in {1,2,3}`.

For parity's polkadot node implementation, the process was further optimised. At this moment, it works differently based
on the estimated size of the available data:

(a) for small PoVs (up to 128 Kib), sequentially try requesting the unencoded data from the backing group, in a random
order. If this fails, fallback to option (b).

(b) for large PoVs (over 128 Kib), launch N parallel requests for the erasure coded chunks (currently, N has an upper
limit of 50), until enough chunks were recovered. Validators are tried in a random order. Then, reconstruct the
original data.

### Availability recovery from systematic chunks

As part of the effort of
[increasing polkadot's resource efficiency, scalability and performance](https://github.com/paritytech/roadmap/issues/26),
work is under way to modify the Availability Recovery protocol by leveraging systematic chunks. See
[this comment](https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099) for preliminary
performance results.

In this scheme, the relay chain node will first attempt to retrieve the ~N/3 systematic chunks from the validators that
should hold them, before falling back to recovering from regular chunks, as before.

### Chunk assignment function

#### Properties

The function that decides the chunk index for a validator should be parameterized by at least
`(validator_index, relay_parent, para_id)`
and have the following properties:
alindima marked this conversation as resolved.
Show resolved Hide resolved
1. deterministic
1. pseudo-random
alindima marked this conversation as resolved.
Show resolved Hide resolved
1. relatively quick to compute and resource-efficient.
1. when considering the other params besides `validator_index` as fixed, the function should describe a random permutation
of the chunk indices
1. considering `relay_parent` as a fixed argument, the validators that map to the first N/3 chunk indices should
have as little overlap as possible for different paras scheduled on that relay parent.

In other words, we want a uniformly distributed, deterministic mapping from `ValidatorIndex` to `ChunkIndex` per block
per scheduled para.

#### Proposed runtime API

The mapping function should be implemented as a runtime API, because:

1. it enables further atomic changes to the shuffling algorithm.
1. it enables alternative client implementations (in other languages) to use it
1. considering how critical it is for parachain consensus that all validators have a common view of the Validator->Chunk
mapping, this mitigates the problem of third-party libraries changing the implementations of the `ChaCha8Rng` or the `rand::shuffle`
that could be introduced in further versions. This would stall parachains if only a portion of validators upgraded the node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should obviously not rely in consensus on something that is not specified to be deterministic, like https://polkadot.network/blog/a-polkadot-postmortem-24-05-2021#the-good. And try to reduce third-party dependencies in general.
I would assume that a library that implements ChaCha8Rng adheres to the spec, otherwise it's a bug.
To mitigate supply-chain issues including bugs, we should probably use cargo-vet or a similar tool, but again, this is out of scope of and not limited to this RFC.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. What we would want is some version/spec identifier. That is only allowed to change at session boundaries. Then we can put a requirements on nodes to implement that spec and once enough clients did, we can do the switch.

While we are at it, this should probably take into account the used erasure coding itself as well. We should be able to swap it out for a better implementation if the need arises.

Copy link
Contributor

@tomaka tomaka Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is off-topic for this RFC, but as a heads up we already use ChaCha20 and shuffle for the validators gossip topology: https://github.com/paritytech/polkadot-sdk/blob/2d09e83d0703ca6bf6aba773e80ea14576887ac7/polkadot/node/network/gossip-support/src/lib.rs#L601-L610

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. What we would want is some version/spec identifier. That is only allowed to change at session boundaries. Then we can put a requirements on nodes to implement that spec and once enough clients did, we can do the switch.

While we are at it, this should probably take into account the used erasure coding itself as well. We should be able to swap it out for a better implementation if the need arises.

Yes. We'll use the new NodeFeatures runtime API for that: paritytech/polkadot-sdk#2177

If we'll make changes in the future to either the shuffling algorithm or the underlying reed-solomon algorithm, we can add a new feature bit there



Pseudocode:

```rust
pub fn get_chunk_index(
n_validators: u32,
validator_index: ValidatorIndex,
relay_parent: Hash,
para_id: ParaId
) -> ChunkIndex {
let threshold = systematic_threshold(n_validators); // Roughly n_validators/3
let seed = derive_seed(relay_parent);
let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? Can we not deterministically arrive at an assignment based on block number and para ids, in a way that perfectly evens out load?

For example: We use block number % n_validators as a starting index. Then we take threshold validators starting from that position to be systemic chunk indices for the first para in that block. The next para starts at the additional offset threshold, so block_number + threshold % n_validators and so on.

We could shuffle validator indices before doing that, but I don't see how this gains us anything.

Now, using information that is not available as part of the candidate receipt was part of the problem we wanted to avoid. What is the "next" para, this information is not necessarily available in disputes. Essentially this means we are using the core number.

But:

  1. It usually is, most validators should have seen a block that got disputed, otherwise the candidates could never have gotten included.
  2. Disputes are not the hot path. They are an exceptional event, that should barely ever happen. It should not be an issue, if disputes are not able to use systemic chunks always or even ever.

There are other cases, e.g. collators recovering availability because the block author is censoring. But also those should be exceptional. If systemic chunk recovery would not be possible here, it would not be a huge problem either. On top of the fact that this should also not be a common case, the recovery here is also not done by validators - so worse recovery performance would not be an issue to the network.

Summary:

Systemic chunk recovery should only be important/relevant in approval voting, where we have to recover a whole lot of data every single relay chain block.
Therefore it would be good, if systemic chunks worked always, but I would consider it totally acceptable if it were an optimization that would only be supported in approval voting.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid paraid here imho, not overly keen on relay parent either.

Instead use era/session, slot, and chain spec to define the validator sequence, and then core index to define the start position in the validator sequence.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validators are already randomized at the beginning of the session. Core index for start position makes sense. What is wrong with using the block number in addition?

Reason, I would like to have it dependent on the block (could also be slot, I just don't see the benefit) is that by having a start position by core index, we ensure equal distribution of systemic chunks across a block, but paras are not all equal. Some could be heavier than others, hence it would be beneficial to change the mapping each block.

In my opinion we could also use the hash instead of the block number, here I think anything is likely better than static.

Copy link

@burdges burdges Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want slot or block I think. We're not overly adversarial here, but you can manipulate block numbers easily, while slot numbers represent an opportunity for doing something, so you always pay 19 DOTs or more to chose another one. I'd say slot not block.

We randomize the validator list based upon the randomness two epochs/sessions ago? Cool. Any idea if we similarly randomize the map from paraids to cores per era/session too? If yes, then maybe we could just progress sequentially through them?

let k = num_validators / num_cores;
let fist_validator_index_for_core = ((core_id - slot) * k % num_validators) as u32;

We'd prefer the randomization of core_id for this because otherwise you could still make hot spots. We could've hot spots even with this scheme, but not so easy to make them. We'd avoid those if we randomize per slot.

Also this exactly computation does not work due to signed vs unsigned, and the fact that it suggests things progress backwards as time progresses time, which again tried to avoid hot spots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea if we similarly randomize the map from paraids to cores per era/session too?

AFAICT from the scheduler code, we don't (at least for the parachain auction model; for on-demand, I see more complicated logic which takes into account core affinities).

We're not overly adversarial here, but you can manipulate block numbers easily, while slot numbers represent an opportunity for doing something, so you always pay 19 DOTs or more to chose another one. I'd say slot not block.

I'm having a hard time understanding the advantage of slot number vs block number. This may be simply because I don't know that much about slots. AFAICT, slots are equally useful as block number for the mapping function (monotonically increasing by 1), except that there may be slots that go unoccupied (if the chain is stalled for example) and are therefore skipped. Is that correct?

so you always pay 19 DOTs or more to chose another one

what is this fee? where can I read more about this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burdges I don't get your arguments. How can block numbers be manipulated? They are always increasing by 1, you clearly must be talking about forks. So an adversary could create multiple forks, with all the same load distribution and tries to overload validators this way? If we create forks or reversions we already have a performance problem anyway.

Really not getting how slot numbers are better here. I also don't get your argument about hot spots, my proposal above was precisely to avoid hot spots (by not using randomness). What do you mean by hot spots and how would randomness help here?

A single validator not providing its systemic chunk would be enough to break systemic recovery. I don't see how randomization schemes help here. If we were smart we could somehow track the validators withholding systemic chunks and then make an assignment where they are all pooled together into one candidate. This way, at least only one systemic recovery fails. (We could equally well just remove them entirely.)

But honestly, I would not worry too much about this here. If we ever found that validators try to mess with this on purpose, the threat is low enough that social (calling them out) and governance would be adequate measures to deal with them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually the ideal would be to use the relay parent hash directly. Then we don't even need to be able to lookup the block number to determine systemic chunk indices. We will eventually have the core index in the candidate receipt - it would be really good to have this self contained at least once we have that.

Obviously hashes can be influenced trivially by block authors ... but is this a real issue? Assuming we have enough cores, load will be pretty much evenly distributed among validators no matter the hash. The only thing that changes is to which core one gets assigned. I don't mind too much if this can be influenced ... Are there any real concerns here?*)

*) As the system matures, I would assume (especially with CoreJam or similar developments) that candidates will be pretty evened out in load (maxed out). So a validator should not gain much by picking which parachain it wants to have a systemic chunk for.

Copy link
Contributor Author

@alindima alindima Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single validator not providing its systemic chunk would be enough to break systemic recovery

Slightly offtopic: In my WIP PR I added functionality to request up to 5 systematic chunks from the backing group as a backup solution, so that a couple of validators not returning their systematic chunks would not invalidate the entire procedure

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could permit requesting them all from the backing group, but make the availability providers the first choice, meaning maybe: 1st, try all systemic chunk providers. 2nd, try remaining systemic chunks from backers. 3rd, fetch random non-systemic chunks. The concern is just that we overload the backers.

It'll also make rewards somewhat more fragile, but likely worth the difficulty for the performance.

let mut chunk_indices: Vec<ChunkIndex> = (0..n_validators).map(Into::into).collect();
chunk_indices.shuffle(&mut rng);

let seed = derive_seed(hash(para_id));
let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);
let para_start_pos = rng.gen_range(0..n_validators);

chunk_indices[(para_start_pos + validator_index) % n_validators]
}
```

Additionally, so that client code is able to efficiently get the mapping from the runtime, another API will be added
for retrieving chunk indices in bulk for all validators at a given block and core:

```rust
pub fn get_chunk_indices(
n_validators: u32,
relay_parent: Hash,
para_id: ParaId
) -> Vec<ChunkIndex> {
let threshold = systematic_threshold(n_validators); // Roughly n_validators/3
let seed = derive_seed(relay_parent);
let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);
let mut chunk_indices: Vec<ChunkIndex> = (0..n_validators).map(Into::into).collect();
chunk_indices.shuffle(&mut rng);

let seed = derive_seed(hash(para_id));
let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);

let para_start_pos = rng.gen_range(0..n_validators);

chunk_indices
.into_iter()
.cycle()
.skip(para_start_pos)
.take(n_validators)
.collect()
}
```

#### Upgrade path

Considering that the Validator->Chunk mapping is critical to para consensus, the change needs to be enacted atomically
via governance, only after all validators have upgraded the node to a version that is aware of this mapping.
It needs to be explicitly stated that after the runtime upgrade and governance enactment, validators that run older
client versions that don't support this mapping will not be able to participate in parachain consensus.
alindima marked this conversation as resolved.
Show resolved Hide resolved

Additionally, an error will be logged when starting a validator with an older version, after the runtime was upgraded and the feature enabled.

## Drawbacks

- In terms of guaranteeing even load distribution, a simpler function that chooses the per-core start position in the
shuffle as `threshold * core_index` would likely perform better, but considering that the core_index is not part of the
CandidateReceipt, the implementation would be too complicated. More details in [Appendix A](#appendix-a).
- Considering future protocol changes that aim to generalise the work polkadot is doing (like CoreJam), `ParaId`s may be
removed from the protocol, in favour of more generic primitives. In that case, `ParaId`s in the availability recovery
process should be replaced with a similar identifier. It's important to note that the implementation is greatly simplified
if this identifier is part of the `CandidateReceipt` or the future analogous data structure.
- It's a breaking change that requires most validators to be upgrade their node version.
alindima marked this conversation as resolved.
Show resolved Hide resolved

## Testing, Security, and Privacy

Extensive testing will be conducted - both automated and manual.
This proposal doesn't affect security or privacy.

## Performance, Ergonomics, and Compatibility

### Performance

This is a necessary data availability optimisation, as reed-solomon erasure coding has proven to be a top consumer of
CPU time in polkadot as we scale up the parachain block size and number of availability cores.

With this optimisation, preliminary performance results show that CPU time used for reed-solomon coding can be halved
and total POV recovery time decrease by 80% for large POVs. See more
[here](https://github.com/paritytech/polkadot-sdk/issues/598#issuecomment-1792007099).

### Ergonomics

Not applicable.

### Compatibility

This is a breaking change. See [upgrade path](#upgrade-path) section above.
All validators need to have upgraded their node versions before the feature will be enabled via a runtime upgrade and
governance call.

## Prior Art and References

See comments on the [tracking issue](https://github.com/paritytech/polkadot-sdk/issues/598) and the
[in-progress PR](https://github.com/paritytech/polkadot-sdk/pull/1644)

## Unresolved Questions

- Is it the best option to embed the mapping function in the runtime?
- Is there a better upgrade path that would preserve backwards compatibility?
- Is usage of `ParaId` the best choice for spreading out the network load during systematic chunk recovery within the
same block?

## Future Directions and Related Material

This enables future optimisations for the performance of availability recovery, such as retrieving batched systematic
chunks from backers/approval-checkers.

## Appendix A

This appendix explores alternatives to using the `ParaId` as the factor by which availability chunk indices are
distributed to validators within the same relay chain block, and why they weren't chosen.

### Core index

Here, `core_index` refers to the index of the core that a candidate was occupying while it was pending availability
(from backing to inclusion).

Availability-recovery can currently be triggered by the following phases in the polkadot protocol:
1. During the approval voting process.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as recovery is possible in disputes, it should be fine if it can not always use systemic recovery. What is missing for me, is to understand why this needs to be an incompatible change. Shouldn't recovery be possible always, even if you don't know what the systemic chunks are?

Copy link
Contributor

@ordian ordian Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, when you request a chunk, you need to know which chunk to request. If you specify a wrong chunk, you'll get an empty response. So everyone should be onboard how chunks are shuffled in order for recovery to work, not just from systematic chunks. Unless we rely on backers to have all the chunks, which we can't in case of a dispute.

We could add an additional request type to request any chunk you have. That could probably be a reasonable fallback in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an additional request type to request any chunk you have. That could probably be a reasonable fallback in this case.

I don't think this achieves the purpose of availability-recovery working with an arbitrary mix of upgraded & non-upgraded validators. Adding a new request type (or even changing the meaning of the existing one) would also mean a node upgrade (because responding to the new type of request is only useful for the nodes that haven't yet upgraded to use the new mapping). So validators might as well just upgrade to the version that uses the new shuffle. I'm not sure this has any benefit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant as a fallback in case we want to use core_ids which might be not readily available in some cases of disputes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant as a fallback in case we want to use core_ids which might be not readily available in some cases of disputes.

I see. In this case, it would be a reasonable fallback. But this would mean giving up systematic recovery in those cases. I'm wondering if it's worth doing all of this just to replace the para-id thingy with the core_id

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh fine. We'll loose the mappings sometime before the data expires? We should probably avoid doing that. We can obviously ask everyone and account for the DLs differently, but I'd prefer to be more like bittorrent trackers here, so that parachains can fetch data this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll loose the mappings sometime before the data expires?

It can happen if a validator imports disputes statements for a dispute ongoing on some fork that the node has not imported yet.
Or even if a validator was just launched and has no data recorded about what core the block was occupying while pending availability.

@ordian or @eskimor correct me if I'm wrong

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. All validators will know the correct mapping, if they have seen (and still know) the relay parent. There is a good chance that this is the case even in disputes, but it is not guaranteed and recovery must work even in that case (but can be slower).

Using the core index should be fine, what makes that even more acceptable to me is that we actually want (and need to) include the core index in the candidate receipt at some point. So the moment we have this change, then the block availability will cease to matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the core index should be fine, what makes that even more acceptable to me is that we actually want (and need to) include the core index in the candidate receipt at some point.

That's needed for CoreJam I would assume?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elastic scaling also.

1. By other collators of the same parachain.
1. During disputes.

Getting the right core index for a candidate is troublesome. Here's a breakdown of how different parts of the
node implementation can get access to it:

1. The approval-voting process for a candidate begins after observing that the candidate was included. Therefore, the
node has easy access to the block where the candidate got included (and also the core that it occupied).
1. The `pov_recovery` task of the collators starts availability recovery in response to noticing a candidate getting
backed, which enables easy access to the core index the candidate started occupying.
1. Disputes may be initiated on a number of occasions:

3.a. is initiated by the validator as a result of finding an invalid candidate while participating in the
approval-voting protocol. In this case, availability-recovery is not needed, since the validator already issued their
vote.

3.b is initiated by the validator noticing dispute votes recorded on-chain. In this case, we can safely
assume that the backing event for that candidate has been recorded and kept in memory.

3.c is initiated as a result of getting a dispute statement from another validator. It is possible that the dispute
is happening on a fork that was not yet imported by this validator, so the subsystem may not have seen this candidate
being backed.

A naive attempt of solving 3.c would be to add a new version for the disputes request-response networking protocol.
Blindly passing the core index in the network payload would not work, since there is no way of validating that
the reported core_index was indeed the one occupied by the candidate at the respective relay parent.

Another attempt could be to include in the message the relay block hash where the candidate was included.
This information would be used in order to query the runtime API and retrieve the core index that the candidate was
occupying. However, considering it's part of an unimported fork, the validator cannot call a runtime API on that block.