
backing: Back-off from backing on approval checking lag #2908

Closed

Conversation

@alexggh (Contributor) commented Jan 10, 2024

If approval checking falls behind a certain threshold, it means that the network could not process the assignments and approvals needed to approve the block fast enough, so we need to back off from creating new work and give the approvals subsystems the opportunity to catch up.

Continuously creating new work is not a good idea because of the way the approvals subsystems work: if the system is slow to process the assignments and approvals for the current block, either because we are behind on work from previous blocks or because the network is slow, validators will simply trigger new tranches, which in turn causes more delays, creating the conditions for the system to never catch up and to fall further behind.

Hence, we need a mechanism that ensures that instead of falling further and further behind, the system can automatically catch up and return to working in optimal conditions. This PR achieves that by abstaining from backing new candidates while the node is behind on approvals beyond a certain threshold.
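
For illustration, a minimal sketch of the thresholding idea, with hypothetical names for the threshold constant and the lag measurement (this is not the actual subsystem code):

```rust
/// Hypothetical threshold: how many unapproved relay-chain blocks we
/// tolerate before backing off from creating new work.
const MAX_APPROVAL_CHECKING_LAG: u32 = 16;

/// Decide whether this validator should second new candidates, given the
/// current approval checking lag in blocks.
fn should_back_candidates(approval_checking_lag: u32) -> bool {
    // Abstain from backing while the approvals pipeline is behind the
    // threshold; resume automatically once it catches up.
    approval_checking_lag <= MAX_APPROVAL_CHECKING_LAG
}
```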

TODO:

[ ] Is approval checking lag the right signal for deciding to back off on backing?
[ ] Test behaviour in real-world conditions

@sandreim (Contributor) left a comment


With this change we are still paying the costs of executing candidates in backing and dealing with statement gossip even though we don't put them on chain. Even if there is some parameter and selection of backed candidates, this approach only considers the view of a single node.

Ideally, if we include finality proofs on chain then we can let the runtime do the job of selecting which candidates to back (for example always back system parachains).

As a smaller step, I would recommend skipping backing on the core the node is assigned to when the approval checking lag is too high. We are experimenting with decreasing the backing group size. With a backing group of 3 we can reasonably expect this form of back-pressure to be effective and related to the actual network performance/load, but it still falls short of being able to back only specific candidates.

@Overkillus (Contributor)

Undeniably there is a need for a back-off mechanism. And if there is going to be a back-off mechanism, it should definitely limit the entry point for new candidates, which is backing. I think pretty much everyone agrees on those two points.

As Sandreim points out:

this approach only considers the view of a single node.

This is also a problem because it effectively makes it an opt-in back-off scheme: malicious nodes can simply continue backing. This becomes an issue in cases where all honest nodes back off due to a large approval lag / big approval queue while malicious nodes continue backing, deliberately prolonging the backed-off state because it allows them to monopolize backing.

Fundamentally this issue is very similar to the discussions around finality lag and block-authoring back-off, and I have very similar fears.

I'm definitely not saying that finality proofs are the only way, but they are the nicest and cleanest way. They allow for easy audits and hopefully fewer mistakes on weird edge cases, as well as no concerns around balancing opt-in systems that can be abused.

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4938609

@alexggh (Contributor, Author) commented Jan 17, 2024

Try 2

Changes:

  • Updated the implementation with @sandreim's suggestion: the backing subsystem backs off from backing the candidate on the core it is assigned to if approval-distribution has informed it that it is overloaded.
  • Modified the signal we use to decide whether we should back off: instead of using the ApprovalCheckingLag, ApprovalDistribution now notifies the backing subsystem when it has a high backlog of unprocessed messages. I think this is a better signal than ApprovalCheckingLag, because ApprovalCheckingLag can also grow in the presence of no-shows, but if we are able to process the messages we receive there is no reason to worry and back off from backing (see the sketch after this list).
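
For illustration, a minimal sketch of this signal path; the message and field names are hypothetical, and the real subsystems exchange messages via the overseer, so the actual types in this PR may differ:

```rust
/// Hypothetical notification from approval-distribution to backing.
enum ApprovalDistributionSignal {
    /// Whether the queue of unprocessed assignments/approvals is above a
    /// high-water mark.
    Overloaded(bool),
}

/// Hypothetical slice of the backing subsystem's state.
struct BackingState {
    /// While true, this validator abstains from seconding on its assigned core.
    approvals_overloaded: bool,
}

impl BackingState {
    /// Record the latest overload notification from approval-distribution.
    fn handle_signal(&mut self, signal: ApprovalDistributionSignal) {
        match signal {
            ApprovalDistributionSignal::Overloaded(overloaded) => {
                self.approvals_overloaded = overloaded;
            }
        }
    }

    /// Seconding is skipped while overloaded; issuing `Valid` statements for
    /// candidates seconded by others is unaffected.
    fn can_second(&self) -> bool {
        !self.approvals_overloaded
    }
}
```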

Addressing concerns:

Use on-chain finality proofs.

Agreed that having finality proofs on-chain would allow us to implement this in the cleanest and safest way, where we can prove that all nodes respected the behaviour. However, that is a longer-term solution that would take significant effort and resources to implement; in the meantime, I think this short-term solution gives us enough benefits without making the situation worse.

What if malicious nodes don't comply?

Would the mechanism achieve its purpose?

Our network assumes 2/3 honest validators, so I would say that should be enough to allow nodes to catch up on work; we have bigger problems if we can't handle approving work backed by only 1/3 of the nodes.

Can malicious nodes control/lengthen this state somehow?

approval-distribution has spam-prevention mechanisms that keep nodes from being overwhelmed with bogus messages, so this state would be triggered only if approval-distribution gets overwhelmed with valid, expected messages. And since 2/3 of validators would back off from backing their allocated parachains, we should actually catch up.

What will non-compliant nodes gain from this?

Not much, since this just prevents honest validators from seconding; they would still be allowed to issue Valid statements once a candidate has been Seconded (possibly by malicious nodes). So there is no change to the mechanism for checking the integrity of a candidate, nor to the era-points rewards, since Seconded and Valid statements are rewarded the same.

How does this affect the liveness of System Parachains?

Since System Parachains are doing important work on behalf of the network, this PR proposes that we make an exception for System Parachains and continue seconding them.
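
For illustration, a minimal sketch of the exemption check; it assumes the Polkadot convention that para IDs below 2000 are reserved for system chains (the boundary is called `LOWEST_PUBLIC_ID` in the parachain primitives), and the function name is hypothetical:

```rust
/// Para IDs below this value are reserved for system parachains
/// (the `LOWEST_PUBLIC_ID` convention from the parachain primitives).
const LOWEST_PUBLIC_ID: u32 = 2000;

/// Second a candidate unless we are backed off; system parachains stay
/// exempt and are seconded regardless of the back-pressure state.
fn should_second(para_id: u32, approvals_overloaded: bool) -> bool {
    let is_system_parachain = para_id < LOWEST_PUBLIC_ID;
    !approvals_overloaded || is_system_parachain
}
```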

@Overkillus @sandreim @eskimor Let me know what you think.

@alexggh marked this pull request as ready for review January 17, 2024 10:40
@eskimor (Member) commented Feb 13, 2024

As mentioned, this is only an interim solution, which should be improved once we have finality proofs on chain. Hence, I would love comments in the code referencing the corresponding ticket, and a hint in the ticket that this code should be replaced once we have proofs: for future readers it should be 100% clear when this code becomes obsolete, and anyone implementing on-chain proofs should know that this code can then be removed/replaced.

@bkchr (Member) commented Jul 17, 2024

@alexggh do we still want this?

@alexggh (Contributor, Author) commented Jul 17, 2024

@alexggh do we still want this?

No, it is deprecated in favor of: #4632

@alexggh closed this Jul 17, 2024