
backing: Back-off from backing on approval checking lag #2908

Closed

Conversation

@alexggh (Contributor) commented Jan 10, 2024

If approval checking falls behind a certain threshold, it means that the network could not process the assignments and approvals needed to approve the block fast enough, so we need to back off from creating new work and give the approvals subsystems the opportunity to catch up.

Continuously creating new work is not a good idea because of the way the approvals subsystems work: if the system is slow to process the assignments and approvals for the current block, either because we are behind on work from previous blocks or because the network is slow, validators will simply trigger new tranches, which in turn causes more delays, creating the conditions for the system to never catch up and to fall further behind.

Hence, we need a mechanism that ensures that instead of falling further and further behind, the system can automatically catch up and return to working in optimal conditions. This PR achieves that by abstaining from backing new candidates while the node is behind on approvals beyond a certain threshold.
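
For illustration, a minimal sketch of the thresholding idea, with hypothetical names for the threshold constant and the lag measurement (this is not the actual subsystem code):

```rust
/// Hypothetical threshold: how many unapproved relay-chain blocks we
/// tolerate before backing off from creating new work.
const MAX_APPROVAL_CHECKING_LAG: u32 = 16;

/// Decide whether this validator should second new candidates, given the
/// current approval checking lag in blocks.
fn should_back_candidates(approval_checking_lag: u32) -> bool {
    // Abstain from backing while the approvals pipeline is behind the
    // threshold; resume automatically once it catches up.
    approval_checking_lag <= MAX_APPROVAL_CHECKING_LAG
}
```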

TODO:

[ ] Is approval checking lag the right signal for deciding to back off on backing?
[ ] Test behaviour in real-world conditions

@sandreim (Contributor) left a comment


With this change we are still paying the costs of executing candidates in backing and dealing with statement gossip even though we don't put them on chain. Even if there is some parameter and selection of backed candidates, this approach only considers the view of a single node.

Ideally, if we include finality proofs on chain then we can let the runtime do the job of selecting which candidates to back (for example always back system parachains).

As a smaller step, I would recommend skipping backing on the core the node is assigned to when the approval checking lag is too high. We are experimenting with decreasing the backing group size. With a backing group of 3 we can reasonably expect this form of back-pressure to be effective and related to the actual network performance/load, but it still falls short of being able to back only specific candidates.

@Overkillus (Contributor)

Undeniably there is a need for a back-off mechanism. And if there is going to be a back-off mechanism, it should definitely limit the entry point for new candidates, which is backing. I think pretty much everyone agrees on those two points.

As Sandreim points out:

this approach only considers the view of a single node.

This is also a problem because it effectively makes it an opt-in back-off scheme: malicious nodes can simply continue backing. This becomes an issue in cases where all honest nodes back off due to a large approval lag / big approval queue while malicious nodes continue backing, deliberately prolonging the backed-off state because it allows them to monopolize backing.

Fundamentally this issue is very similar to the discussions around finality lag and block-authoring back-off, and I have very similar fears.

I'm definitely not saying that finality proofs are the only way, but they are the nicest and cleanest way. They allow for easy audits and hopefully fewer mistakes on weird edge cases, as well as no concerns around balancing opt-in systems that can be abused.

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4938609

@alexggh (Contributor, Author) commented Jan 17, 2024

Try 2

Changes:

  • Updated the implementation with @sandreim's suggestion: the backing subsystem backs off from backing the candidate on the core it is assigned to if approval-distribution has informed it that it is overloaded.
  • Modified the signal we use to decide whether we should back off: instead of using the ApprovalCheckingLag, ApprovalDistribution now notifies the backing subsystem when it has a high backlog of unprocessed messages. I think this is a better signal than ApprovalCheckingLag, because ApprovalCheckingLag can also grow in the presence of no-shows, but if we are able to process the messages we receive there is no reason to worry and back off from backing (see the sketch after this list).
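
For illustration, a minimal sketch of this signal path; the message and field names are hypothetical, and the real subsystems exchange messages via the overseer, so the actual types in this PR may differ:

```rust
/// Hypothetical notification from approval-distribution to backing.
enum ApprovalDistributionSignal {
    /// Whether the queue of unprocessed assignments/approvals is above a
    /// high-water mark.
    Overloaded(bool),
}

/// Hypothetical slice of the backing subsystem's state.
struct BackingState {
    /// While true, this validator abstains from seconding on its assigned core.
    approvals_overloaded: bool,
}

impl BackingState {
    /// Record the latest overload notification from approval-distribution.
    fn handle_signal(&mut self, signal: ApprovalDistributionSignal) {
        match signal {
            ApprovalDistributionSignal::Overloaded(overloaded) => {
                self.approvals_overloaded = overloaded;
            }
        }
    }

    /// Seconding is skipped while overloaded; issuing `Valid` statements for
    /// candidates seconded by others is unaffected.
    fn can_second(&self) -> bool {
        !self.approvals_overloaded
    }
}
```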

Addressing concerns:

Use on-chain finality proofs.

Agreed that having finality proofs on-chain would allow us to implement this in the cleanest and safest way, where we can prove that all nodes respected the behaviour. However, that is a longer-term solution that would take significant effort and resources to implement; in the meantime, I think this short-term solution gives us enough benefits without making the situation worse.

What if malicious nodes don't comply?

Would the mechanism achieve its purpose?

Our network assumes 2/3 honest validators, so I would say that should be enough to allow nodes to catch up on work; we have bigger problems if we can't handle approving work backed by only 1/3 of the nodes.

Can malicious nodes control/lengthen this state somehow?

approval-distribution has spam-prevention mechanisms that keep nodes from being overwhelmed with bogus messages, so this state would be triggered only if approval-distribution gets overwhelmed with valid, expected messages. And since 2/3 of validators would back off from backing their allocated parachains, we should actually catch up.

What will non-compliant nodes gain from this?

Not much, since this just prevents honest validators from seconding; they would still be allowed to issue Valid statements once a candidate has been Seconded (possibly by malicious nodes). So there is no change to the mechanism for checking the integrity of a candidate, nor to the era-points rewards, since Seconded and Valid statements are rewarded the same.

How does this affect the liveness of System Parachains?

Since System Parachains are doing important work on behalf of the network, this PR proposes that we make an exception for System Parachains and continue seconding them.
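
For illustration, a minimal sketch of the exemption check; it assumes the Polkadot convention that para IDs below 2000 are reserved for system chains (the boundary is called `LOWEST_PUBLIC_ID` in the parachain primitives), and the function name is hypothetical:

```rust
/// Para IDs below this value are reserved for system parachains
/// (the `LOWEST_PUBLIC_ID` convention from the parachain primitives).
const LOWEST_PUBLIC_ID: u32 = 2000;

/// Second a candidate unless we are backed off; system parachains stay
/// exempt and are seconded regardless of the back-pressure state.
fn should_second(para_id: u32, approvals_overloaded: bool) -> bool {
    let is_system_parachain = para_id < LOWEST_PUBLIC_ID;
    !approvals_overloaded || is_system_parachain
}
```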

@Overkillus @sandreim @eskimor Let me know what you think.

@alexggh marked this pull request as ready for review January 17, 2024 10:40
@eskimor (Member) commented Feb 13, 2024

As mentioned, this is only an interim solution, which should be improved once we have finality proofs on chain. Hence, I would love comments in the code referencing the corresponding ticket, and a hint in the ticket that this code should be replaced once we have proofs: for future readers it should be 100% clear when this code becomes obsolete, and anyone implementing on-chain proofs should know that this code can then be removed/replaced.

@bkchr (Member) commented Jul 17, 2024

@alexggh do we still want this?

@alexggh (Contributor, Author) commented Jul 17, 2024

@alexggh do we still want this?

No, it is deprecated in favor of: #4632

@alexggh closed this Jul 17, 2024