Pull gossip queues for better throughput #5195
Conversation
Performance Report: ✔️ no performance regression detected. Full benchmark results
Deployed to:
- Queues are never full, message drop rate is 0, and job wait time is almost 0
- But mesh peers is now at the minimum and we don't have peers on most subnets
- We receive way fewer messages on the beacon_attestation topic
- Peer quality is much different: we have mostly nodes with no validators connected

EDIT: From Tuyen:
Peers are not pruned because:
This is the same issue as #5198. The reason it happens more on this branch may be that the gossipsub score is so great (point 3 above; I compared gossipsub scores in feat3 and the other groups).
Force-pushed from 9a5b31b to 99cb497
packages/api/package.json (outdated):

```diff
@@ -11,7 +11,7 @@
   "bugs": {
     "url": "https://github.com/ChainSafe/lodestar/issues"
   },
-  "version": "1.4.3",
+  "version": "1.5.0",
```
I wonder why it shows the diff here while unstable really has the version as 1.5.0; should not be an issue anyway.
Rebased on unstable and fixed the bad diff
This is awesome work, I can see good metrics like "Job Wait Time" or "Dropped jobs %". However I also notice this pushes some pressure onto the I/O lag issue in #4002:

@dapplion is it possible to introduce some flags to mitigate the issue?
Force-pushed from 0e803d0 to d04463a
If anything, this PR is allowing nodes to process all the network objects they are required to process per the spec. "The issue" is actually correct behavior, where current stable drops too many messages. To mitigate "the issue" we would have to force the node into processing fewer messages than it can, sort of re-introducing a forced throughput limit. I'm softly against introducing such flags, and would instead spend our energy understanding what's taking CPU time. If we were to add flags, we could limit the max concurrency per topic.
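A per-topic concurrency cap, as floated at the end of the comment above, could be sketched like this. `TopicConcurrencyLimiter` is a hypothetical name for illustration, not actual Lodestar code:

```typescript
// Hypothetical sketch: cap the number of concurrently running jobs per
// gossip topic, so one noisy topic cannot monopolize the processor.
type Topic = string;

class TopicConcurrencyLimiter {
  private active = new Map<Topic, number>();

  constructor(private readonly maxPerTopic: number) {}

  /** Returns true (and records the slot) if a new job for `topic` may start now. */
  tryAcquire(topic: Topic): boolean {
    const n = this.active.get(topic) ?? 0;
    if (n >= this.maxPerTopic) return false;
    this.active.set(topic, n + 1);
    return true;
  }

  /** Must be called when a job for `topic` finishes. */
  release(topic: Topic): void {
    const n = this.active.get(topic) ?? 0;
    this.active.set(topic, Math.max(0, n - 1));
  }
}
```

A flag like the `maxGossipTopicConcurrency` option seen later in this diff could feed the constructor argument.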
0301_gossip-queue-pull_1k.cpuprofile.zip
I'm attaching the profile; there is no special thing in it, except for the following: in the current implementation we yield to the macro queue every 50ms through
That should not be necessary in theory; I added metrics and each gossip packet tends to trigger work on itself. From looking at the profile I don't see issues related to this code specifically. Since everything is run synchronously now, it makes sense that it shows up in the call stacks of handleReceivedMessages and executeWork, but their self time is small.
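As a simplified illustration of the "yield to the macro queue every 50ms" behavior discussed above (the constant name and function shape are assumptions, not the actual Lodestar code):

```typescript
// Sketch: drain a job queue synchronously, but yield to the macrotask
// queue after each ~50ms slice so I/O callbacks are not starved.
const YIELD_EVERY_MS = 50; // assumed constant mirroring the 50ms mentioned above

async function drainQueue(jobs: Array<() => void>): Promise<void> {
  let sliceStart = Date.now();
  for (const job of jobs) {
    job(); // each job runs synchronously
    if (Date.now() - sliceStart >= YIELD_EVERY_MS) {
      // setTimeout(…, 0) schedules a macrotask, letting pending I/O run
      await new Promise((resolve) => setTimeout(resolve, 0));
      sliceStart = Date.now();
    }
  }
}
```

The trade-off debated in the thread is whether this yield is needed at all once work is pulled on demand rather than pushed.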
```ts
  gossipHandlers?: GossipHandlers;
};

export class NetworkWorker {
```
Might be worth adding a tsdoc comment here describing the purpose, or mentioning that this isn't related to thread workers.
Yeah, the naming sucks. I think this specific class should be merged into the NetworkProcessor.
```ts
  maxGossipTopicConcurrency?: number;
};

const executeGossipWorkOrderObj: Record<GossipType, true> = {
```
Might be worth a comment here mentioning that this key order defines the order in which gossip topics will be processed.
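To illustrate the point of that review comment: a `Record`'s key declaration order can double as processing priority. The topic names below are assumptions drawn from this thread, and this is a sketch rather than the actual Lodestar object:

```typescript
// Assumed subset of gossip topic names mentioned in this thread.
enum GossipType {
  beacon_block = "beacon_block",
  beacon_aggregate_and_proof = "beacon_aggregate_and_proof",
  beacon_attestation = "beacon_attestation",
}

// The key declaration order below defines the order in which gossip
// topics are drained: blocks first, attestations last.
const executeGossipWorkOrderObj: Record<GossipType, true> = {
  [GossipType.beacon_block]: true,
  [GossipType.beacon_aggregate_and_proof]: true,
  [GossipType.beacon_attestation]: true,
};

// Object.keys preserves string-key insertion order, so this array is
// the effective processing priority list.
const executeGossipWorkOrder = Object.keys(executeGossipWorkOrderObj) as GossipType[];
```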
```ts
};

export type NetworkProcessorOpts = GossipHandlerOpts & {
  maxGossipTopicConcurrency?: number;
```
Would be nice to have a tsdoc here.
Looks really good. Can you follow up with a PR to add all the new metrics to the dashboard?
It seems we'll have to tune things based on what we see there.
This reverts commit f03f911.
🎉 This PR is included in v1.6.0 🎉
🎉 This PR is included in v1.7.0 🎉
🎉 This PR is included in v1.8.0 🎉
Motivation
Lodestar's gossip queues are too restrictive to handle big volumes of objects, such as when subscribed to all subnets.
Our existing queues throttle by limiting concurrency, irrespective of the downstream worker's capacity. They have the authority to push jobs, and they do so independently of each other.
Description
This PR updates the queue design to:
Marking as draft until testing shows all good
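The pull-based design motivated above could be sketched roughly as follows. `PullQueue` and `PullWorker` are hypothetical names for illustration; the real implementation is more involved:

```typescript
// Sketch of the pull model: the worker asks queues for work only when
// it has spare capacity, instead of queues pushing jobs independently.
type Job = () => void;

class PullQueue {
  private jobs: Job[] = [];

  push(job: Job): void {
    this.jobs.push(job);
  }

  /** The worker pulls up to `n` jobs when it decides it has capacity. */
  pull(n: number): Job[] {
    return this.jobs.splice(0, n);
  }
}

class PullWorker {
  constructor(
    private readonly queues: PullQueue[], // ordered by topic priority
    private readonly capacity: number
  ) {}

  /** Drain queues in priority order, bounded by available capacity. Returns jobs run. */
  tick(): number {
    let executed = 0;
    for (const q of this.queues) {
      for (const job of q.pull(this.capacity - executed)) {
        job();
        executed++;
      }
      if (executed >= this.capacity) break;
    }
    return executed;
  }
}
```

Because the worker decides when and how much to pull, queues can never overwhelm it, which is why drop rate and job wait time collapse in the test results above.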