Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heartbeat performance issue #256

Closed
twoeths opened this issue May 27, 2022 · 12 comments
Closed

heartbeat performance issue #256

twoeths opened this issue May 27, 2022 · 12 comments
Assignees

Comments

@twoeths
Copy link
Contributor

twoeths commented May 27, 2022

Motivation

In a contabo node of 50 peers, heartbeat function may take up to 800ms - 1s which block the event loop

Screen Shot 2022-05-27 at 16 59 20

part of ChainSafe/lodestar#4002

Description

  • Write performance test for heartbeat() function
  • Improve heartbeat() performance
  • If it takes too much time, yield to event loop every x milli seconds
@dapplion
Copy link
Contributor

@tuyennhv The resolution of the buckets in gossipsub_heartbeat_duration_seconds_bucket is really poor: [0.01, 0.1, 1]. So the quantile function can show widely inaccurate results. We should improve the bucket values and check this metric again. Since heartbeats are about 1s and we care about not looping around better bucket values would be

[0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1]

As gossipsub_heartbeat_duration_seconds_bucket doesn't have any labels and is called infrequently, more buckets are cheap

@dapplion
Copy link
Contributor

Screenshot from 2022-05-28 14-12-48

This chart is also a bit worrying: heartbeat should not be skipped, and should run 1/0.7 = 1.42 times / sec on average. Smooth chart below

Screenshot from 2022-05-28 14-14-43

@dapplion
Copy link
Contributor

The only hard data we have now is how many heartbeats take > 100ms to complete and how many take > 1000ms to complete. However you can see that many take > 100ms but almost 0 take > 1000 ms. But we don't really now the distribution between 100 and 1000 at all 🤷

Screenshot from 2022-05-28 14-26-19
Screenshot from 2022-05-28 14-26-36

@twoeths
Copy link
Contributor Author

twoeths commented Jun 1, 2022

with new buckets [0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1], the distribution in a contabo-S (with rate interval as 5m) looks like

Screen Shot 2022-06-01 at 15 36 48

@twoeths
Copy link
Contributor Author

twoeths commented Jun 2, 2022

was able to catch a 775ms heartbeat, it shows that most of the time of heartbeat() function is due to emitGossip() call
Screen Shot 2022-06-02 at 09 49 48

0522_contabo_20_subscribe_all_subnets.cpuprofile.zip

right now we call emitGossip() 2 times in heartbeat() inside the fanout loop and mesh loop, and we may gossip messages to a peer twice in the same heartbeat

Update: metrics show that fanout map's size is consistently, however we still need to improve the logic

@dapplion
Copy link
Contributor

dapplion commented Jun 2, 2022

Why is emitGossip so costly? This needs benchmark testing. emitGossip is taking 2% of total CPU time

Screenshot from 2022-06-02 12-23-10

@twoeths
Copy link
Contributor Author

twoeths commented Jun 2, 2022

most of the cost is for looping through topic peers

Screen Shot 2022-06-02 at 14 09 58

and for a lodestar node connecting to all attestation subnets, there are about 70 topics in total => we call emitGossip() ~70 times per heartbeat

@twoeths
Copy link
Contributor Author

twoeths commented Jun 15, 2022

for the record, runHeartBeat() takes 3.6% of cpu time on a lodestar's subscribeAllSubnets node, should take another profile when all of the improvements are merged

Screen Shot 2022-06-15 at 10 21 05

0522_contabo_20_subscribe_all_subnets.cpuprofile.zip

@stale
Copy link

stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta-stale label Aug 13, 2022
@twoeths
Copy link
Contributor Author

twoeths commented Aug 14, 2022

need to see if this still happens in the latest release (4.0.0 as of Aug 2022)

@stale stale bot removed the meta-stale label Aug 14, 2022
@twoeths
Copy link
Contributor Author

twoeths commented Aug 25, 2022

Heartbeat duration is improved significantly (v0.14.0 vs v4.1.1)

Screen Shot 2022-08-25 at 10 17 56

@twoeths
Copy link
Contributor Author

twoeths commented Sep 28, 2022

This metric is incredible in v4.1.1

Screen Shot 2022-09-28 at 09 50 12

@twoeths twoeths closed this as completed Sep 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants