
Extreme I/O lag on overloaded nodes #4002

Closed
dapplion opened this issue May 10, 2022 · 4 comments
Labels
Epic Issues used as milestones and tracking multiple issues. scope-performance Performance issue and ideas to improve performance.

Comments

@dapplion
Contributor

dapplion commented May 10, 2022

Because of its single-threaded nature, NodeJS suffers a non-linear degradation of performance under heavy load. We have circumstantial evidence that on low-power machines, performance suffers when attempting to run too many keys. Overall time to perform tasks increases, but the time of any network call (e.g. the VC calling the beacon API) increases by 10x-100x relative to internal function times (e.g. block processing).
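
For reference, the event-loop lag behind this blow-up can be observed in plain NodeJS with the built-in `perf_hooks` API. A minimal sketch (illustrative only, not Lodestar code; the 12s sampling interval is arbitrary):

```ts
import {monitorEventLoopDelay} from "node:perf_hooks";

// Sample event-loop delay; reported values are in nanoseconds
const histogram = monitorEventLoopDelay({resolution: 20});
histogram.enable();

setInterval(() => {
  // Under heavy load the tail (p99/max) grows much faster than the mean,
  // which is the same pattern described above for API call times
  console.log(
    `event loop delay: mean=${(histogram.mean / 1e6).toFixed(1)}ms`,
    `p99=${(histogram.percentile(99) / 1e6).toFixed(1)}ms`,
    `max=${(histogram.max / 1e6).toFixed(1)}ms`
  );
  histogram.reset();
}, 12_000);
```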

When running large numbers of keys on sufficiently powerful servers, the issue is not apparent. However, it still poses a future risk:

  • for solo stakers attempting to run Lodestar on under-powered machines
  • post-merge, due to extra calls to the EL and mev-boost on time-critical code paths

Step 1: quantify

Circumstantial evidence is not enough; we need properly documented metrics on when this issue manifests and how bad it is. We will run Lodestar in different configurations and on different servers:

Validator configurations:

  • beacon + 1 validator
  • beacon + 1 validator --subscribeAllSubnets
  • beacon + 8 validators
  • beacon + 16 validators
  • beacon + 32 validators
  • beacon + 64 validators

Servers:

  • Contabo Cloud VPS S
  • Contabo Cloud VPS M
  • Hetzner cloud CX41

Data to collect:

  • Validator request times histogram `vc_rest_api_client_request_time_seconds_bucket{routeId="produceAttestationData"}`. Average, median, p95, p99 (an instrumentation sketch follows below).
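
For context, the shape of the instrumentation behind that metric can be sketched with `prom-client`. The metric name mirrors the one above, but the buckets and the wrapper helper are assumptions for illustration, not Lodestar's actual code:

```ts
import {Histogram} from "prom-client";

// Assumed buckets: the "% > 1s" / "% > 5s" thresholds depend on having le="1" and le="5" buckets
const restApiClientRequestTime = new Histogram({
  name: "vc_rest_api_client_request_time_seconds",
  help: "Time of REST API calls made by the validator client, by route",
  labelNames: ["routeId"],
  buckets: [0.01, 0.1, 0.5, 1, 5],
});

// Hypothetical wrapper: time one API call and record it under its routeId label
async function timedApiCall<T>(routeId: string, call: () => Promise<T>): Promise<T> {
  const stopTimer = restApiClientRequestTime.startTimer({routeId});
  try {
    return await call();
  } finally {
    stopTimer(); // records the elapsed seconds into the histogram
  }
}
```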

Step 2: reproduce

If the data above proves that this issue is of sufficient severity, we need a simple reproducible case to gather help and potential solutions (a minimal sketch follows the list below). A reproducible case must:

  • Be a standalone repo
  • Contain no Lodestar-specific code, nor any library at all: pure vanilla NodeJS code
  • Create conditions similar to what Lodestar experiences: very high load caused by many short async tasks
  • Show that request times measured by an external caller (in a separate process) follow a similar pattern to the data collected in step 1
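
A minimal sketch of such a reproduction, using only Node.js built-ins (file name, task sizes and intervals are illustrative):

```ts
// loaded-node.ts - saturates the event loop with many short async tasks and
// exposes a trivial HTTP endpoint whose latency can be probed from another process
import http from "node:http";

// Each "task" does a small amount of CPU work, like validating one gossip message
function shortTask(): void {
  let acc = 0;
  for (let i = 0; i < 50_000; i++) acc += Math.sqrt(i);
}

// Queue a continuous stream of short tasks; raise TASKS_PER_TICK until lag appears
const TASKS_PER_TICK = 500;
setInterval(() => {
  for (let i = 0; i < TASKS_PER_TICK; i++) setImmediate(shortTask);
}, 100);

// Endpoint standing in for the beacon API route the VC calls
http.createServer((_req, res) => res.end("ok")).listen(8080);
```

From a separate process, a simple loop of timed requests against `http://localhost:8080` should then show the same long-tail latency pattern as the data collected in step 1.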

Step 3: mitigate

Disclaimer: The root solution here is to reduce Lodestar load via optimizations or offloading to workers.

Provided that Lodestar is under heavy load, we want to minimize this I/O lag to acceptable levels, such that validator performance doesn't degrade. If a VC request time increases from 50ms to 500ms, it's tolerable. However, if it increases from 50ms to 5000ms, the validator may miss an attestation and lose profitability.
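
As one illustrative direction for the "offloading to workers" part of the disclaimer (not necessarily the approach Lodestar ends up taking), a minimal `worker_threads` sketch that keeps a CPU-heavy batch off the main event loop:

```ts
import {Worker, isMainThread, parentPort, workerData} from "node:worker_threads";

// Hypothetical CPU-heavy job standing in for e.g. batch signature verification
function heavyJob(items: number[]): number {
  return items.reduce((acc, x) => acc + Math.sqrt(x), 0);
}

if (isMainThread) {
  // Main thread: hand the batch to a worker so the event loop stays free to answer VC requests.
  // __filename assumes CommonJS output; with ESM use new URL(import.meta.url) instead.
  function runInWorker(items: number[]): Promise<number> {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, {workerData: items});
      worker.once("message", resolve);
      worker.once("error", reject);
    });
  }

  runInWorker(Array.from({length: 1_000_000}, (_, i) => i)).then((sum) =>
    console.log("done:", sum)
  );
} else {
  // Worker thread: do the heavy work off the main thread and post the result back
  parentPort!.postMessage(heavyJob(workerData as number[]));
}
```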

Related issues

This may be the cause of the following issues:

@dapplion dapplion added the Epic Issues used as milestones and tracking multiple issues. label May 10, 2022
@twoeths twoeths self-assigned this May 20, 2022
@philknows philknows added the prio-high Resolve issues as soon as possible. label May 20, 2022
@twoeths
Contributor

twoeths commented May 22, 2022

Contabo machines

- Network: Prater
- Slot range: 4 days until [3078725](https://www.beaconcha.in/block/3078725)
- Machine: Contabo - S
- Run config: systemd
- Lodestar version: v0.36.0

| # | Avg (12h) | Median (12h) | p95 (12h) | p99 (12h) | % > 1s | % > 5s |
|---|---|---|---|---|---|---|
| Node with 1 validator | 100ms - 200ms | 48ms - 63ms | 600ms - 3s | 3.5s - 5s | 2% - 9% | 0% - 3% |
| Node with 1 validator + subscribeAllSubnets | 1.8s - 2.4s | 890ms - 1.7s | 5s | 5s | 46% - 57% | 7% - 19% |
| Node with 8 validators | 130ms - 174ms | 80ms - 110ms | 923ms - 940ms | 2s - 3.2s | 1.2% - 2% | 0% - 0.25% |
| Node with 16 validators | 190ms - 330ms | 190ms - 330ms | 900ms - 1s | 1.4s - 4s | 1.1% - 4% | 0% - 0.6% |
| Node with 32 validators | 320ms - 500ms | 300ms - 450ms | 3s - 3.5s | 4.8s - 5s | 4.7% - 11% | 0.7% - 1.5% |
| Node with 64 validators | 0.6s - 1.9s | 0.4s - 1.3s | 4s - 5s | 5s | 13% - 53% | 2% - 11% |

There is a huge difference between a regular beacon node with 1 validator and one with 1 validator plus subscribeAllSubnets (which is equivalent to the subnet load of 64 validators).


- Network: Prater
- Slot range: 2 days until [3093675](https://www.beaconcha.in/block/3093675)
- Machine: Contabo - S
- Run config: systemd
- Lodestar version: v0.37.0.beta.0

| # | Avg (12h) | Median (12h) | p95 (12h) | p99 (12h) | % > 1s | % > 5s |
|---|---|---|---|---|---|---|
| Node with 1 validator | 100ms - 380ms | 40ms - 64ms | 600ms - 1.6s | 3.5s - 5s | 2.6% - 5.4% | 0% - 2.7% |
| Node with 1 validator + subscribeAllSubnets | 850ms | 550ms | 4.4s | 5s | 28% | 0.9% |
| Node with 8 validators | 140ms - 160ms | 60ms - 65ms | 900ms - 920ms | 3s - 3.5s | 2% - 2.5% | 0% - 0.13% |
| Node with 16 validators | 180ms - 400ms | 90ms - 350ms | 1s - 2s | 5s | 2% - 6% | 0% - 1.4% |
| Node with 32 validators | 370ms - 660ms | 200ms - 460ms | 2.5s - 4.4s | 5s | 7.4% - 17.8% | 0.9% - 2.4% |
| Node with 64 validators | 220ms - 570ms | 80ms - 280ms | 990ms - 4.1s | 5s | 3.8% - 15.8% | 0.8% - 2.3% |

Hetzner machines

- Network: Prater
- Slot range: 2.5 days until [3094125](https://www.beaconcha.in/block/3094125)
- Machine: Hetzner
- Run config: Docker
- Lodestar version: v0.36.0

| # | Avg (12h) | Median (12h) | p95 (12h) | p99 (12h) | % > 1s | % > 5s |
|---|---|---|---|---|---|---|
| Node with 916 validators | 13ms - 20ms | 8.3ms - 10ms | 89ms - 93ms | 100ms - 170ms | 0% - 0.4% | 0% |

- Network: Prater
- Slot range: 2.5 days until [3094125](https://www.beaconcha.in/block/3094125)
- Machine: Hetzner
- Run config: Docker
- Lodestar version: v0.37.0.beta.0

| # | Avg (12h) | Median (12h) | p95 (12h) | p99 (12h) | % > 1s | % > 5s |
|---|---|---|---|---|---|---|
| Node with 918 validators | 13ms - 17ms | 6.5ms - 7.1ms | 87ms - 92ms | 560ms - 730ms | 0.14% - 0.25% | 0% |

@twoeths
Contributor

twoeths commented May 26, 2022

| Metric name | Query |
|---|---|
| Avg | `rate(vc_rest_api_client_request_time_seconds_sum{routeId="produceAttestationData"}[12h]) / rate(vc_rest_api_client_request_time_seconds_count{routeId="produceAttestationData"}[12h])` |
| Median | `histogram_quantile(0.5, sum(rate(vc_rest_api_client_request_time_seconds_bucket{routeId="produceAttestationData"}[12h])) by (le))` |
| p95 | `histogram_quantile(0.95, sum(rate(vc_rest_api_client_request_time_seconds_bucket{routeId="produceAttestationData"}[12h])) by (le))` |
| p99 | `histogram_quantile(0.99, sum(rate(vc_rest_api_client_request_time_seconds_bucket{routeId="produceAttestationData"}[12h])) by (le))` |
| % > 1s | `1 - (sum(rate(vc_rest_api_client_request_time_seconds_bucket{le="1", routeId="produceAttestationData"}[12h])) by (job) / sum(rate(vc_rest_api_client_request_time_seconds_count{routeId="produceAttestationData"}[12h])) by (job))` |
| % > 5s | `1 - (sum(rate(vc_rest_api_client_request_time_seconds_bucket{le="5", routeId="produceAttestationData"}[12h])) by (job) / sum(rate(vc_rest_api_client_request_time_seconds_count{routeId="produceAttestationData"}[12h])) by (job))` |

@twoeths
Contributor

twoeths commented Jun 23, 2022

Right now we use sync APIs in some places, which block the event loop.
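
For illustration only (the specific call sites are not named in this comment), the general shape of that problem and its fix, using `fs` as a stand-in:

```ts
import fs from "node:fs";

// Blocking: the event loop (and every pending API response) stalls for the whole read
function loadStateSync(path: string): Buffer {
  return fs.readFileSync(path);
}

// Non-blocking: the read runs on libuv's thread pool; the event loop keeps servicing requests
async function loadStateAsync(path: string): Promise<Buffer> {
  return fs.promises.readFile(path);
}
```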

@dapplion dapplion removed the prio-high Resolve issues as soon as possible. label Jun 29, 2022
@dapplion dapplion added the scope-performance Performance issue and ideas to improve performance. label Jun 29, 2022
@twoeths
Contributor

twoeths commented Oct 19, 2023

As of Oct 2023, we have implemented the network thread and batch attestation validation, so this is no longer an issue.

@twoeths twoeths closed this as completed Oct 19, 2023