Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve worker event loop lag monitoring #6720

Open
2 tasks
jeluard opened this issue May 1, 2024 · 7 comments
Open
2 tasks

Improve worker event loop lag monitoring #6720

jeluard opened this issue May 1, 2024 · 7 comments
Assignees
Labels
scope-performance Performance issue and ideas to improve performance.

Comments

@jeluard
Copy link
Contributor

jeluard commented May 1, 2024

collectNodeJSMetrics relies on prom-client to collect eventLoopMonitoring. This in turns relies on node:perf_hooks/monitorEventLoopDelay that only works for the main thread or worker from the cluster module (as documented here).
woker_threads support is unclear, although we relies on this for worker_threads monitoring (network and discv5).

Proposed action items

Improve event loop monitoring, especially for workers. It looks preferable not to rely on prom-client for this specific monitoring and add support directly in lodestar.

@jeluard jeluard added the scope-performance Performance issue and ideas to improve performance. label May 1, 2024
@jeluard jeluard self-assigned this May 1, 2024
@nflaig
Copy link
Member

nflaig commented May 1, 2024

node:perf_hooks that only works for the main thread

I would expect perf hooks to work inside worker threads, where did you read this?

@wemeetagain
Copy link
Member

Specifically, woker_threads support is yet missing- siimon/prom-client#401
although we incorrectly relies on [prom-client] for workers monitoring

This isn't true, not in the way its suggested here. prom-client doesn't automagically instrument workers. We just do it manually, so we are not incorrectly relying prom-client.

The PR linked in that issue does this too. It has the main thread communicate with the worker to get data from the worker and include it in the metric registry of the main thread.

This is exactly what we do in lodestar. See https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/node/nodejs.ts#L277

@jeluard
Copy link
Contributor Author

jeluard commented May 2, 2024

worker_threads defines eventLoopUtilization in addition to the perf_hooks one, with this comment:

The same call as `perf_hooks eventLoopUtilization`, except the values of the worker instance are returned.

Its implementation itself required low-level changes. There is no equivalent to worker_threads for monitorEventLoopDelay so I would lean towards thinking it reports main thread info when called from within a worker. I couldn't find conclusive info on this though.

My main feeling is that eventLoopUtilization provides extra interesting information, and explicitly supports worker_threads.
I will update this PR so that to clarify this and correct imprecisions.

@nflaig
Copy link
Member

nflaig commented May 2, 2024

I would lean towards thinking it reports main thread info when called from within a worker

This would be really bad if nodejs just does this without having a note in the docs, which they usually do a great job of noting limitations. This is really simple to reproduce as well, if this is true we might also wanna report this upstream to node.

Based on previous data #5604, the event loop lag seems to be reported correctly.

@wemeetagain
Copy link
Member

eventLoopUtilization ... explicitly supports worker_threads

This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".

Would agree with @nflaig that based on the data we've seen and used in the past, we can assume that monitorEventLoopDelay works properly in workers (just that the data must be manually communicated to the main thread).

eventLoopUtilization provides extra interesting information

Yeah, agree there, could be worth adding as an additional metric

@jeluard
Copy link
Contributor Author

jeluard commented May 2, 2024

This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".

By this I mean that there are 2 distinct API, performance.eventLoopUtilization and worker.performance.eventLoopUtilization
Both can be accessed from the main thread (forworker.performance.eventLoopUtilization, via the worker reference) and worker threads.

@wemeetagain
Copy link
Member

Both can be accessed from the main thread

I don't think we want to use this feature. (it gets more complicated when workers spawn other workers, like we currently do)

Rather we can just use the current pattern of having each worker collect its own metrics (and as a byproduct, reuse the existing mechanism for communicating those to the reporting thread).

Recommend appending new metrics collection here, which will automatically be applied to all our workers: https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/metrics/nodeJsMetrics.ts

@jeluard jeluard changed the title Incorrect worker event loop lag monitoring Improve worker event loop lag monitoring May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
scope-performance Performance issue and ideas to improve performance.
Projects
None yet
Development

No branches or pull requests

3 participants