When a worker claims something (we already do this in Lightning, but it will be useful later to track drift between what Lightning thinks and what the worker knows).
When a worker has to kill a run or job.
Memory sampling for:
At least the whole process tree
engine
workers? (We will need to think about how useful this is on its own, since they are 'disposable processes' and picking them out of a crowd in monitoring may not be that useful.)
CPU usage? (might just be solved by monitoring the pod directly)
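To make "the whole process tree" concrete, here is a minimal sketch of how per-process samples could be rolled up into one number per tree. The `ProcRow` shape, the `treeRssBytes` helper and the pids are all hypothetical; it assumes something else (for example a parsed `ps` listing, converted to bytes) has already produced one row per process.

```typescript
// Hypothetical row shape: one entry per process, e.g. parsed from a `ps` listing.
interface ProcRow {
  pid: number;
  ppid: number;
  rssBytes: number;
}

// Sum resident memory for rootPid and all of its descendants.
function treeRssBytes(rows: ProcRow[], rootPid: number): number {
  // Index children by parent pid so the walk is linear in the number of rows.
  const children = new Map<number, ProcRow[]>();
  for (const row of rows) {
    const siblings = children.get(row.ppid) ?? [];
    siblings.push(row);
    children.set(row.ppid, siblings);
  }

  const root = rows.find((row) => row.pid === rootPid);
  if (!root) return 0; // unknown pid: nothing to report

  let total = 0;
  const seen = new Set<number>();
  const stack: ProcRow[] = [root];
  while (stack.length > 0) {
    const current = stack.pop()!;
    if (seen.has(current.pid)) continue; // guard against malformed or cyclic input
    seen.add(current.pid);
    total += current.rssBytes;
    for (const child of children.get(current.pid) ?? []) stack.push(child);
  }
  return total;
}

const MiB = 1024 * 1024;

// Made-up sample: a worker (100), a child (101), a grandchild (102), and an unrelated process (200).
const exampleRows: ProcRow[] = [
  { pid: 100, ppid: 1, rssBytes: 120 * MiB },
  { pid: 101, ppid: 100, rssBytes: 300 * MiB },
  { pid: 102, ppid: 101, rssBytes: 80 * MiB },
  { pid: 200, ppid: 1, rssBytes: 64 * MiB },
];

// 120 + 300 + 80 = 500 MiB for the tree rooted at pid 100; pid 200 is excluded.
const workerTreeBytes = treeRssBytes(exampleRows, 100);
```

Reporting the tree total alongside a separate figure for a single child sidesteps the question of picking individual short-lived processes out of a crowd: the total stays meaningful even as children come and go.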
This keeps coming up so I think we want to spend some time on it.
I think there are two separate but related big issues right now:
Benchmarking: local tests on the worker's performance. We want to better understand our current performance and how it scales. This also lets us verify that future improvements are helping.
Transparency: we need to better understand what the worker is doing in live environments. Does this mean more eventing? More logging? Can we have a live dashboard? Can we output performance metrics?
Some quick thoughts about possible performance bottlenecks:
Adaptor installation and compilation happen in the main thread. A worker which is compiling code cannot pick up new work.
There is no compiler caching.
We do want to move compilation into the thread; there's an issue open for that.
So tests on large jobs (lots of compiler work), and maybe on large inputs (lots of main-thread JSON parsing), would actually be useful. How do those things affect compiler performance?
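As a rough starting point for the large-input case, a local benchmark could look something like the sketch below. `timeSync` and `makePayload` are hypothetical helpers, not part of the worker, and the synthetic payload is only an assumption about what a "large input" looks like. `Date.now()` is used for portability; `performance.now()` would give finer resolution.

```typescript
interface TimingResult {
  iterations: number;
  meanMs: number;
  maxMs: number;
}

// Run a synchronous task repeatedly and record how long the calling thread was blocked.
function timeSync(task: () => void, iterations: number): TimingResult {
  let totalMs = 0;
  let maxMs = 0;
  for (let i = 0; i < iterations; i++) {
    const start = Date.now();
    task();
    const elapsed = Date.now() - start;
    totalMs += elapsed;
    if (elapsed > maxMs) maxMs = elapsed;
  }
  return { iterations, meanMs: totalMs / iterations, maxMs };
}

// Build a synthetic JSON document with itemCount small records.
function makePayload(itemCount: number): string {
  const items: Array<{ id: number; name: string }> = [];
  for (let i = 0; i < itemCount; i++) {
    items.push({ id: i, name: `item-${i}` });
  }
  return JSON.stringify({ data: items });
}

// Time main-thread parsing of a moderately large input, five times over.
const payload = makePayload(50000);
const parseTiming = timeSync(() => {
  JSON.parse(payload);
}, 5);
```

Running the same harness at a few payload sizes (and, separately, around a compile step) would show how main-thread blocking scales, which is the period during which a worker cannot pick up new work.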
An epic issue to have oversight over monitoring on the worker.
The high-level brief is: we need better visibility of what's going on inside the worker, especially when things go wrong.
We should consider metrics tracking, Sentry reporting, email notification, Grafana, etc.
Related:
#603
#402
Things we want
We need to figure out the best approach for integrating this into Prometheus. Do we expose an aggregate HTTP service that collects up the metrics (or use Lightning for that)?
We probably don't want to use service discovery for monitoring. Or do we?
There is an advantage to workers exposing their own /metrics server: it makes the worker better for everyone.
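For reference, the body a worker's own /metrics endpoint would return is plain text in the Prometheus exposition format. The sketch below renders that text by hand to show the shape. The metric names (`worker_active_runs`, `worker_memory_rss_bytes`) and the `component` label are made up for illustration, and help strings are assumed to contain no newlines. In practice a client library would normally do this rendering.

```typescript
type Labels = Record<string, string>;

interface Metric {
  name: string;
  help: string;
  type: 'gauge' | 'counter';
  samples: Array<{ labels?: Labels; value: number }>;
}

// Escape label values as the text format requires: backslash, double quote, newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render {key="value",...}, or an empty string when there are no labels.
function formatLabels(labels: Labels = {}): string {
  const keys = Object.keys(labels);
  if (keys.length === 0) return '';
  const pairs = keys.map((key) => `${key}="${escapeLabelValue(labels[key])}"`);
  return `{${pairs.join(',')}}`;
}

// Emit one HELP line and one TYPE line per metric, then one line per sample.
function renderMetrics(metrics: Metric[]): string {
  const lines: string[] = [];
  for (const metric of metrics) {
    lines.push(`# HELP ${metric.name} ${metric.help}`);
    lines.push(`# TYPE ${metric.name} ${metric.type}`);
    for (const sample of metric.samples) {
      lines.push(`${metric.name}${formatLabels(sample.labels)} ${sample.value}`);
    }
  }
  return lines.join('\n') + '\n';
}

// Hypothetical metrics a worker might report.
const exampleBody = renderMetrics([
  {
    name: 'worker_active_runs',
    help: 'Runs currently executing on this worker',
    type: 'gauge',
    samples: [{ value: 2 }],
  },
  {
    name: 'worker_memory_rss_bytes',
    help: 'Resident memory by component',
    type: 'gauge',
    samples: [
      { labels: { component: 'worker' }, value: 125829120 },
      { labels: { component: 'engine' }, value: 314572800 },
    ],
  },
]);
```

Serving `exampleBody` from a GET /metrics handler with the content type `text/plain; version=0.0.4` is enough for Prometheus to scrape it. An aggregate service (or Lightning) could scrape the same endpoint on each worker, so exposing it per worker does not rule out the aggregate option.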