
Epic: Worker Monitoring #608

Open
josephjclark opened this issue Feb 21, 2024 · 1 comment
josephjclark commented Feb 21, 2024

An epic issue to provide oversight of monitoring on the worker.

The high-level brief: we need better visibility into what's going on inside the worker, especially when things go wrong.

We should consider metrics tracking, Sentry reporting, email notifications, Grafana, etc.

Related:

#603
#402

Things we want

  • When a worker claims something (we already track this in Lightning, but it will be useful later for tracking drift between what Lightning thinks and what the worker knows).
  • When a worker has to kill a run or job.
  • Memory sampling for:
    • At least the whole process tree
    • engine
    • workers? (we'll need to think about how useful this is on its own, since they are 'disposable processes' and picking them out of a crowd in monitoring may not be that useful).
  • CPU usage? (might just be solved by monitoring the pod directly)
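The memory-sampling item above could be sketched like this in the worker's own runtime. This is a minimal, hypothetical sketch: `sampleMemory` and the `samples` array are illustrative names, not existing worker APIs, and a real implementation would sample the whole process tree, not just the current process.

```javascript
// Hypothetical sketch: sample memory of the current process via the
// built-in process.memoryUsage(). A real version would also walk the
// engine/worker child processes.
const samples = [];

function sampleMemory() {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const sample = { at: Date.now(), rss, heapTotal, heapUsed };
  samples.push(sample);
  return sample;
}

const s = sampleMemory();
console.log(`rss=${s.rss} heapUsed=${s.heapUsed}/${s.heapTotal}`);
```

Periodic sampling (e.g. `setInterval(sampleMemory, 10_000)`) would then feed whatever sink we settle on.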

We need to figure out the best way to integrate this with Prometheus: do we expose an aggregate HTTP service (or use Lightning for that) to collect the metrics?

We probably don't want to use service discovery for monitoring? Or do we?
There is an advantage to workers exposing their own /metrics server: it makes the worker better for everyone.

@github-project-automation github-project-automation bot moved this to New Issues in v2 Feb 21, 2024
@christad92 christad92 moved this from New Issues to Icebox in v2 Feb 22, 2024
@josephjclark josephjclark removed their assignment Jun 19, 2024
@christad92 christad92 moved this from Icebox to Backlog in v2 Jul 4, 2024
@josephjclark commented:
This keeps coming up so I think we want to spend some time on it.

I think there are two separate but related big issues right now:

  1. Benchmarking: local tests of worker performance. We want to better understand our current performance and how it scales. This also lets us verify that future improvements are helping.
  2. Transparency: we need to better understand what the worker is doing in live environments. Does this mean more eventing? More logging? Can we have a live dashboard? Can we output performance metrics?

Some quick thoughts about possible performance bottlenecks:

  • Adaptor installation and compilation happen in the main thread. A worker which is compiling code cannot pick up new work.
  • There is no compiler caching.
  • We do want to move compilation into the thread; there's an issue around that.
  • So tests on large jobs (lots of compiler work) and maybe large inputs (lots of main-thread JSON parsing) would actually be useful. How do those things affect worker performance?

@josephjclark josephjclark self-assigned this Aug 22, 2024