Workflow notebook not available #285

Open

andriineronov opened this issue Jan 2, 2025 · 7 comments
@andriineronov

I was running the IACT simulators (MAGIC) and wanted to check the notebook progress, and got this:

[screenshot of the error message]

Not sure if this is a transient issue or a real problem.

@burnout87
Collaborator

Hi,

Could you retry by re-issuing the same request with a slightly modified parameter? Any parameter is fine, as long as a new job_id is generated.

@andriineronov
Author

Here is another try, with the job ID and details of the request:

tmp.pdf

@burnout87
Collaborator

Did it work, @andriineronov?

@dsavchenko
Member

dsavchenko commented Jan 2, 2025

After investigating with @burnout87, we found two problems involved here:

  1. The workflow resource requirements. I was able to reproduce an issue with "medium" resources in renku: the kernel dies during execution of cell 11.
    @andriineronov we can try to localize the actual command which requires excessive memory (UPD: it looks like it's actually due to CPU usage) and think about optimisations.

  2. The response from nb2workflow doesn't contain jobdir, which is why the dispatcher can't retrieve the unfinished notebook. With normal exceptions raised inside the notebook, we always have it in the response; a killed kernel is somehow special in this regard. Actions to do:

  • treat it on the dispatcher level: at least send a Sentry issue in case jobdir is missing (a rough sketch follows below)
  • further investigate how this occurs in the backend, try to mimic the test case and then fix it
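
For reference, a minimal sketch of what such a dispatcher-level guard could look like (assuming sentry_sdk is already initialised in the dispatcher; the function name and the way the response is passed around are only illustrative, not the actual dispatcher code):

```python
import sentry_sdk

def check_nb2workflow_response(response: dict, job_id: str) -> None:
    """Hypothetical dispatcher-side guard; the real handling lives in the
    dispatcher code, this only illustrates the idea."""
    if "jobdir" not in response:
        # Without jobdir the dispatcher cannot fetch the partially executed
        # notebook, so at least record the event for later investigation.
        sentry_sdk.capture_message(
            f"nb2workflow response for job {job_id} is missing 'jobdir'; "
            "the unfinished notebook cannot be retrieved",
            level="warning",
        )
```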

@dsavchenko
Copy link
Member

> 1. Kernel dies during execution of cell 11.

Precisely, it dies on MapDatasetMaker.run(), which leads to 100% CPU utilisation (and the resources are limited to 1 CPU), so the process gets killed by k8s.

I currently have no idea how to deal with it. Even if there is no multiprocessing involved (need to verify in the gammapy code whether there is), there are other processes in the container, so CPU usage is effectively over 1. Should we just increase the pod resource limits a bit? Or are there other techniques available that I'm ignorant of?
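
One generic mitigation (not verified against this particular notebook, and it is only an assumption that implicit threading contributes to the overshoot, rather than MapDatasetMaker.run() itself simply being CPU-heavy) would be to cap the implicit BLAS/OpenMP thread pools before numpy/gammapy are imported in the notebook:

```python
import os

# These caps must be set before numpy/gammapy are imported to take effect.
# Assumption: thread oversubscription by the numerical libraries is part of
# why the container exceeds its 1-CPU limit.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after the thread caps on purpose)
```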

@volodymyrss
Copy link
Member

> 1. Kernel dies during execution of cell 11.
>
> Precisely, it dies on MapDatasetMaker.run(), which leads to 100% CPU utilisation (and the resources are limited to 1 CPU), so the process gets killed by k8s.
>
> I currently have no idea how to deal with it. Even if there is no multiprocessing involved (need to verify in the gammapy code whether there is), there are other processes in the container, so CPU usage is effectively over 1. Should we just increase the pod resource limits a bit? Or are there other techniques available that I'm ignorant of?

I think we should increase the pod resource limits. There might sometimes be multiprocessing involved as well. Ideally, the limit should be read from notebook annotations, but the default should probably also be slightly more than one CPU (the resource requests might remain lower).
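
Purely as an illustration of "requests lower than limits, limits slightly above one CPU": something along these lines, expressed with the kubernetes Python client. The concrete values, the annotation key, and how nb2workflow would actually read notebook annotations are all assumptions here, not the existing behaviour:

```python
from kubernetes import client

# Illustrative defaults only: the request stays low so scheduling is not too
# greedy, while the limit leaves headroom above one full CPU for the kernel
# plus the other processes running in the container.
default_resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "2Gi"},
    limits={"cpu": "1500m", "memory": "4Gi"},
)

def apply_notebook_annotations(resources, annotations: dict):
    """Hypothetical override: if the notebook declares its own CPU limit in
    an annotation, prefer it over the default."""
    if "cpu_limit" in annotations:
        resources.limits["cpu"] = annotations["cpu_limit"]
    return resources
```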

@andriineronov
Author

Just in case, I have for now removed LST1_events from the IACT simulators, so that people who check the MAGIC simulator after Julian's e-mail do not click on something that is not working properly. Also, I re-did LST1_events under the LST tab using Ievgen's package; I am not sure what he is using inside, but it does not fail on either the production or the staging version (you can check).
