Workflow notebook not available #285

Open

andriineronov opened this issue Jan 2, 2025 · 7 comments
@andriineronov

I was running the IACT simulators (MAGIC) and wanted to check the notebook progress, and got this:

[screenshot of the error message]

Not sure if this is a transient issue or a real problem.

@burnout87
Collaborator

Hi,

Could you retry by re-issuing the same request with a slightly modified parameter? Any parameter is fine, as long as a new job_id is generated.

@andriineronov
Author

Here is another try, with the job ID and details of the request:

tmp.pdf

@burnout87
Collaborator

Did it work, @andriineronov?

@dsavchenko
Member

dsavchenko commented Jan 2, 2025

After investigating with @burnout87, we found two problems involved here:

  1. The workflow resource requirements. I was able to reproduce an issue with "medium" resources in renku: the kernel dies during execution of cell 11.
    @andriineronov we can try to localize the actual command which requires excessive memory (UPD: it looks like it's actually due to CPU usage) and think about optimisations.

  2. The response from nb2workflow doesn't contain jobdir, which is why the dispatcher can't retrieve the unfinished notebook. With normal exceptions raised inside the notebook, we always have it in the response; a killed kernel is somehow special in this regard. Actions to do:

  • treat it on the dispatcher level: at least send a Sentry issue in case jobdir is missing (a rough sketch follows below)
  • further investigate how this occurs in the backend, try to mimic the test case and then fix it
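
For reference, a minimal sketch of what such a dispatcher-level guard could look like (assuming sentry_sdk is already initialised in the dispatcher; the function name and the way the response is passed around are only illustrative, not the actual dispatcher code):

```python
import sentry_sdk

def check_nb2workflow_response(response: dict, job_id: str) -> None:
    """Hypothetical dispatcher-side guard; the real handling lives in the
    dispatcher code, this only illustrates the idea."""
    if "jobdir" not in response:
        # Without jobdir the dispatcher cannot fetch the partially executed
        # notebook, so at least record the event for later investigation.
        sentry_sdk.capture_message(
            f"nb2workflow response for job {job_id} is missing 'jobdir'; "
            "the unfinished notebook cannot be retrieved",
            level="warning",
        )
```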

@dsavchenko
Copy link
Member

> 1. Kernel dies during execution of cell 11.

Precisely, it dies on MapDatasetMaker.run(), which leads to 100% CPU utilisation (and the resources are limited to 1 CPU), so the process gets killed by k8s.

I currently have no idea how to deal with it. Even if there is no multiprocessing involved (need to verify in the gammapy code whether there is), there are other processes in the container, so CPU usage is effectively over 1. Should we just increase the pod resource limits a bit? Or are there other techniques available that I'm ignorant of?
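
One generic mitigation (not verified against this particular notebook, and it is only an assumption that implicit threading contributes to the overshoot, rather than MapDatasetMaker.run() itself simply being CPU-heavy) would be to cap the implicit BLAS/OpenMP thread pools before numpy/gammapy are imported in the notebook:

```python
import os

# These caps must be set before numpy/gammapy are imported to take effect.
# Assumption: thread oversubscription by the numerical libraries is part of
# why the container exceeds its 1-CPU limit.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after the thread caps on purpose)
```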

@volodymyrss
Copy link
Member

> 1. Kernel dies during execution of cell 11.
>
> Precisely, it dies on MapDatasetMaker.run(), which leads to 100% CPU utilisation (and the resources are limited to 1 CPU), so the process gets killed by k8s.
>
> I currently have no idea how to deal with it. Even if there is no multiprocessing involved (need to verify in the gammapy code whether there is), there are other processes in the container, so CPU usage is effectively over 1. Should we just increase the pod resource limits a bit? Or are there other techniques available that I'm ignorant of?

I think we should increase the pod resource limits. There might sometimes be multiprocessing involved as well. Ideally, the limit should be read from notebook annotations, but the default should probably also be slightly more than one CPU (the resource requests might remain lower).
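
Purely as an illustration of "requests lower than limits, limits slightly above one CPU": something along these lines, expressed with the kubernetes Python client. The concrete values, the annotation key, and how nb2workflow would actually read notebook annotations are all assumptions here, not the existing behaviour:

```python
from kubernetes import client

# Illustrative defaults only: the request stays low so scheduling is not too
# greedy, while the limit leaves headroom above one full CPU for the kernel
# plus the other processes running in the container.
default_resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "2Gi"},
    limits={"cpu": "1500m", "memory": "4Gi"},
)

def apply_notebook_annotations(resources, annotations: dict):
    """Hypothetical override: if the notebook declares its own CPU limit in
    an annotation, prefer it over the default."""
    if "cpu_limit" in annotations:
        resources.limits["cpu"] = annotations["cpu_limit"]
    return resources
```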

@andriineronov
Author

Just in case, I have for now removed LST1_events from the IACT simulators, so that people who check the MAGIC simulator after Julian's e-mail do not click on something that is not working properly. Also, I re-did LST1_events under the LST tab using Ievgen's package; I am not sure what he is using inside, but it does not fail on either the production or the staging version (you can check).
