node is drained if job is canceled during epilog #5319
Hm, that doesn't seem right. We should probably look into it. Is the instance exiting at this point? Doesn't the broker kill all subprocesses at exit, so perhaps this is what we're seeing? Or has this reproduced with a non-exiting instance?
It reproduces with a non-exiting instance. The test added in flux-framework/flux-sched#1044 cancels jobs in the middle and has to undrain in order to continue.
Oh, oops. The perilog module terminates the job prolog on a job exception, but the handler doesn't ensure it is the prolog that is running and not the epilog.

```diff
diff --git a/src/modules/job-manager/plugins/perilog.c b/src/modules/job-manager/plugins/perilog.c
index dd8cbc400..2b8373b23 100644
--- a/src/modules/job-manager/plugins/perilog.c
+++ b/src/modules/job-manager/plugins/perilog.c
@@ -460,6 +460,7 @@ static int exception_cb (flux_plugin_t *p,
     if ((proc = flux_jobtap_job_aux_get (p,
                                          FLUX_JOBTAP_CURRENT_JOB,
                                          "perilog_proc"))
+        && proc->prolog
         && flux_subprocess_state (proc->sp) == FLUX_SUBPROCESS_RUNNING) {
         if (prolog_kill (proc) < 0
             || prolog_kill_timer_start (proc,
```

I also notice the prolog will be terminated on any job exception, since the exception handler doesn't check the severity of the exception. I'll try to create a PR to fix both these issues. Sorry!
Problem: The job-manager 'perilog' plugin terminates any running prolog if the job receives an exception. This works as expected, but the code doesn't check whether the running process is the prolog or the epilog, so job epilogs are killed on job exception as well. This is not desired behavior, since it prevents the epilog from completing properly and also causes the nodes on which the epilog is running to be drained. Ensure that the running process is a prolog before killing it. Fixes flux-framework#5319
Problem: Once flux-framework/flux-core#5319 is fixed, broker ranks will not be drained if the epilog is running when a job is canceled. However, this causes t1024-alloc-check.t to fail since `flux resource undrain 0` will fail. Don't fail if the undrain of rank 0 fails, so that this test will pass even after fixing the flux-core bug.
Problem: t2304-sched-simple-alloc-check.t logs the following during cleanup:

I had been ignoring this as an as-designed feature, but thought maybe I'd better get the opinion of others in case this is a potential gotcha on a production cluster.