Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node is drained if job is canceled during epilog #5319

Closed
garlick opened this issue Jul 7, 2023 · 3 comments
Closed

node is drained if job is canceled during epilog #5319

garlick opened this issue Jul 7, 2023 · 3 comments

Comments

@garlick
Copy link
Member

garlick commented Jul 7, 2023

Problem: t2304-sched-simple-alloc-check.t logs the following during cleanup:

flux-cancel: Canceled 1 job (0 errors)
Jul 07 18:51:31.221338 job-manager.err[0]: f3YA1iNB: epilog: stderr: f3YA1iNB: rank 0 failed epilog, draining

I had been ignoring this as an as-designed feature, but thought maybe I'd better get the opinion of others in case this is a potential gotcha on a production cluster.

@grondo
Copy link
Contributor

grondo commented Jul 7, 2023

Hm, that doesn't seem right. We should probably look into it.

Is the instance exiting at this point? Doesn't the broker kill all subprocesses at exit, so perhaps this is what we're seeing? Or has this reproduced with a non-exiting instance?

@garlick
Copy link
Member Author

garlick commented Jul 7, 2023

It reproduces with non-exiting. The test added in flux-framework/flux-sched#1044 cancels jobs in the middle and has to undrain in order to continue.

@grondo
Copy link
Contributor

grondo commented Jul 8, 2023

Oh, oops. The perilog module terminates the job prolog on a job exception, but the handler doesn't ensure it is the prolog that is running and not the epilog.

diff --git a/src/modules/job-manager/plugins/perilog.c b/src/modules/job-manager/plugins/perilog.c
index dd8cbc400..2b8373b23 100644
--- a/src/modules/job-manager/plugins/perilog.c
+++ b/src/modules/job-manager/plugins/perilog.c
@@ -460,6 +460,7 @@ static int exception_cb (flux_plugin_t *p,
     if ((proc = flux_jobtap_job_aux_get (p,
                                         FLUX_JOBTAP_CURRENT_JOB,
                                         "perilog_proc"))
+        && proc->prolog
         && flux_subprocess_state (proc->sp) == FLUX_SUBPROCESS_RUNNING) {
         if (prolog_kill (proc) < 0
             || prolog_kill_timer_start (proc,

I also notice the prolog will be terminated on any job exception, since the exception handler doesn't check the severity of the exception. I'll try to create a PR to fix both these issues. Sorry!

grondo added a commit to grondo/flux-core that referenced this issue Jul 9, 2023
Problem: The job-manager 'perilog' plugin terminates any running prolog
if the job receives an exception. This works as expected, but the code
doesn't check whether a running process is the prolog or epilog, so
job epilogs are killed on job exception as well. This isn't desired
behavior since this doesn't allow the epilog to properly complete,
and also causes the nodes on which the epilog is running to be drained.

Ensure that the running process is a prolog before killing it.

Fixes flux-framework#5319
grondo added a commit to grondo/flux-core that referenced this issue Jul 9, 2023
Problem: The job-manager 'perilog' plugin terminates any running prolog
if the job receives an exception. This works as expected, but the code
doesn't check whether a running process is the prolog or epilog, so
job epilogs are killed on job exception as well. This isn't desired
behavior since this doesn't allow the epilog to properly complete,
and also causes the nodes on which the epilog is running to be drained.

Ensure that the running process is a prolog before killing it.

Fixes flux-framework#5319
grondo added a commit to grondo/flux-sched that referenced this issue Jul 9, 2023
Problem: Once flux-framework/flux-core#5319 is fixed, broker
ranks will not be drained if the epilog is running when a job is
canceled. However, this causes t1024-alloc-check.t to fail since
`flux resource undrain 0` will fail.

Don't fail if the undrain of rank 0 fails, so that this test will
pass even after fixing the flux-core bug.
grondo added a commit to grondo/flux-core that referenced this issue Jul 9, 2023
Problem: The job-manager 'perilog' plugin terminates any running prolog
if the job receives an exception. This works as expected, but the code
doesn't check whether a running process is the prolog or epilog, so
job epilogs are killed on job exception as well. This isn't desired
behavior since this doesn't allow the epilog to properly complete,
and also causes the nodes on which the epilog is running to be drained.

Ensure that the running process is a prolog before killing it.

Fixes flux-framework#5319
grondo added a commit to grondo/flux-sched that referenced this issue Jul 9, 2023
Problem: Once flux-framework/flux-core#5319 is fixed, broker
ranks will not be drained if the epilog is running when a job is
canceled. However, this causes t1024-alloc-check.t to fail since
`flux resource undrain 0` will fail.

Don't fail if the undrain of rank 0 fails, so that this test will
pass even after fixing the flux-core bug.
@mergify mergify bot closed this as completed in b41f9eb Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants