node is drained if job is canceled during epilog #5319
Hm, that doesn't seem right. We should probably look into it. Is the instance exiting at this point? Doesn't the broker kill all subprocesses at exit, so perhaps this is what we're seeing? Or has this reproduced with a non-exiting instance?
It reproduces with a non-exiting instance. The test added in flux-framework/flux-sched#1044 cancels jobs in the middle and has to undrain in order to continue.
Oh, oops. The perilog module terminates the job prolog on a job exception, but the handler doesn't ensure it is the prolog that is running and not the epilog.

```diff
diff --git a/src/modules/job-manager/plugins/perilog.c b/src/modules/job-manager/plugins/perilog.c
index dd8cbc400..2b8373b23 100644
--- a/src/modules/job-manager/plugins/perilog.c
+++ b/src/modules/job-manager/plugins/perilog.c
@@ -460,6 +460,7 @@ static int exception_cb (flux_plugin_t *p,
     if ((proc = flux_jobtap_job_aux_get (p,
                                          FLUX_JOBTAP_CURRENT_JOB,
                                          "perilog_proc"))
+        && proc->prolog
         && flux_subprocess_state (proc->sp) == FLUX_SUBPROCESS_RUNNING) {
         if (prolog_kill (proc) < 0
             || prolog_kill_timer_start (proc,
```

I also notice the prolog will be terminated on any job exception, since the exception handler doesn't check the severity of the exception. I'll try to create a PR to fix both these issues. Sorry!
Problem: The job-manager 'perilog' plugin terminates any running prolog if the job receives an exception. This works as expected, but the code doesn't check whether the running process is the prolog or the epilog, so job epilogs are killed on job exception as well. This is not desired behavior, since it prevents the epilog from completing properly and also causes the nodes on which the epilog is running to be drained. Ensure that the running process is a prolog before killing it. Fixes flux-framework#5319
Problem: Once flux-framework/flux-core#5319 is fixed, broker ranks will not be drained if the epilog is running when a job is canceled. However, this causes t1024-alloc-check.t to fail since `flux resource undrain 0` will fail. Don't fail if the undrain of rank 0 fails, so that this test will pass even after fixing the flux-core bug.
Problem: t2304-sched-simple-alloc-check.t logs the following during cleanup:

I had been ignoring this as an as-designed feature, but thought maybe I'd better get the opinion of others in case this is a potential gotcha on a production cluster.