Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: fix prolog/epilog exception handling #5321

Merged
merged 3 commits into from
Jul 10, 2023

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Jul 9, 2023

This PR fixes a couple issues in the handling of exceptions in the perilog plugin:

  • Only kill the prolog when an exception is received and the prolog is running. Ignore exceptions during the epilog.
  • Only kill the prolog for fatal job exceptions, i.e. ignore all job exceptions with severity > 0.

This will require flux-framework/flux-sched#1047 to be merged so that the flux-sched checks won't fail.

Problem: The job-manager 'perilog' plugin terminates any running prolog
if the job receives an exception. This works as expected, but the code
doesn't check whether a running process is the prolog or epilog, so
job epilogs are killed on job exception as well. This isn't desired
behavior since this doesn't allow the epilog to properly complete,
and also causes the nodes on which the epilog is running to be drained.

Ensure that the running process is a prolog before killing it.

Fixes flux-framework#5319
Problem: The perilog plugin terminates a running prolog on any job
exception, but it should only do so for a fatal job exception.

Check the job exception severity in the exception event callback and
only terminate the prolog for exceptions with severity=0.
Problem: No tests in the testsuite ensure that a nonfatal exception
during the job prolog does not terminate the prolog.

Add a test to check this expected behavior.
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks.

@grondo
Copy link
Contributor Author

grondo commented Jul 10, 2023

Thanks! Setting MWP.

@mergify mergify bot merged commit 3ebe807 into flux-framework:master Jul 10, 2023
@grondo grondo deleted the issue#5319 branch July 10, 2023 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants