job-manager: epilog should prevent clean event #5345

Closed
jameshcorbett opened this issue Jul 19, 2023 · 4 comments · Fixed by #5353

@jameshcorbett
Member

As soon as a rabbit job is submitted, flux-coral2 plugins create some external resources for it. When the job reaches the cleanup state, a flux-coral2 jobtap plugin places an epilog on the job to clean up those resources. The epilog-start event initiates the cleanup, and the epilog-finish event indicates that the cleanup has completed. However, when the job never has resources allocated to it (because of an exception), the epilog doesn't prevent the job from moving to the inactive state, and the jobtap plugin hits an error when it tries to finish the epilog.

On the exception path, the epilog isn't strictly necessary, and I could add logic to the plugins to avoid placing the epilog at all (even though the external cleanup would still be required, we could do it asynchronously). But it would be convenient to be able to handle both cases the same way.

When an exception occurs, the eventlog may look like:

1689787285.137467 epilog-start description="dws-epilog"
1689787285.137490 clean

whereas through the normal path it looks like:

1689787300.987355 start
1689787301.106616 finish status=0
1689787301.106683 epilog-start description="dws-epilog"
1689787301.111610 release ranks="all" final=true
1689787315.361298 epilog-finish description="dws-epilog" status=0
1689787315.366933 free
1689787315.367069 clean
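
To make the difference between the two eventlogs concrete, here is a minimal Python sketch that parses eventlog lines in the text form shown above and reports which epilogs are still open when `clean` is posted. The helper names (`parse_eventlog`, `epilogs_open_at_clean`) are hypothetical and are not part of flux-core or flux-coral2; the parser only assumes the `timestamp name key=value ...` layout seen in this issue.

```python
import re

def parse_eventlog(text):
    """Parse 'timestamp name key=value ...' lines into (timestamp, name, context) tuples."""
    events = []
    for line in text.strip().splitlines():
        parts = line.split(None, 2)
        timestamp, name = float(parts[0]), parts[1]
        context = {}
        if len(parts) > 2:
            # Values are either double-quoted strings or bare tokens.
            for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', parts[2]):
                context[key] = value.strip('"')
        events.append((timestamp, name, context))
    return events

def epilogs_open_at_clean(events):
    """Return descriptions of epilogs started but not finished when 'clean' is posted."""
    open_epilogs = set()
    for _, name, context in events:
        if name == "epilog-start":
            open_epilogs.add(context["description"])
        elif name == "epilog-finish":
            open_epilogs.discard(context["description"])
        elif name == "clean":
            return sorted(open_epilogs)
    return []

# Eventlog excerpts copied from this issue.
EXCEPTION_PATH = """\
1689787285.137467 epilog-start description="dws-epilog"
1689787285.137490 clean
"""

NORMAL_PATH = """\
1689787300.987355 start
1689787301.106616 finish status=0
1689787301.106683 epilog-start description="dws-epilog"
1689787301.111610 release ranks="all" final=true
1689787315.361298 epilog-finish description="dws-epilog" status=0
1689787315.366933 free
1689787315.367069 clean
"""

print(epilogs_open_at_clean(parse_eventlog(EXCEPTION_PATH)))  # ['dws-epilog']
print(epilogs_open_at_clean(parse_eventlog(NORMAL_PATH)))     # []
```

On the exception path the `dws-epilog` is still open when `clean` appears, which is the condition this issue asks the job manager to rule out.
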
@garlick
Member

garlick commented Jul 19, 2023

I wonder if #5321 (merged last week) helps here?

@jameshcorbett
Member Author

I think that's a separate issue, about a specific plugin, and not epilog behavior in general.

@garlick
Member

garlick commented Jul 19, 2023

I did notice that the dws plugin is calling flux_jobtap_epilog_start() from the job.state.cleanup callback, while perilog calls it from job.event.finish. It does seem like either ought to work, but perhaps changing it would be a quick fix, since perilog is pretty well tested (in fact, I just tested that case on my test cluster):

1689793200.125807 submit userid=5588 urgency=16 flags=0 version=1
1689793200.139915 validate
1689793200.153064 depend
1689793200.153162 priority priority=100000
1689793200.695030 alloc
1689793200.695479 prolog-start description="job-manager.prolog"
1689793200.695517 prolog-start description="cray-pals-port-distributor"
1689793200.710578 prolog-finish description="cray-pals-port-distributor" status=0
1689793203.183666 prolog-finish description="job-manager.prolog" status=0
1689793203.223227 start
1689793204.227164 exception type="timeout" severity=0 userid=4294967295 note="resource allocation expired"
1689793204.353298 finish status=36352
1689793204.353770 epilog-start description="job-manager.epilog"
1689793204.385355 release ranks="all" final=true
1689793206.876997 epilog-finish description="job-manager.epilog" status=0
1689793206.877812 free
1689793206.877893 clean

See also: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_21.html

@grondo
Contributor

grondo commented Jul 19, 2023

As discussed with @jameshcorbett offline, currently an outstanding epilog-start event just prevents the free request from being issued to the scheduler. If there is no pending free request, i.e. the free event does not need to be emitted for the job to proceed to inactive, then an epilog-start event has no effect and the clean event is emitted.

As noted in the issue title, an epilog-start event should hold up the clean event as well as preventing a free request to the scheduler. This will allow "epilogs" to be started even for jobs that didn't have resources allocated.
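
The distinction between the current and the proposed behavior can be sketched as a toy model in Python. This is only an illustration of the rule as described in this comment, not the job-manager implementation; `JobModel` and the predicate names are hypothetical, and the model ignores every other condition the job manager may check before posting `clean`.

```python
from dataclasses import dataclass

@dataclass
class JobModel:
    has_resources: bool        # True if the job was allocated resources
    free_posted: bool = False  # True once the free event has been posted
    open_epilogs: int = 0      # epilog-start events without a matching epilog-finish

def can_send_free(job):
    """An open epilog holds back the free request to the scheduler."""
    return job.has_resources and not job.free_posted and job.open_epilogs == 0

def can_post_clean_current(job):
    """Current behavior as described: only a pending free delays clean."""
    if job.has_resources:
        return job.free_posted
    return True  # nothing to free, so an open epilog has no effect

def can_post_clean_proposed(job):
    """Proposed behavior: an open epilog also holds up clean."""
    if job.open_epilogs > 0:
        return False
    return can_post_clean_current(job)

# Job that hit an exception before resources were allocated, with one epilog open.
exception_job = JobModel(has_resources=False, open_epilogs=1)
print(can_post_clean_current(exception_job))   # True: clean is posted too early
print(can_post_clean_proposed(exception_job))  # False: clean waits for epilog-finish

# Job that ran normally, with one epilog open after finish.
normal_job = JobModel(has_resources=True, open_epilogs=1)
print(can_send_free(normal_job))               # False
print(can_post_clean_current(normal_job))      # False: still waiting on free
```

For jobs that hold resources the two predicates agree, because `clean` already waits for `free` and `free` already waits for the epilog. They differ only for jobs that never had resources allocated, which is the case reported here.
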
