After a rabbit job finishes, an epilog is placed on it while data movement completes, because the job should not be considered finished until its data has been moved. However, that epilog also prevents the job from being canceled and moving to INACTIVE. This is intentional: I thought users might not want to destroy their data just because they canceled their job.
@bdevcich has found that workflows sometimes get stuck in the `DataOut` stage because DCP hangs. When that happens, an admin with Kubernetes write access has to move the workflow to the `Teardown` state manually. That is bad.
I see two options here to ensure manual intervention is never needed:

1. If a `cancel` event is received while the job is in the data movement epilog, somehow interpret that as meaning "forget data movement, just get rid of the job" and move the workflow to `Teardown`.
2. Add a separate utility, like `flux rabbit-cancel`, that moves a workflow to `Teardown`.
After a discussion in the weekly rabbit call, it seems the consensus is to change the behavior of an ordinary cancel to make Flux terminate data movement.
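As a rough illustration of the agreed behavior, here is a minimal sketch of the decision a cancel handler would make. It is a pure-Python model, not the actual Flux or DWS code: `RabbitJob`, `WorkflowState`, and `handle_cancel` are hypothetical names invented for this example, and the real change would need to go through whatever mechanism updates the workflow's desired state.

```python
from dataclasses import dataclass
from enum import Enum


class WorkflowState(Enum):
    """Only the two workflow states discussed in this issue."""
    DATA_OUT = "DataOut"
    TEARDOWN = "Teardown"


@dataclass
class RabbitJob:
    """Hypothetical stand-in for a job and its associated workflow."""
    jobid: int
    workflow_state: WorkflowState
    epilog_active: bool = True  # data movement epilog still held on the job


def handle_cancel(job: RabbitJob) -> bool:
    """Treat a cancel during the data movement epilog as "abandon data movement".

    Returns True if the workflow was moved to Teardown, False if the
    cancel required no workflow change (assumed no-op for this sketch).
    """
    if job.epilog_active and job.workflow_state is WorkflowState.DATA_OUT:
        # Forget data movement and let the workflow proceed to cleanup.
        job.workflow_state = WorkflowState.TEARDOWN
        return True
    return False


job = RabbitJob(jobid=1234, workflow_state=WorkflowState.DATA_OUT)
moved = handle_cancel(job)
print(moved, job.workflow_state.value)
```

A second cancel on the same job returns `False` in this sketch, so repeated cancels would not re-trigger the transition.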
@grondo do you have any thoughts?