Add a way to force-cancel rabbit data movement #255

Closed
jameshcorbett opened this issue Jan 28, 2025 · 1 comment · Fixed by #263

Comments

@jameshcorbett
Member

After a rabbit job finishes, an epilog is placed on it while data movement completes, because the job should not be considered finished until its data has been moved. However, that epilog also prevents the job from being canceled and moving to INACTIVE. This is intentional: I thought users might not want to destroy their data just because they canceled their job.

@bdevcich has found that workflows sometimes get stuck in the DataOut state because DCP hangs. When that happens, an admin with Kubernetes write access has to move the workflow to the Teardown state by hand. That is bad.

I see two options here, to ensure manual intervention is never needed:

  1. If a cancel event is received while the job is in the data movement epilog, somehow interpret that as meaning "forget data movement, just get rid of the job" and move the workflow to Teardown.
  2. Add a separate utility, like flux rabbit-cancel, that moves a workflow to Teardown.

@grondo do you have any thoughts?

@jameshcorbett
Member Author

After a discussion in the weekly rabbit call, the consensus seems to be to change the behavior of an ordinary cancel so that Flux terminates data movement.
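
To make the agreed behavior concrete, here is a rough sketch of the decision logic only. It is a toy in-memory model, not the actual plugin code: `RabbitJob`, `WorkflowState`, and `handle_cancel` are hypothetical names, and the real change would have to update the workflow resource in Kubernetes rather than a Python object.

```python
from dataclasses import dataclass, field
from enum import Enum


class WorkflowState(Enum):
    """Only the two workflow states discussed in this issue."""

    DATA_OUT = "DataOut"
    TEARDOWN = "Teardown"


@dataclass
class RabbitJob:
    """Hypothetical stand-in for a job and its rabbit workflow."""

    jobid: int
    workflow_state: WorkflowState = WorkflowState.DATA_OUT
    in_data_movement_epilog: bool = True
    log: list = field(default_factory=list)


def handle_cancel(job: RabbitJob) -> bool:
    """Handle an ordinary cancel for a rabbit job.

    If the job is sitting in the data movement epilog, give up on data
    movement and move the workflow to Teardown. Returns True if data
    movement was abandoned, False if there was nothing to interrupt.
    """
    if (
        job.in_data_movement_epilog
        and job.workflow_state is WorkflowState.DATA_OUT
    ):
        job.workflow_state = WorkflowState.TEARDOWN
        job.log.append("cancel received: abandoning data movement")
        return True
    return False
```

With this shape a hung DCP no longer needs an admin: the user's own cancel moves the workflow along. The cost is the one noted at the top of the issue, which is that canceling during DataOut may discard data that had not been moved yet.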
