Allow checkpointer to immediately preempt and requeue when receiving a signal #16

Open
f-dangel opened this issue Sep 12, 2024 · 0 comments
Labels
enhancement New feature or request

Owner

f-dangel commented Sep 12, 2024

We could add a flag preempt_immediately=False to Checkpointer. If set to True, the checkpointer would, upon receiving the signal, immediately mark the run as preempted and requeue the job without saving a checkpoint. This would be useful on clusters that do not guarantee a preemption-free initial period.
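To make the proposed control flow concrete, here is a minimal sketch. It is not the library's implementation: `SketchCheckpointer`, its constructor arguments, and the injected `mark_preempted` and `requeue` callbacks are all hypothetical stand-ins, and only the flag name `preempt_immediately` comes from the proposal above. It assumes a Unix system where `SIGUSR1` is the pre-emption signal.

```python
import signal
from typing import Callable


class SketchCheckpointer:
    """Toy stand-in for the checkpointer (hypothetical, not the real class)."""

    def __init__(
        self,
        mark_preempted: Callable[[], None],
        requeue: Callable[[], None],
        preempt_immediately: bool = False,
    ) -> None:
        # Both callbacks are assumptions: in a real setup they might mark the
        # Weights & Biases run as preempted and ask the scheduler to requeue.
        self.mark_preempted = mark_preempted
        self.requeue = requeue
        self.preempt_immediately = preempt_immediately
        self.signal_received = False
        self.num_saved = 0

    def install(self, signum: int = signal.SIGUSR1) -> None:
        """Register the handler for the pre-emption signal (main thread only)."""
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        self.signal_received = True
        if self.preempt_immediately:
            # Proposed behaviour: do not wait for the next step, skip saving.
            self._preempt(save=False)

    def step(self) -> None:
        """Called at the end of an epoch; handles a pending signal if any."""
        if self.signal_received and not self.preempt_immediately:
            self._preempt(save=True)

    def save_checkpoint(self) -> None:
        self.num_saved += 1  # placeholder for writing model/optimizer state

    def _preempt(self, save: bool) -> None:
        if save:
            self.save_checkpoint()
        self.mark_preempted()
        self.requeue()
```

With the flag set to True, the handler itself triggers marking and requeueing. With the default False, the signal is only recorded and acted upon at the next call to `step`, after a checkpoint has been written.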

In this case, it is up to the user to properly sync the Weights & Biases logs with those from the resumed run. But we could add an example tutorial to demonstrate how this works (I think one would store the wandb step in the checkpoint using the extra_info option in .step, then retrieve it with .load_latest_checkpoint).
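A rough sketch of that idea follows. `ToyCheckpointer` is a hypothetical, file-based stand-in written for illustration. Only the names `.step`, `extra_info`, and `.load_latest_checkpoint` are taken from the text above, and their signatures and return values here are assumptions. The `logged` list stands in for calls to the logger.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


class ToyCheckpointer:
    """Hypothetical stand-in that stores one JSON file per finished epoch."""

    def __init__(self, savedir: str) -> None:
        self.savedir = Path(savedir)
        self.epoch = 0

    def step(self, extra_info: Optional[dict] = None) -> None:
        """Write a checkpoint for the current epoch, including extra_info."""
        path = self.savedir / f"epoch_{self.epoch:08d}.json"
        payload = {"epoch": self.epoch, "extra_info": extra_info or {}}
        path.write_text(json.dumps(payload))
        self.epoch += 1

    def load_latest_checkpoint(self) -> dict:
        """Restore the epoch counter and return the stored extra_info."""
        paths = sorted(self.savedir.glob("epoch_*.json"))
        if not paths:
            return {}
        data = json.loads(paths[-1].read_text())
        self.epoch = data["epoch"] + 1
        return data["extra_info"]


def train(savedir: str, num_epochs: int, steps_per_epoch: int, logged: List[int]) -> None:
    checkpointer = ToyCheckpointer(savedir)
    extra_info = checkpointer.load_latest_checkpoint()
    # Resume the logging step from the checkpoint, or start at 0.
    wandb_step = extra_info.get("wandb_step", 0)
    for _ in range(checkpointer.epoch, num_epochs):
        for _ in range(steps_per_epoch):
            # A real run would log here with an explicit step argument.
            logged.append(wandb_step)
            wandb_step += 1
        checkpointer.step(extra_info={"wandb_step": wandb_step})


savedir = tempfile.mkdtemp()
logged: List[int] = []
train(savedir, num_epochs=2, steps_per_epoch=3, logged=logged)  # interrupted run
train(savedir, num_epochs=4, steps_per_epoch=3, logged=logged)  # resumed run
```

Because the step counter travels with the checkpoint, the resumed run picks up at the value that was current when the last checkpoint was written. Anything logged after that checkpoint and before the pre-emption would be repeated from that step onward.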
