Allow checkpointer to immediately preempt and requeue when receiving a signal #16

f-dangel · 2024-09-12T18:31:41Z

We could add a flag `preempt_immediately=False` to `Checkpointer`. If set to `True`, the checkpointer would, upon receiving the signal, immediately mark the run as preempted and requeue the job without saving a checkpoint. This would be useful on clusters that do not guarantee a preemption-free initial period.
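
To make the proposed behavior concrete, here is a minimal sketch of how such a flag could change the signal handling. `SketchCheckpointer`, its `events` list, and the method names other than `step` are hypothetical stand-ins rather than the library's actual implementation, and the choice of `SIGTERM` is an assumption (the cluster may be configured to send a different signal).

```python
import signal


class SketchCheckpointer:
    """Hypothetical stand-in that only records what it would do."""

    def __init__(self, preempt_immediately=False, signum=signal.SIGTERM):
        self.preempt_immediately = preempt_immediately
        self.signum = signum  # assumption: which signal announces preemption
        self.signal_received = False
        self.preempted = False
        self.events = []  # records actions instead of performing them

    def install(self):
        # Register the handler; this must be called from the main thread.
        signal.signal(self.signum, self._on_signal)

    def _on_signal(self, signum, frame):
        self.signal_received = True
        if self.preempt_immediately:
            # Proposed behavior: do not wait for the next checkpoint.
            self._preempt(save=False)

    def _preempt(self, save):
        if self.preempted:
            return
        if save:
            self.events.append("save_checkpoint")
        self.events.append("mark_preempted")
        self.events.append("requeue")
        self.preempted = True

    def step(self):
        # Default behavior: react to a pending signal only at the next
        # checkpointing step, after saving a checkpoint.
        if self.signal_received and not self.preempt_immediately:
            self._preempt(save=True)
```

With `preempt_immediately=True`, the handler itself performs the preemption, so no checkpoint is written for the interrupted epoch. With the default `False`, the signal is only noted and acted upon at the next call to `step`.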

In this case, syncing the Weights & Biases logs with those of the resumed run is the user's responsibility. But we could provide an example tutorial demonstrating how to do it (I think one would store the wandb step in the checkpoint using the `extra_info` option in `.step`, and then retrieve it with `.load_latest_checkpoint`).
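
The idea of carrying the logging step inside the checkpoint could look roughly like the sketch below. All helpers here (`save_checkpoint`, `load_latest_checkpoint`, `train`, the JSON file layout, and the `"wandb_step"` key) are hypothetical stand-ins written for illustration; they do not reproduce the signatures of the library's `.step` or `.load_latest_checkpoint`. Appending to a list stands in for a call like `wandb.log(metrics, step=wandb_step)`, and how W&B treats re-logged steps in a resumed run is outside this sketch.

```python
import json
from pathlib import Path


def save_checkpoint(directory, epoch, extra_info):
    """Write one checkpoint file per finished epoch (hypothetical helper)."""
    path = Path(directory) / f"epoch_{epoch:08d}.json"
    path.write_text(json.dumps({"epoch": epoch, "extra_info": extra_info}))
    return path


def load_latest_checkpoint(directory):
    """Return (next_epoch, extra_info), or (0, {}) if nothing was saved."""
    paths = sorted(Path(directory).glob("epoch_*.json"))
    if not paths:
        return 0, {}
    data = json.loads(paths[-1].read_text())
    return data["epoch"] + 1, data["extra_info"]


def train(directory, num_epochs, steps_per_epoch, preempt_after_steps=None):
    """Run (or resume) training and return the list of steps that were logged."""
    start_epoch, extra_info = load_latest_checkpoint(directory)
    # Resume the logging step from the checkpoint instead of restarting at 0.
    wandb_step = extra_info.get("wandb_step", 0)
    logged = []
    steps_done = 0
    for epoch in range(start_epoch, num_epochs):
        for _ in range(steps_per_epoch):
            if steps_done == preempt_after_steps:
                # Simulated immediate preemption: exit without checkpointing.
                return logged
            logged.append(wandb_step)  # stands in for wandb.log(..., step=wandb_step)
            wandb_step += 1
            steps_done += 1
        # Store the next logging step next to the epoch counter.
        save_checkpoint(directory, epoch, {"wandb_step": wandb_step})
    return logged
```

If a run is preempted in the middle of an epoch, the resumed run restarts that epoch from the step stored in the last checkpoint, so the steps logged after that checkpoint are logged again under the same step values.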

f-dangel added the enhancement New feature or request label Sep 12, 2024

This was referenced Sep 13, 2024

[ENH] Mark wandb run as preempting as soon as signal received #17

Open

[ADD] Allow for immediate pre-emption with correct wandb log rewinding #21

Draft
