We could introduce a flag `preempt_immediately=False` to `Checkpointer`. If set to `True`, the checkpointer will, upon receiving the signal, immediately mark the run as preempted and requeue the job without saving a checkpoint. This would be useful for clusters that offer no guarantee of a preemption-free initial period.
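To make the proposed behavior concrete, here is a minimal sketch of how the flag could change the signal handler. `SketchCheckpointer`, `save_fn`, `requeue_fn`, and `install` are hypothetical names used only for illustration, not the library's actual API, and the default branch (defer, save at the next step, then requeue) is an assumption about the current behavior.

```python
import signal


class SketchCheckpointer:
    """Hypothetical stand-in for illustration; not the real Checkpointer."""

    def __init__(self, save_fn, requeue_fn, preempt_immediately=False):
        self.save_fn = save_fn              # assumed callback: writes a checkpoint
        self.requeue_fn = requeue_fn        # assumed callback: requeues the job
        self.preempt_immediately = preempt_immediately
        self.preempted = False
        self._save_requested = False

    def install(self, signum):
        # Register the handler, e.g. install(signal.SIGUSR1) on a POSIX cluster.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        if self.preempt_immediately:
            # Proposed behavior: no checkpoint, mark as preempted and requeue now.
            self.preempted = True
            self.requeue_fn()
        else:
            # Assumed default: defer the work to the next call to step().
            self._save_requested = True

    def step(self):
        # Called once per training step by the training loop.
        if self._save_requested:
            self.save_fn()
            self.preempted = True
            self.requeue_fn()
            self._save_requested = False
```

With `preempt_immediately=True`, any progress made since the last regular checkpoint is lost, which is the trade-off for not relying on a grace period after the signal.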
In this case, syncing the Weights & Biases logs with the logs from the resumed run is the user's responsibility. But we can have an example tutorial to demonstrate how it works (I think one would do this by storing the wandb step in the checkpoint using the `extra_info` option in `.step`, which would then be retrieved with `.load_latest_checkpoint`).
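A rough sketch of what such a tutorial might show follows. `ToyCheckpointer` is an in-memory stand-in written for this example. Only the names `step`, `extra_info`, and `load_latest_checkpoint` come from the proposal; their signatures and return values here are assumptions. `log_fn` is a hypothetical placeholder for a call such as `wandb.log(metrics, step=step)`.

```python
class ToyCheckpointer:
    """In-memory stand-in; the real method signatures may differ."""

    def __init__(self):
        self._latest = None

    def step(self, state, extra_info=None):
        # Keep a copy of the state together with the user-supplied extra_info.
        self._latest = (dict(state), dict(extra_info or {}))

    def load_latest_checkpoint(self):
        # Assumed to return None when no checkpoint exists yet.
        return self._latest


def train(ckpt, total_steps, log_fn, stop_after=None):
    loaded = ckpt.load_latest_checkpoint()
    if loaded is None:
        state, start = {"weights": 0.0}, 0
    else:
        state, extra = loaded
        # Resume logging right after the last step stored in the checkpoint.
        start = extra["wandb_step"] + 1

    for step in range(start, total_steps):
        state["weights"] += 1.0                    # placeholder training update
        log_fn({"loss": 1.0 / (step + 1)}, step)   # e.g. wandb.log(metrics, step=step)
        ckpt.step(state, extra_info={"wandb_step": step})
        if stop_after is not None and step == stop_after:
            return state                           # simulate a preemption
    return state
```

Running `train` once with `stop_after=1` and then again on the same `ToyCheckpointer` continues logging from step 2, so the resumed run's step axis lines up with the original one. In a real setup, the resumed process would also need to reattach to the same W&B run, for example through the `id` and `resume` arguments of `wandb.init`.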