[ADD] Allow for immediate pre-emption with correct wandb log rewinding #21
This PR aims to address #16 and is a DRAFT to discuss problems.

The goal is to allow the `Checkpointer` to pre-empt immediately when receiving a signal, without writing a new checkpoint (because we may not have enough grace time). This means there can be duplicate runs if `wandb.log` was called between the last call of `.step` and the signal reception.

Save `wandb.run.step` inside a checkpoint. We can then restore that step when loading the latest checkpoint. This PR currently does this using the `extra_info` option, and I think we can discuss better ways once everything is working.

Pass `resume_from=<run_id>?_step=<step>` in `wandb.init`. This does not work out of the box and one must request this feature to be enabled from the `wandb` support, see here.

Set `wandb.run.step` manually. This fails with an `AttributeError: can't set attribute`.

Add an option to `Checkpointer` which, if enabled, will make it pre-empt and requeue immediately when receiving a signal.

@scottclowe do let me know if you have other ideas how to achieve this and feel free to push to this branch for experimentation.
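
To make the discussion concrete, here is a minimal sketch of the step-saving idea, assuming a plain JSON checkpoint file. The helpers `save_checkpoint`, `load_checkpoint` and `rewind_target` are hypothetical names made up for illustration and are not the `Checkpointer` API from this PR. Only the `<run_id>?_step=<step>` string format comes from the description above, and the sketch does not call `wandb` at all.

```python
import json
import os
import tempfile


def save_checkpoint(path, model_state, logged_step):
    """Write a checkpoint that also records the last logged step.

    The "extra_info" key only mirrors the option name mentioned in this PR;
    the layout of the payload is an assumption made for this sketch.
    """
    payload = {
        "model_state": model_state,
        "extra_info": {"wandb_step": logged_step},
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    # Rename into place so an interrupted write never leaves a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (model_state, logged_step) from a checkpoint written above."""
    with open(path) as f:
        payload = json.load(f)
    return payload["model_state"], payload["extra_info"]["wandb_step"]


def rewind_target(run_id, step):
    """Build a string of the form <run_id>?_step=<step>."""
    return f"{run_id}?_step={step}"


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    # In real training code, logged_step would be read from wandb.run.step.
    save_checkpoint(ckpt_path, {"epoch": 3}, logged_step=120)
    state, step = load_checkpoint(ckpt_path)
    print(state, rewind_target("my_run_id", step))
```

After a pre-emption, the restored step is the point to rewind to, so anything logged after the last checkpoint would be discarded rather than duplicated. Whether that rewind can actually be applied depends on the `resume_from` feature being enabled, as noted above.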