[ADD] Allow for immediate pre-emption with correct wandb log rewinding #21

f-dangel · 2024-09-14T15:49:09Z

This PR aims to address #16 and is a DRAFT to discuss problems.

The goal is to allow the Checkpointer to pre-empt immediately when it receives a signal, without writing a new checkpoint (because we may not have enough grace time). This means there can be duplicate log entries if wandb.log was called between the last call of .step and the arrival of the signal.

As a first step to achieve this, we should store wandb.run.step inside a checkpoint. We can then restore that step when loading the latest checkpoint. This PR currently does this using the extra_info option, and I think we can discuss better ways once everything is working.
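To make the idea concrete, here is a minimal, library-free sketch of storing the logging step next to the checkpoint state and reading it back. The function names, the JSON file layout, and the `wandb_step` key are assumptions for illustration, not the Checkpointer's actual API; in real use the step would come from wandb.run.step.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path, state, wandb_step):
    """Write `state` together with the logging step at checkpoint time.

    Hypothetical helper: the real Checkpointer stores this via `extra_info`.
    """
    payload = {"state": state, "extra_info": {"wandb_step": wandb_step}}
    Path(path).write_text(json.dumps(payload))


def load_checkpoint(path):
    """Return the saved state and the logging step to rewind to."""
    payload = json.loads(Path(path).read_text())
    return payload["state"], payload["extra_info"]["wandb_step"]


# Round trip: the step saved with the checkpoint is the step we get back.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    save_checkpoint(ckpt, {"epoch": 3}, wandb_step=120)
    restored_state, restored_step = load_checkpoint(ckpt)
```

Anything logged after step 120 and before the signal would then be what needs rewinding.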
Next, we need to make sure that we correctly rewind the logs. I tried two ways, both of which currently fail, and have one open question:
1. Specifying resume_from=<run_id>?_step=<step> in wandb.init. This does not work out of the box; one must ask wandb support to enable the feature, see here.
2. Setting wandb.run.step manually. This fails with AttributeError: can't set attribute.
3. Are there other ways, like manually deleting logs?
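For option 1, the value passed to resume_from is a single string combining the run ID and the step. A small sketch of building it from the step restored out of the checkpoint (the helper name `rewind_target` and the example run ID are made up for illustration):

```python
def rewind_target(run_id: str, step: int) -> str:
    """Build the `<run_id>?_step=<step>` string expected by `resume_from`.

    Hypothetical helper, not part of wandb or of this PR.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return f"{run_id}?_step={step}"


# "abc123" stands in for a real run ID; 120 for the step stored in the checkpoint.
target = rewind_target("abc123", 120)

# The string would then be passed as wandb.init(resume_from=target),
# which only works once the rewind feature is enabled for the account.
```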
Last, we need to add a flag to the Checkpointer which, if enabled, will make it pre-empt and requeue immediately when receiving a signal.
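A rough sketch of what such a flag could look like, using only the standard library. The class name, the `immediate` flag, the `requeue` callback, and the choice of SIGUSR1 are all assumptions for discussion, not the current Checkpointer interface.

```python
import signal


class ImmediatePreemption:
    """Hypothetical signal handling with an opt-in immediate pre-emption flag."""

    def __init__(self, requeue, signum=signal.SIGUSR1, immediate=True):
        self.requeue = requeue  # callable that resubmits the job
        self.immediate = immediate
        self.signal_received = False
        # Must be called from the main thread.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        self.signal_received = True
        if self.immediate:
            # Skip writing a new checkpoint: requeue and stop right away.
            self.requeue()
            raise SystemExit(0)
        # Otherwise only record the signal; the training loop can check
        # `signal_received` at its next step and checkpoint before exiting.
```

With `immediate=False` the handler only sets the flag, so the existing behaviour of checkpointing at the next .step call would stay available.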

@scottclowe do let me know if you have other ideas how to achieve this and feel free to push to this branch for experimentation.

f-dangel added 2 commits September 14, 2024 11:15

[ADD] Save and restore wandb step

e28ffd2

Try setting wandb step manually

05152ea

f-dangel marked this pull request as draft September 14, 2024 15:49

f-dangel changed the title ~~[DRAFT] Allow for immediate pre-emption with correct wandb log rewinding~~ [ADD] Allow for immediate pre-emption with correct wandb log rewinding Sep 14, 2024

Try using resume_from argument

b9cbf9e
