[ADD] Allow for immediate pre-emption with correct wandb log rewinding #21

Draft
wants to merge 3 commits into
base: main

f-dangel
Owner

This PR aims to address #16 and is a DRAFT to discuss problems.

The goal is to allow the Checkpointer to pre-empt immediately when it receives a signal, without writing a new checkpoint (because the grace time may be too short to write one). This means the resumed run can produce duplicate log entries if wandb.log was called between the last call of .step and the arrival of the signal.
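To make the duplication concrete, here is a toy simulation in plain Python (no wandb involved). The function name `simulate` and its parameters are hypothetical and only illustrate the timing: a checkpoint is written every few steps, the signal arrives in between, and the resumed job restarts from the last checkpoint.

```python
def simulate(total_steps, checkpoint_every, signal_at):
    """Return every step index that gets logged across the first and the resumed job."""
    logged = []
    last_checkpoint = 0  # step the resumed job will start from

    # First job: logs each step until the signal arrives.
    for step in range(total_steps):
        if step == signal_at:
            break  # pre-empt immediately, without writing a new checkpoint
        logged.append(step)
        if (step + 1) % checkpoint_every == 0:
            last_checkpoint = step + 1  # checkpoint covers steps < last_checkpoint

    # Resumed job: restarts from the last checkpoint and logs again.
    for step in range(last_checkpoint, total_steps):
        logged.append(step)
    return logged


logged = simulate(total_steps=10, checkpoint_every=4, signal_at=6)
duplicates = sorted({s for s in logged if logged.count(s) > 1})
print(duplicates)  # steps logged twice: [4, 5]
```

Steps 4 and 5 were logged after the last checkpoint but before the signal, so the resumed job logs them a second time unless the logs are rewound to the checkpointed step.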

  • As a first step to achieve this, we should store wandb.run.step inside a checkpoint. We can then restore that step when loading the latest checkpoint. This PR currently does this using the extra_info option, and I think we can discuss better ways once everything is working.
  • Next, we need to make sure that we correctly rewind the logs. There are two ways, both of which currently fail:
    1. Specifying resume_from=<run_id>?_step=<step> in wandb.init. This does not work out of the box; one must ask wandb support to enable the feature, see here.
    2. Setting wandb.run.step manually. This fails with an AttributeError: can't set attribute.
    3. Are there other ways, like manually deleting logs?
  • Last, we need to add a flag to the Checkpointer which, if enabled, will make it pre-empt and requeue immediately when receiving a signal.
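The three steps above could be sketched roughly as follows. This is not the Checkpointer's actual implementation: the helpers `save_checkpoint`, `load_checkpoint`, `rewind_target`, and `make_immediate_handler` are hypothetical, the checkpoint is a plain JSON file, and the `extra_info` key only mirrors the option mentioned above. No wandb call is made; the sketch only builds the `<run_id>?_step=<step>` string that would be passed as `resume_from` to `wandb.init` once that feature is enabled.

```python
import json
import os
import tempfile


def save_checkpoint(path, model_state, logged_step):
    """Store the training state together with the last logged wandb step."""
    payload = {"model_state": model_state, "extra_info": {"wandb_step": logged_step}}
    with open(path, "w") as f:
        json.dump(payload, f)


def load_checkpoint(path):
    """Return the training state and the wandb step saved alongside it."""
    with open(path) as f:
        payload = json.load(f)
    return payload["model_state"], payload["extra_info"]["wandb_step"]


def rewind_target(run_id, step):
    """Build the '<run_id>?_step=<step>' string described in the list above."""
    return f"{run_id}?_step={step}"


def make_immediate_handler(requeue, exit_fn):
    """Create a signal handler that requeues and exits without checkpointing."""

    def handler(signum, frame):
        requeue()   # e.g. ask the scheduler to requeue the job (assumption)
        exit_fn(0)  # leave immediately; no new checkpoint is written

    return handler


# Usage with stand-in values (the run id and the state are made up).
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "latest.json")
    save_checkpoint(ckpt, {"epoch": 3}, logged_step=40)
    state, step = load_checkpoint(ckpt)

resume_from = rewind_target("abc123", step)
print(resume_from)

# The handler would normally be registered with signal.signal(...);
# here it is called directly with fake callbacks to show the order of events.
events = []
handler = make_immediate_handler(
    requeue=lambda: events.append("requeue"),
    exit_fn=lambda code: events.append(f"exit {code}"),
)
handler(None, None)
print(events)
```

Passing the requeue and exit actions in as callables keeps the handler free of scheduler-specific code and makes it possible to exercise it without sending a real signal.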

@scottclowe do let me know if you have other ideas how to achieve this and feel free to push to this branch for experimentation.

f-dangel marked this pull request as draft September 14, 2024 15:49
f-dangel changed the title from "[DRAFT] Allow for immediate pre-emption with correct wandb log rewinding" to "[ADD] Allow for immediate pre-emption with correct wandb log rewinding" Sep 14, 2024