
[ENH] Mark wandb run as preempting as soon as signal received #17

Open · wants to merge 1 commit into main

Conversation

scottclowe (Collaborator) commented:

Mark the job as preempting on wandb as soon as the interruption signal is received, so that if the job suddenly dies, wandb reports it as preempting/preempted rather than crashed/failed. This makes it clear, when reading wandb, which jobs stopped because they were preempted and which died of their own accord. It makes little difference in the setting where we reach the job's time limit and requeue it via SIGUSR1, but matters more when the job is interrupted with little preemption warning due to queue priority.

Note that we still, somewhat redundantly, call wandb.mark_preempting() later in the code, in Checkpointer.preempt_wandb_run. This is just to make sure the job is definitely marked correctly on wandb even if the two calls happen in quick succession.
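
For concreteness, here is a minimal sketch of the behavior this PR adds, assuming a SLURM-style setup where SIGUSR1 is delivered ahead of preemption; the handler name and the registration via the standard-library signal module are illustrative, not the project's actual plumbing:

```python
import signal

import wandb

def _handle_preemption_signal(signum, frame):
    # Immediately tell the wandb backend that this run is being preempted,
    # so a sudden kill is reported as preempting/preempted rather than
    # crashed/failed.
    wandb.mark_preempting()

# Illustrative registration; the project hooks this into its own
# signal-handling machinery.
signal.signal(signal.SIGUSR1, _handle_preemption_signal)
```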

scottclowe requested a review from f-dangel on September 12, 2024
f-dangel (Owner) left a comment:

One problem I see with this approach is that it might lead to non-continuous wandb logs:

Assume a run receives a signal, then trains and logs for a few more steps, but suddenly gets killed before reaching .step, where a new checkpoint would be created. After a while, wandb will consider this run 'Preempted' because there was no heartbeat, and the run will be added back to the agent's queue. But after resuming, loading the last checkpoint will cause duplicate logs, namely those for the steps between that checkpoint and the unexpected kill.

If my concern is valid, then I think it would be better not to introduce the possibility of duplicate logs, but rather to fail in such scenarios.

A similar way to handle such situations is described in #16. But I think the solution proposed there may be better, because the user must actively enable a flag, and we can document that enabling this flag comes with additional responsibilities.

scottclowe (Collaborator, Author) commented:

Yes, that can happen. Good point.

I think we should integrate this change with some proper handling of interrupted jobs.

One way to handle potentially repeated steps could be to have the checkpointer save wandb.run.step to the checkpoint (since wandb.run is a global, we can inspect it without passing any extra arguments to the checkpointer). Then, when loading the checkpoint, we check whether the current wandb.run.step matches the checkpointed value. If it doesn't match (and the user has not enabled some "interrupt with wind-back" option), we throw an error or raise a warning. A sketch of that check follows below.
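
A minimal sketch of that check, assuming a PyTorch-style checkpointer; save_checkpoint, load_checkpoint, and the allow_windback flag are hypothetical names for illustration, not existing names in this repository or in wandb:

```python
import warnings

import torch
import wandb

def save_checkpoint(path, model, optimizer):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # wandb.run is a global, so the checkpointer can read the current
            # step without any extra arguments being passed in.
            "wandb_step": wandb.run.step,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, allow_windback=False):
    data = torch.load(path)
    model.load_state_dict(data["model"])
    optimizer.load_state_dict(data["optimizer"])
    # On resume, wandb.run.step reflects the last step wandb recorded. If it
    # is ahead of the checkpoint, the steps in between would be logged twice.
    if wandb.run.step != data["wandb_step"]:
        msg = (
            f"wandb run is at step {wandb.run.step}, but the checkpoint was "
            f"saved at step {data['wandb_step']}; resuming from it would "
            "duplicate the logs in between."
        )
        if allow_windback:
            warnings.warn(msg)
        else:
            raise RuntimeError(msg)
```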

f-dangel (Owner) commented:

That sounds like a good solution 👍
