
[ENH] Mark wandb run as preempting as soon as signal received #17

Open · wants to merge 1 commit into main

Conversation

scottclowe (Collaborator) commented:

Mark the job as preempting on wandb as soon as the interruption signal is received, so that if the job suddenly dies, wandb reports it as preempting/preempted rather than crashed/failed. This makes it clear, when reading wandb, which jobs stopped because they were preempted and which died of their own accord. It makes little difference in the setting where we reach the job's time limit and requeue it via SIGUSR1, but matters more when the job is interrupted with little preemption warning due to queue priority.

Note that we still, somewhat redundantly, call wandb.mark_preempting() later in the code, in Checkpointer.preempt_wandb_run. This is just to make sure the job is definitely marked correctly on wandb even if the two calls happen in quick succession.
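
For concreteness, here is a minimal sketch of the behavior this PR adds, assuming a SLURM-style setup where SIGUSR1 is delivered ahead of preemption; the handler name and the registration via the standard-library signal module are illustrative, not the project's actual plumbing:

```python
import signal

import wandb

def _handle_preemption_signal(signum, frame):
    # Immediately tell the wandb backend that this run is being preempted,
    # so a sudden kill is reported as preempting/preempted rather than
    # crashed/failed.
    wandb.mark_preempting()

# Illustrative registration; the project hooks this into its own
# signal-handling machinery.
signal.signal(signal.SIGUSR1, _handle_preemption_signal)
```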

scottclowe requested a review from f-dangel on September 12, 2024
f-dangel (Owner) left a comment:

One problem I see with this approach is that it might lead to non-continuous wandb logs:

Assume a run receives a signal, then trains and logs for a few more steps, but suddenly gets killed before reaching .step, where a new checkpoint would be created. After a while, wandb will consider this run 'Preempted' because there was no heartbeat, and the run will be added back to the agent's queue. But after resuming, loading the last checkpoint will cause duplicate logs, namely those for the steps between that checkpoint and the unexpected kill.

If my concern is valid, then I think it would be better not to introduce the possibility of duplicate logs, but rather to fail in such scenarios.

A similar way to handle such situations is described in #16. But I think the solution proposed there may be better, because the user must actively enable a flag, and we can document that enabling this flag comes with additional responsibilities.

scottclowe (Collaborator, Author) commented:

Yes, that can happen. Good point.

I think we should integrate this change with some proper handling of interrupted jobs.

One way to handle potentially repeated steps could be to have the checkpointer save wandb.run.step to the checkpoint (since wandb.run is a global, we can inspect it without passing any extra arguments to the checkpointer). Then, when loading the checkpoint, we check whether the current wandb.run.step matches the checkpointed value. If it doesn't match (and the user has not enabled some "interrupt with wind-back" option), we throw an error or raise a warning. A sketch of that check follows below.
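
A minimal sketch of that check, assuming a PyTorch-style checkpointer; save_checkpoint, load_checkpoint, and the allow_windback flag are hypothetical names for illustration, not existing names in this repository or in wandb:

```python
import warnings

import torch
import wandb

def save_checkpoint(path, model, optimizer):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # wandb.run is a global, so the checkpointer can read the current
            # step without any extra arguments being passed in.
            "wandb_step": wandb.run.step,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, allow_windback=False):
    data = torch.load(path)
    model.load_state_dict(data["model"])
    optimizer.load_state_dict(data["optimizer"])
    # On resume, wandb.run.step reflects the last step wandb recorded. If it
    # is ahead of the checkpoint, the steps in between would be logged twice.
    if wandb.run.step != data["wandb_step"]:
        msg = (
            f"wandb run is at step {wandb.run.step}, but the checkpoint was "
            f"saved at step {data['wandb_step']}; resuming from it would "
            "duplicate the logs in between."
        )
        if allow_windback:
            warnings.warn(msg)
        else:
            raise RuntimeError(msg)
```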

f-dangel (Owner) commented:

That sounds like a good solution 👍
