
Step count is not reset when loading a checkpoint and resetting the epoch #1654

Open
mmueller00 opened this issue Nov 26, 2024 · 2 comments

@mmueller00

The current behavior in the torch engine, when starting training from a checkpoint via `import_model_train_epoch1`, is to reset the epoch to 0 but to keep the global train step count from the checkpoint (see `def _load_model(self):` in the torch engine). Is this the expected behavior, or should we reset the step count as well?
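To illustrate the described behavior (a minimal sketch with hypothetical names and dict keys, not the actual RETURNN `_load_model` code):

```python
def get_epoch_and_step_from_checkpoint(checkpoint: dict, import_model_train_epoch1) -> tuple:
    """Hypothetical sketch: with import_model_train_epoch1 set, the epoch
    is reset to 0 (so training starts at epoch 1), but the global train
    step is NOT reset; it is taken from the checkpoint as-is."""
    if import_model_train_epoch1:
        epoch = 0  # epoch is reset
    else:
        epoch = checkpoint["epoch"]
    step = checkpoint["step"]  # step is kept in either case
    return epoch, step
```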

@albertz
Member

albertz commented Nov 26, 2024

For reference on the code, in get_epoch_model, this is the relevant case when training is done and import_model_train_epoch1 is set:

        elif config.value("task", "train") == "train" and import_model_train_epoch1 and start_epoch in [None, 1]:
            epoch_model = (0, import_model_train_epoch1)

And then the start epoch in training is last_epoch + 1, i.e. 1 here.
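That epoch bookkeeping can be sketched as follows (a hypothetical helper; only the `(last_epoch, model_path)` tuple shape is taken from the snippet above):

```python
def get_start_epoch(epoch_model: tuple) -> int:
    """epoch_model is (last_epoch, checkpoint_path), as in the
    get_epoch_model case above; with import_model_train_epoch1 it is
    (0, import_model_train_epoch1), so training starts at epoch 1."""
    last_epoch, _model_path = epoch_model
    return last_epoch + 1
```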

However, there is no such logic for the global train step. It simply takes over whatever step count is stored in the model checkpoint.

(Note, the epoch/step logic in this `_load_model` func is a bit odd anyway, because the `_create_model` call wants to use the actual epoch/step of the model checkpoint.)

So, if we want to change that and also start with step 0, it means:

  • Instead of the `step -= 1`, do:

        if epoch == 0:
            step = 0
        else:
            step -= 1

  • Instead of the `step += 1`, do:

        if epoch != 1:
            step += 1

(Or maybe use start_epoch instead of 1.)
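Putting the two proposed changes together (a hypothetical sketch; the function names, the checkpoint dict layout, and the exact meaning of the -1/+1 bookkeeping are assumptions, not the actual engine code):

```python
def load_step(checkpoint: dict, epoch: int) -> int:
    """On load: normally apply the existing `step -= 1` adjustment,
    but when epoch == 0 (fresh start via import_model_train_epoch1),
    reset the step to 0 instead, as proposed above."""
    step = checkpoint["step"]
    if epoch == 0:
        step = 0
    else:
        step -= 1
    return step


def advance_step(step: int, epoch: int) -> int:
    """On resume: normally apply the existing `step += 1` adjustment,
    but skip it for epoch 1 of a fresh start, so training begins at
    global train step 0."""
    if epoch != 1:
        step += 1
    return step
```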

So the main question is: should we just change this unconditionally, assuming it is good for everyone? Or should we make it an option?

@albertz
Member

albertz commented Nov 26, 2024

Please vote here on this comment, use:

  • 👍: if you think we should just change this without option (changing current behavior), or
  • 👎: if we should add an option for it which is disabled by default (i.e. keeping current behavior if you don't use the option).
