Current behavior in the torch engine when using a checkpoint during training via `import_model_train_epoch1` is to reset the epoch to 0 but keep the global train step count of the checkpoint (see `returnn/torch/engine.py`, line 798 in 13640bc).
And then the start epoch in training is `last_epoch + 1`, i.e. 1 here.
However, there is no such logic for the global train step: it simply takes over whatever it finds in the model checkpoint.
(Note, the epoch/step logic in this `_load_model` func is a bit strange, because for the `_create_model` call, it wants to use the right epoch/step of the model checkpoint.)
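For illustration, here is a minimal sketch of the behavior described above; the checkpoint dict layout and the helper function are assumptions for this example, not the actual code at the line linked above:

```python
# Minimal sketch of the assumed current behavior (not the actual _load_model
# code in returnn/torch/engine.py; checkpoint layout and names are made up).
def load_epoch_and_step(checkpoint: dict, import_model_train_epoch1: bool):
    # _create_model is supposed to see the real epoch/step of the checkpoint.
    epoch = checkpoint["epoch"]
    step = checkpoint["step"]
    if import_model_train_epoch1:
        # The epoch is reset, so training then starts at last_epoch + 1 == 1 ...
        epoch = 0
        # ... but there is no corresponding reset of the global train step:
        # it is simply taken over from the checkpoint.
    return epoch, step
```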
So, if we want to change that, and also start with step 0, it means:
Instead of the `step -= 1`, do:

```python
if epoch == 0:
    step = 0
else:
    step -= 1
```
Instead of the `step += 1`, do:

```python
if epoch != 1:
    step += 1
```
(Or maybe use `start_epoch` instead of 1.)
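For completeness, the two snippets pulled together as helper functions, including the `start_epoch` variant; the function names and the places where the engine would call them (the current `step -= 1` / `step += 1` spots) are assumptions for illustration:

```python
# Sketch of the combined proposal; not the actual engine code.

def step_after_checkpoint_load(epoch: int, step: int) -> int:
    """Would replace the plain `step -= 1` after loading a checkpoint."""
    if epoch == 0:
        # Fresh start via import_model_train_epoch1: also reset the step.
        return 0
    return step - 1

def step_increment(epoch: int, step: int, start_epoch: int = 1) -> int:
    """Would replace the plain `step += 1`, using start_epoch instead of a
    hard-coded 1, as suggested above."""
    if epoch != start_epoch:
        return step + 1
    return step
```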
So the main question is: should we just change this, and is that good for everyone? Or should we make it an option?