
Step count is not reset when loading a checkpoint and resetting the epoch #1654

Open
mmueller00 opened this issue Nov 26, 2024 · 2 comments

@mmueller00

The current behavior in the torch engine, when starting training from a checkpoint via `import_model_train_epoch1`, is to reset the epoch to 0 but to keep the global train step count from the checkpoint (see `def _load_model(self):` in the torch engine). Is this the expected behavior, or should we reset the step count as well?
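To illustrate the described behavior (a minimal sketch with hypothetical names and dict keys, not the actual RETURNN `_load_model` code):

```python
def get_epoch_and_step_from_checkpoint(checkpoint: dict, import_model_train_epoch1) -> tuple:
    """Hypothetical sketch: with import_model_train_epoch1 set, the epoch
    is reset to 0 (so training starts at epoch 1), but the global train
    step is NOT reset; it is taken from the checkpoint as-is."""
    if import_model_train_epoch1:
        epoch = 0  # epoch is reset
    else:
        epoch = checkpoint["epoch"]
    step = checkpoint["step"]  # step is kept in either case
    return epoch, step
```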

@albertz
Member

albertz commented Nov 26, 2024

For reference on the code, in get_epoch_model, this is the relevant case when training is done and import_model_train_epoch1 is set:

        elif config.value("task", "train") == "train" and import_model_train_epoch1 and start_epoch in [None, 1]:
            epoch_model = (0, import_model_train_epoch1)

And then the start epoch in training is last_epoch + 1, i.e. 1 here.
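That epoch bookkeeping can be sketched as follows (a hypothetical helper; only the `(last_epoch, model_path)` tuple shape is taken from the snippet above):

```python
def get_start_epoch(epoch_model: tuple) -> int:
    """epoch_model is (last_epoch, checkpoint_path), as in the
    get_epoch_model case above; with import_model_train_epoch1 it is
    (0, import_model_train_epoch1), so training starts at epoch 1."""
    last_epoch, _model_path = epoch_model
    return last_epoch + 1
```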

However, there is no such logic for the global train step. It simply takes over whatever step count is stored in the model checkpoint.

(Note, the epoch/step logic in this `_load_model` func is a bit odd anyway, because the `_create_model` call wants to use the actual epoch/step of the model checkpoint.)

So, if we want to change that and also start with step 0, it means:

  • Instead of the `step -= 1`, do:

        if epoch == 0:
            step = 0
        else:
            step -= 1

  • Instead of the `step += 1`, do:

        if epoch != 1:
            step += 1

(Or maybe use start_epoch instead of 1.)
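Putting the two proposed changes together (a hypothetical sketch; the function names, the checkpoint dict layout, and the exact meaning of the -1/+1 bookkeeping are assumptions, not the actual engine code):

```python
def load_step(checkpoint: dict, epoch: int) -> int:
    """On load: normally apply the existing `step -= 1` adjustment,
    but when epoch == 0 (fresh start via import_model_train_epoch1),
    reset the step to 0 instead, as proposed above."""
    step = checkpoint["step"]
    if epoch == 0:
        step = 0
    else:
        step -= 1
    return step


def advance_step(step: int, epoch: int) -> int:
    """On resume: normally apply the existing `step += 1` adjustment,
    but skip it for epoch 1 of a fresh start, so training begins at
    global train step 0."""
    if epoch != 1:
        step += 1
    return step
```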

So the main question is: should we just change this unconditionally, assuming it is good for everyone? Or should we make it an option?

@albertz
Member

albertz commented Nov 26, 2024

Please vote here on this comment, use:

  • 👍: if you think we should just change this without option (changing current behavior), or
  • 👎: if we should add an option for it which is disabled by default (i.e. keeping current behavior if you don't use the option).
