Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new logic to enable stored checkpoint weights to be copied to new history dimensions #36

Merged
merged 11 commits into from
Sep 23, 2024

Conversation

scottcha
Copy link
Contributor

Here is a possible fix for #35. Currently the new logic attempts to copy the provided weights into any new history dimensions which get allocated by setting max_history_size > 2.

@scottcha
Copy link
Contributor Author

@microsoft-github-policy-service agree

Copy link
Contributor

@wesselb wesselb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @scottcha! Thanks for opening a PR!! I think this is useful feature. :) A few comments:

I think we should not allow the user to decrease the value of max_history_size, as that would very likely break performance. Could you add an assertion for this?

In addition, currently this only works for the pretrained checkpoint, not any checkpoint saved after that. Would you be able to separate the checkpoint-adaptation logic from the max_history_size-adaptation logic, so this would also work for checkpoints saved after that? We might at some point release additional models to which the checkpoint-adaptation logic does not apply, but for which you might want to change max_history_size.

@scottcha
Copy link
Contributor Author

@wesselb that sounds like good suggestions. I'll update the PR addressing those. It might take a couple days based on my availability but its no problem.

@wesselb
Copy link
Contributor

wesselb commented Sep 18, 2024

That would be perfect! Thanks :)

@scottcha
Copy link
Contributor Author

PR is updated with the requested changes.

  1. I scoped the changes to just the suggestions provided and limited to the history size adaptation.
  2. I might recommend that the checkpoint weight adaptation is also refactored to its own method helping make it more testable and overridable. Let me know if you want to see this and I can provide a separate PR scoped to that change.
  3. I provided a new test file for this method and added the relevant tests.
  4. Beyond the unit tests provided I've also validated the functional cases here:
  • AuroraSmall loads with provided small checkpoint with max_history_size=2 (default)
  • AuroraSmall loads with provided small checkpoint with max_history_size=4
  • AuroraSmall can save new checkpoint after loading with max_historysize=4 and that can be reloaded after being serialized
  • AuroraSmall 2 step rollout continues work for 2t after loading checkpoints above
    Also note that one AuroraSmall test case (test_aurora_small) seems to be failing on your main branch and it seems to be unrelated to my changes here. All other tests are passing.
    Let me know if you have any other feedback.
    thanks!

Copy link
Contributor

@wesselb wesselb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottcha Amazing work. I've got a few very minor comments, and then I think this is ready to go once CI passes!

Thank you!!

aurora/model/aurora.py Outdated Show resolved Hide resolved
aurora/model/aurora.py Outdated Show resolved Hide resolved
aurora/model/aurora.py Outdated Show resolved Hide resolved
@scottcha
Copy link
Contributor Author

Ok, pushed changes to address the feedback. Thanks!

@wesselb
Copy link
Contributor

wesselb commented Sep 20, 2024

@scottcha I think this is ready to be merged. Would you be able to run the formatter over the PR once? I think that should fix the formatting check.

The easier way by doing this is by setting up pre-commit and running that on all files. That should perfectly match the formatter ran in the checks.

@scottcha
Copy link
Contributor Author

Got it, handn't noticed the pre-commit component. Pushed changes to resolve those issues.
thanks!

Copy link
Contributor

@wesselb wesselb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, this looks ready to go! Thanks, @scottcha :)

@wesselb wesselb merged commit c694fca into microsoft:main Sep 23, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants