Monolithic checkpointing #3876
Did you have a chance to run an e2e test with fine-tuning?
@bowenyang008
Just left a few questions and LGTM!
WIP
figured out where things are going wrong
fixed some things
formatted
works?
added comment
some minor changes
added some changes
made some minor changes
added some logging
made another update
undid previous change
some more changes
printing average of state
checking difference before and after wrapping
checking sync_module_states
additional logging
printing out valid params
Added monolithic checkpointing for FSDP2
Tested: test runs are in this comment. Also added unit tests that check tied weights (although in general, tied modules are invalid and we raise an error for them).
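For readers unfamiliar with the tied-weights check mentioned above, here is a minimal sketch of how such a check could look. It is not the code from this PR: the function names `find_tied_parameters` and `assert_no_tied_parameters` are hypothetical, and the sketch assumes that "tied" means two or more parameter names resolving to the same underlying object.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def find_tied_parameters(named_params: Iterable[Tuple[str, Any]]) -> List[List[str]]:
    """Return groups of names that point at the same parameter object.

    Hypothetical helper for illustration; not part of any library.
    """
    # Materialize the pairs so every object stays alive while ids are compared.
    pairs = list(named_params)
    names_by_id: Dict[int, List[str]] = defaultdict(list)
    for name, param in pairs:
        names_by_id[id(param)].append(name)
    # Only objects reachable under more than one name count as tied.
    return [sorted(names) for names in names_by_id.values() if len(names) > 1]


def assert_no_tied_parameters(named_params: Iterable[Tuple[str, Any]]) -> None:
    """Raise ValueError if any parameter is shared under several names."""
    tied = find_tied_parameters(named_params)
    if tied:
        raise ValueError(f"Tied parameters are not supported: {tied}")
```

With PyTorch, the input could come from `model.named_parameters(remove_duplicate=False)`, which yields a shared tensor once per name instead of deduplicating it. Whether the PR performs its check this way is not stated in the conversation.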