Monolithic checkpointing #3876

Merged: 4 commits, Jun 26, 2025

Conversation

@rithwik-db (Contributor) commented Jun 11, 2025

Added monolithic checkpointing for FSDP2

Tested: the test runs are in this comment. Also added unit tests that check tied weights (although, in general, tied modules are invalid and we raise an error for them).

@rithwik-db force-pushed the monolithic-checkpoiting branch 2 times, most recently from 5d9b3fc to c26fa2d (June 13, 2025 00:14)
@rithwik-db changed the title from "[WIP] Monolithic checkpointing" to "Monolithic checkpointing" (Jun 13, 2025)
@bowenyang008 (Contributor) left a comment

Did you have a chance to run an e2e test with finetuning?

@rithwik-db force-pushed the monolithic-checkpoiting branch from c26fa2d to 465ad45 (June 17, 2025 03:16)
@rithwik-db force-pushed the monolithic-checkpoiting branch from 0fcbe52 to 1eeee6f (June 25, 2025 02:41)
@rithwik-db (Contributor, Author) commented

@bowenyang008 [image] <- mpt-125m-fsdp2-monolithic-resumption-J6JwLr: it looks like resumption is working correctly, and the llama finetune regression went as expected as well (llama2-finetune-regression-LbPI17).

@rithwik-db requested a review from bowenyang008 (June 25, 2025 03:09)

@bowenyang008 (Contributor) left a comment

Just left a few questions and LGTM!

WIP

figured out where things are going wrong

fixed some things

formatted

works?

added comment

some minor changes

added some changes

made some minor changes

added some logging

made another update

undid previous change

some more changes

printing average of state

checking difference before and after wrapping

checking sync_module_states

additional logging

printing out valid params

@rithwik-db force-pushed the monolithic-checkpoiting branch from 1eeee6f to 3dedd6a (June 26, 2025 00:59)
@rithwik-db enabled auto-merge (squash) (June 26, 2025 05:18)
@rithwik-db merged commit 39f118a into main (Jun 26, 2025)
13 checks passed
@rithwik-db deleted the monolithic-checkpoiting branch (June 26, 2025 17:50)
2 participants