
Distributed Checkpointing #275


Open

hlnchen wants to merge 3 commits into main

Conversation


@hlnchen hlnchen commented Jun 2, 2025

Three configs are added to control checkpointing:

  • checkpoint_dir: directory for checkpoints; use gs://bucket/path/to/checkpoint to save to a GCS bucket
  • resume_from_checkpoint:
    • if null, no checkpoint is loaded and the model starts from the Hugging Face pretrained weights
    • if a positive integer step, the checkpoint manager tries to find and load weights from the checkpoint under checkpoint_dir/resume_from_checkpoint/
    • if latest, or if the requested step is not found by the manager, the last checkpoint is loaded
  • save_steps: save frequency, in training steps

The checkpoint state dict contains:

  • model
  • optimizer
  • lr_scheduler
  • step

After loading a checkpoint, training skips the first step iterations by looping through the dataloader (see the sketch below).
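A minimal sketch of how the resume resolution and step skipping described above could look. This is not the PR's actual code: the helper names and the checkpoint_dir/<step>/ directory layout are assumptions based on the description, and os.listdir would need a GCS-aware replacement for gs:// paths.

```python
import os
from typing import Optional, Union

def resolve_resume_step(checkpoint_dir: str,
                        resume_from_checkpoint: Union[int, str, None]) -> Optional[int]:
    """Decide which checkpoint step to load, following the rules above.

    Returns None when training should start from the pretrained weights.
    """
    if resume_from_checkpoint is None:
        return None  # no checkpoint: load Hugging Face pretrained weights instead

    # Checkpoints are assumed to live in numbered subdirectories: checkpoint_dir/<step>/
    available = sorted(int(d) for d in os.listdir(checkpoint_dir) if d.isdigit())
    if not available:
        return None

    if isinstance(resume_from_checkpoint, int) and resume_from_checkpoint in available:
        return resume_from_checkpoint
    # "latest", or a requested step that was not found: fall back to the last checkpoint
    return available[-1]

def skip_consumed_batches(dataloader, resumed_step: int):
    """Advance the dataloader past batches consumed before the checkpoint was taken."""
    it = iter(dataloader)
    for _ in range(resumed_step):
        next(it)  # discard batches from steps that already ran
    return it
```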

What's not included:

  • async checkpointing
  • saving in Hugging Face format

Haolin Chen added 2 commits May 29, 2025 16:49

google-cla bot commented Jun 2, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@yaoshiang
Collaborator

Hi Haolin Chen @hlnchen, thanks so much for starting this PR! There are a few things I'd ask for before we can get started on this.

  • Please sign the contributor agreement. You should seek legal advice on this, particularly if these contributions are on behalf of your employer.
  • Please resolve merge conflicts. I know this is a moving target since main changes constantly.
  • Please ensure linting rules are followed. We use ruff almost OOTB, so this hopefully won't be a serious lift.
  • Please include unit testing as well as manual performance testing to demonstrate that this works as intended, particularly on larger models. I realize you used torch_xla's distributed checkpointing functionality; however, I have looked at its tests in the past and, to my knowledge, it was never exercised on a large model. I think there's a high risk of an inadvertent all-reduce in there, which we can really only rule out with performance checks and by patching into the underlying tensors (a rough sketch of such a check follows below).

Thanks so much! Hopefully the above isn't too much work to get a contribution in; we'd love to have your support.
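A minimal sketch of one such timing check, with save_fn standing in as an assumed placeholder for whatever checkpoint save call this PR uses; a save whose wall-clock time scales with the full model size rather than the per-host shard would hint at an unintended gather or all-reduce.

```python
import time

def timed_checkpoint_save(save_fn, state_dict, path):
    """Wrap a checkpoint save call with wall-clock timing.

    save_fn is a placeholder for the distributed checkpoint save used in the PR.
    """
    start = time.perf_counter()
    save_fn(state_dict, path)
    elapsed = time.perf_counter() - start
    print(f"checkpoint save to {path} took {elapsed:.1f}s")
    return elapsed
```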

@vlasenkoalexey
Collaborator

One more comment: please make sure that if checkpointing is not enabled, the PR has no effect on the way models are currently trained. We can relax this limitation later, but for now let's play it safe and make sure that the new functionality doesn't break anything.
