[feature request] Native checkpointing to/from s3:// #155992

Open
@vadimkantorov

Description

🚀 The feature, motivation and pitch

Sometimes it's beneficial to stream checkpoint data directly to cloud storage (rather than dump it locally and have some background process handle sync/upload/cleanup), or to load checkpoint weights from an s3:// or gs:// path. I wonder if this could also be related to the recent native HFStorageReader support #154518

I also found that the torchsnapshot library (seems abandoned now - last commit 6 months ago) supports this: https://docs.pytorch.org/torchsnapshot/main/getting_started.html


There also exists a fairly popular package, fsspec, which itself wraps some cloud storage libraries and provides caching functionality. There was some discussion in torchsnapshot on supporting it natively:

And some older (before HF) discussions:

I wonder if some HF utils for checkpointing / the HF Hub blob caching structure could be upstreamed to PyTorch. E.g. for loading pretrained weights, this should be good. And maybe some hf:// management/caching could be made to plug into the fsspec interface? Then hf:// could be used in all fsspec-using places.
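
To make the request concrete, here is a minimal sketch of what URL-based checkpointing through fsspec might look like today for a single-file checkpoint. The helpers `save_checkpoint` and `load_checkpoint` are hypothetical names, not an existing PyTorch API. The `memory://` URL is a stand-in so the snippet runs without credentials; swapping in an `s3://` or `gs://` URL is assumed to work the same way once the matching fsspec backend (e.g. s3fs or gcsfs) is installed and configured.

```python
import fsspec
import torch


def save_checkpoint(state_dict, url, **storage_options):
    """Hypothetical helper: write a state dict to any fsspec-supported URL."""
    # fsspec.open returns a file-like object, which torch.save accepts.
    with fsspec.open(url, "wb", **storage_options) as f:
        torch.save(state_dict, f)


def load_checkpoint(url, **storage_options):
    """Hypothetical helper: read a state dict from any fsspec-supported URL."""
    with fsspec.open(url, "rb", **storage_options) as f:
        # weights_only=True restricts unpickling to tensors and plain containers.
        return torch.load(f, map_location="cpu", weights_only=True)


model = torch.nn.Linear(4, 2)

# In-memory filesystem used as a stand-in for s3://bucket/key (assumption).
url = "memory://linear.pt"
save_checkpoint(model.state_dict(), url)
restored = load_checkpoint(url)

# Load the restored weights into a fresh module of the same shape.
fresh = torch.nn.Linear(4, 2)
fresh.load_state_dict(restored)
```

This only covers a plain `torch.save` / `torch.load` round trip through a file-like object; it does not address sharded or distributed checkpoints, caching, or how much of the payload a given backend buffers before uploading.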

Alternatives

No response

Additional context

No response

cc @mruberry @mikaylagawarecki @LucasLLC @pradeepfn @MeetVadakkanchery @mhorowitz @ekr0

Labels

feature - A request for a proper, new feature.
module: serialization - Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects
oncall: distributed checkpointing - Oncall label should be attached to any issues related to distributed checkpointing.
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
