Support multi-process/multi-node sharding for S3IterableDataset
#53
Comments
Related pull request for Megatron: NVIDIA/Megatron-LM#729
The torchdata
The changes were released as part of v1.3.0.
👋 folks, you may want to edit the documentation at https://github.com/awslabs/s3-connector-for-pytorch/blob/main/examples/Getting%20started%20with%20the%20Amazon%20S3%20Connector%20for%20PyTorch.ipynb
Hi @noepionentrust, good catch: we will update the documentation in an upcoming revision. Thanks!
We currently don't have a built-in way to do sharding for `S3IterableDataset`, so every worker process in a `DataLoader` will see the same stream of objects. We should have a way to do this.

In the meantime, something like this from `torchdata` will work as a workaround:
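A minimal, library-agnostic sketch of the underlying idea, not the `torchdata` snippet referred to above: each (node rank, worker) pair takes a disjoint round-robin slice of the object stream. The helper name `shard_stream` and its parameters are hypothetical. In a real job, `rank` and `world_size` would presumably come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`, and `worker_id` and `num_workers` from `torch.utils.data.get_worker_info()` inside the dataset's `__iter__`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(
    items: Iterable[T],
    rank: int = 0,
    world_size: int = 1,
    worker_id: int = 0,
    num_workers: int = 1,
) -> Iterator[T]:
    """Return an iterator over the items owned by one (rank, worker) pair.

    Hypothetical helper: the values are passed in explicitly so the
    sketch runs without torch installed.
    """
    if not (0 <= rank < world_size and 0 <= worker_id < num_workers):
        raise ValueError("rank or worker_id is out of range")
    # Flatten (rank, worker_id) into a single global shard index.
    shard = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    # Round-robin split: shard k takes items k, k + num_shards, ...
    return islice(items, shard, None, num_shards)


if __name__ == "__main__":
    # Stand-in for the stream of S3 objects (assumed data).
    keys = [f"obj-{i}" for i in range(10)]
    for rank in range(2):
        for worker_id in range(2):
            part = list(shard_stream(keys, rank, 2, worker_id, 2))
            print(rank, worker_id, part)
```

Every process still enumerates the full stream and skips the items it does not own, so this avoids duplicate samples but not duplicate listing work. Shards can also differ in length by one item, which may matter for distributed training loops that expect the same number of steps on every rank.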