StreamingDataset causes NCCL timeout when using multiple nodes #340
Comments
Hi! Thanks for your contribution, great first issue!
Hey @hubenjm. Did you provide …? Could you share a reproducible script or the code of your training dataset?
Hey @hubenjm. Any updates?
@tchaton Thanks for the suggestions. I am currently trying to run my code again while explicitly setting …
OK, that did not work either, so I am going to have to work on creating a simpler code example to share that replicates the problem.
To replicate the problem, you first need to run … Then, after that data is generated in S3, to submit a training job in SageMaker with e.g. 2 nodes, follow the … If you want to run the training code in your own cluster via …, or replace …
NOTE that I only replicated the error using the SageMaker training job approach above, but I don't think there's any significant difference between running it there versus on a self-managed cluster, since under the hood SageMaker will execute a very similar torchrun command to the one above. With the above code example and arguments, I got a softlock at around epoch 5 with 2 nodes. With 1 node it runs fine.
OK, another update.
My next step will be to try getting a multi-node SageMaker training job working with the same code but replacing the dataset/dataloader with the standard torch Dataset and DataLoader classes. If that doesn't work either, then I suppose this issue is moot and the problem is something else. But it would be very useful to a lot of folks in general to be able to use LitData and Lightning effectively with multi-node SageMaker training jobs.
Hey @hubenjm. Could you check the dataset length or the number of batches read on each rank? This can happen if somehow the length wasn't inferred properly and one rank gets more data.
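Something along these lines should be enough to check it (a rough sketch; `train_dataset` and `train_dataloader` are placeholders for your own objects):

```python
import torch.distributed as dist

def log_per_rank_counts(train_dataset, train_dataloader):
    # `train_dataset` / `train_dataloader` are placeholders for your own objects.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] dataset length = {len(train_dataset)}, "
          f"batches per epoch = {len(train_dataloader)}", flush=True)
    # What matters for NCCL collectives is how many batches each rank actually yields.
    yielded = sum(1 for _ in train_dataloader)
    print(f"[rank {rank}] batches actually yielded = {yielded}", flush=True)
```

If one rank reports fewer batches than the others, the remaining ranks will block on the next collective until the NCCL timeout fires.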
Hey @hubenjm. If you are available next week, let's try to reproduce this issue together on Lightning.ai. If I can reproduce it, I can fix it.
@tchaton Sure, I will try to help out. As an update, I ran some more tests a couple of weeks ago and found the following, specific to SageMaker:
I can work on streamlining my code example more to make it easier to work with. My current guess is that the problem lies somehow in how the distributed process group is set up with StreamingDataLoader versus with the standard torch DataLoader. And maybe it has to do with some behind-the-scenes setup that SageMaker does with environment variables and in renaming the hosts to 'algo-1', 'algo-2', etc.
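One cheap way to check that guess is to dump the rendezvous-related environment variables on every process before the process group is created; a rough sketch (nothing LitData-specific, just the usual torchrun/SageMaker variables):

```python
import os
import socket

def dump_dist_env():
    # Print the distributed-setup environment on each process so the values
    # exported on every node (e.g. algo-1 vs algo-2) can be compared.
    keys = ["MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK",
            "WORLD_SIZE", "NODE_RANK", "NCCL_DEBUG"]
    print(f"[{socket.gethostname()}]", {k: os.environ.get(k) for k in keys}, flush=True)
```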
Hey @hubenjm. This happens if the number of batches isn't the same on all ranks. For the training streaming dataset, do you provide …? Yes, a reproducible example would be super helpful.
Yes, I do set …
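For reference, a minimal sketch of the kind of setup being discussed; `drop_last=True` is only an illustration of keeping the per-rank batch counts aligned, and the S3 URI is a placeholder:

```python
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder S3 URI for an optimized dataset.
train_dataset = StreamingDataset(
    "s3://my-bucket/optimized/train",
    shuffle=True,
    drop_last=True,  # drop partial batches so every rank iterates the same number of times
)
train_loader = StreamingDataLoader(train_dataset, batch_size=32, num_workers=4)
```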
litdata_multinode_example_code.tar.gz
From README.md in the attached .tar.gz:
Overview
This code is intended to test the ability to run distributed (DDP) training jobs in SageMaker with multiple nodes using PyTorch Lightning, with or without LitData StreamingDataset as the data source.
Instructions
Thanks @hubenjm. Multi-node on Lightning.AI is much simpler and cheaper than SageMaker; you should give it a try. It also supports fault tolerance with automatic restart. Here are the docs: https://lightning.ai/docs/overview/train-models/multi-node-training. I will try to find some time to look into this. Thanks.
I have the same issue. Is there any fix available for this?
Hi Pavan,
Could you help us with a minimal reproducible script or Lightning Studio?
Also, could you describe the scenario a bit?
I tried reproducing it once but wasn't successful.
Hi Bhimraj, sure. I am trying to pre-train a VLM model. I created an optimized dataset using litdata on EC2; I had to do it in chunks and merge them later. Then I am using SageMaker with p5 instances to train the model, with either FSDP or DeepSpeed (either is fine). The data is mounted at /fsx on the cluster, and I am using a streaming dataset and dataloader. When I select more than one node, I get an NCCL timeout at the ALL_GATHER operation after some training steps, mostly waiting for data, I think. When I change the NCCL_TIMEOUT variable from the default value, it just stays stuck. I also observed that it always happened at a particular index; slightly different if I change the batch size, but always similar. I will try to get a sample script.
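For context, the streaming side of it looks roughly like this (a simplified sketch with placeholder paths, assuming litdata accepts the local FSx mount path directly):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder path to the merged, optimized dataset on the FSx mount.
dataset = StreamingDataset("/fsx/pretrain/optimized")
loader = StreamingDataLoader(dataset, batch_size=8, num_workers=8, drop_last=True)

for batch in loader:
    ...  # hand the batch to the FSDP/DeepSpeed training step
```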
Thank you @kandapagari
Hey @kandapagari. Would you be free for a debugging call? Could you share access to the dataset?
Sure, I can join a call. You can reach me through my email. The dataset itself is on FSx, which I think I cannot give access to directly. I'll see if I can push it into an S3 bucket. Thank you.
Hey @kandapagari. My email is [email protected]. Send me an invitation so we can look into this. Also, you should try the Lightning.AI platform; this would make your life much simpler.
Hey @tchaton, sorry for the late reply. By the way, we figured out the reason for the timeout. When processing the data using compute, the data chunks sometimes get corrupted and cannot be read. When this happens and we try to load that data with a streaming dataloader (even with a timeout set), the timeout doesn't fire and one GPU keeps trying to load the data, while the other GPUs wait for it, which eventually causes the NCCL error. We solved it by reading all the data offline (on a single machine), removing the corrupted chunks (by dropping them from the index.json file), and then loading the data again.
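For anyone hitting the same thing, a rough sketch of one way to do such an offline check; the index.json field names ("chunks", "filename", "chunk_bytes") are assumptions about litdata's on-disk format, so verify them against your own file first:

```python
import json
import os

def drop_bad_chunks(dataset_dir: str) -> None:
    """Drop chunk entries whose files are missing or truncated (size mismatch)."""
    index_path = os.path.join(dataset_dir, "index.json")
    with open(index_path) as f:
        index = json.load(f)

    kept, dropped = [], []
    for chunk in index.get("chunks", []):  # field names are assumptions
        path = os.path.join(dataset_dir, chunk["filename"])
        expected = chunk.get("chunk_bytes")
        ok = os.path.isfile(path) and (expected is None or os.path.getsize(path) == expected)
        (kept if ok else dropped).append(chunk)

    if dropped:
        index["chunks"] = kept
        with open(index_path, "w") as f:
            json.dump(index, f)
    print(f"dropped {len(dropped)} chunk(s)")
```

A size check only catches missing or truncated files; the more thorough version is what was described above, i.e. actually reading every sample once on a single machine and dropping any chunk that fails to read.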
I don't think this is what caused my problem, because my example code posted here worked for a single node just fine. |
Hey @kandapagari. Oh wow, that's super interesting. Any ideas what could have caused the corruption? I have never seen this before.
Hey @hubenjm. Could you try the latest version of LitData? We fixed a few things and I am curious whether this still happens.
🐛 Bug
I'm running a training job with 2 nodes in SageMaker, using torchrun to launch. I'm using a CombinedStreamingDataset for the training dataset with train_weight_factors = [0.8, 0.07, 0.07, 0.07]. The training stops printing log messages after some fixed number of batches (depending on the random seed, I guess); where the training stops is deterministic if the seed is fixed, based on my experiments. Then the NCCL timeout triggers an exception after 30 minutes. The training code works fine on a single node, though.
To Reproduce
Use CombinedStreamingDataset for the training dataset with train_weight_factors not None and iterate_over_all = False. Launch training with torchrun with num_nodes > 1.
Code sample
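(The full example is in the tar.gz attached in the comments; below is only a minimal sketch of the configuration described above, with placeholder S3 URIs.)

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

train_weight_factors = [0.8, 0.07, 0.07, 0.07]

# Placeholder S3 URIs for the four optimized datasets.
datasets = [
    StreamingDataset(f"s3://my-bucket/optimized/train-{i}", shuffle=True)
    for i in range(4)
]

combined = CombinedStreamingDataset(
    datasets=datasets,
    weights=train_weight_factors,
    iterate_over_all=False,
    seed=42,
)
train_loader = StreamingDataLoader(combined, batch_size=32, num_workers=4)
```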
Expected behavior
Training should not softlock in the middle of an epoch.
Environment
How installed (conda, pip, source): SageMaker prebuilt deep learning container (763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker; see https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
Additional context
If you have any other suggestions about why multi-node training with CombinedDataset would fail like this, any help is appreciated.