training hangs with lightning ddp and cloud dir? #408
Comments
Hi! Thanks for your contribution, great first issue!
Hi @rxqy, thanks for opening the issue. A similar issue is also open for SageMaker. We're looking into it and will try to fix it ASAP.
@deependujha, many thanks. BTW, the above code sometimes gives the FileNotFoundError (and the training loop continues for several iterations and then hangs), and sometimes it just hangs. Not sure if it will help or not, but I'm still pasting it here.
Hey @rxqy. Could you try to add a try / except around it in LitData and let us know if it helps? There is a race condition on deleting the file, but it is fine to catch and skip it. If it helps, would you mind making a PR with the fix?
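A minimal sketch of the suggested guard, assuming the deletion happens through a plain `os.remove` on the cached chunk file (the helper name is hypothetical):

```python
import os

def safe_remove(chunk_path: str) -> None:
    """Delete a cached chunk, tolerating the race where another worker already removed it."""
    try:
        os.remove(chunk_path)
    except FileNotFoundError:
        # Another rank / dataloader worker deleted this chunk first; safe to skip.
        pass
```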
Hi @tchaton. I think this should be on the Lightning side? I wrote a plain PyTorch DDP demo. With the exact same dataloader, we can finish training quite smoothly.
Just to clarify, I made no code changes to my litdata or lightning packages, and we are not using Fabric in our trainer.
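For reference, a rough sketch of what such a plain PyTorch DDP demo could look like while reusing the same StreamingDataLoader (the S3 path, model, and batch handling are placeholders, not the commenter's actual script; it assumes the optimized dataset yields `(image_tensor, label)` pairs):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from litdata import StreamingDataset, StreamingDataLoader

def main():
    # Launched with: torchrun --nproc_per_node=8 demo.py
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = StreamingDataset("s3://my-bucket/optimized-imagenet", shuffle=True)
    loader = StreamingDataLoader(dataset, batch_size=64, num_workers=8)

    # Placeholder model that assumes 224x224 RGB inputs and 1000 classes.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000)).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```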
You should instantiate the dataset in the setup hook of the datamodule or directly within the dataloader hook.
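For reference, a minimal sketch of that pattern with litdata's StreamingDataset (the bucket path, batch size, and worker count are placeholders):

```python
import lightning as L
from litdata import StreamingDataset, StreamingDataLoader

class MyDataModule(L.LightningDataModule):
    def __init__(self, input_dir: str = "s3://my-bucket/optimized-dataset"):
        super().__init__()
        self.input_dir = input_dir
        self.train_dataset = None

    def setup(self, stage: str) -> None:
        # Created here (not in __init__) so every DDP process builds its own dataset instance.
        self.train_dataset = StreamingDataset(self.input_dir, shuffle=True)

    def train_dataloader(self):
        return StreamingDataLoader(self.train_dataset, batch_size=32, num_workers=4)
```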
@tchaton We're running into an identical issue. We are also getting:
We are initiating our dataset in
When using DDP with remote data, we get 1 iteration/second. After the 1st epoch, 15-16 steps run forward at 1 iteration/second and then training stalls for 3-5 minutes (no GPU utilization). Any ideas what the underlying issue could be?
@tchaton a +1 to this. The issue we noted is that one of the dataloader worker threads ends up caught at exactly the loop you've added a timeout to in #456, however the
For what it's worth, I was having a hard time tracking this bug down last week, and found it persists even when all of the
Training on 4 nodes, 8 GPUs each, with DDP. Still see this problem with #456 in. Currently have training running by
Hey @JackUrb. Thanks for the info. Yes, we need to spend more time finding the source of this bug. I wonder if you could add more prints to see if you learn more. Happy to pair-debug with you.
I've also got this running stably by introducing a count file for the shards. Running a 4x8 job at the moment that appears to be stable with:
At the moment though, many don't clean up, as I end up with many counts that never go back to 0. Once I get an 8x8 job stable under this setup, I'll try with compression as well, and if that looks good I'll open a PR for my count-lock change.
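For illustration only, a rough sketch of how such a per-shard count file could work (all helper names are hypothetical and this is not the change referenced above; it leans on the third-party `filelock` package):

```python
import os
from filelock import FileLock  # pip install filelock

def _count_path(chunk_path: str) -> str:
    return chunk_path + ".count"

def _read_count(chunk_path: str) -> int:
    try:
        with open(_count_path(chunk_path)) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def _write_count(chunk_path: str, value: int) -> None:
    with open(_count_path(chunk_path), "w") as f:
        f.write(str(value))

def acquire_chunk(chunk_path: str) -> None:
    # Record one more reader of this chunk before it is used.
    with FileLock(_count_path(chunk_path) + ".lock"):
        _write_count(chunk_path, _read_count(chunk_path) + 1)

def release_chunk(chunk_path: str) -> None:
    # Drop our reference and only delete the chunk once nobody else holds one.
    with FileLock(_count_path(chunk_path) + ".lock"):
        count = max(_read_count(chunk_path) - 1, 0)
        _write_count(chunk_path, count)
        if count == 0 and os.path.exists(chunk_path):
            os.remove(chunk_path)
```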
Hey @JackUrb. Thanks, great to hear! Feel free to make a draft PR already, so I can have a look and maybe investigate on my end too. Best regards,
Eventually ran into the issue, which manifests as a slowdown, presumably from constantly waiting for downloads.
🐛 Bug
Hi, we are using Lightning with litdata on our local machine and AWS S3. However, training hangs randomly during the very first iterations with DDP and a remote cloud directory.
I tried several different configurations, but I'm not sure what I should check next.
| GPUs | Strategy | Files on | Result |
| --- | --- | --- | --- |
| 1 | No DDP | local SSD | OK |
| 1 | No DDP | remote (S3) | OK |
| 8 | DDP | local SSD | OK |
| 8 | DDP | remote (S3) | Stuck |
To Reproduce
I'm following the exact steps from the ImageNet demo, and I wrote a trainer myself here.
Just running `python train.py` with different `CUDA_VISIBLE_DEVICES` is enough.
Code sample
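A minimal sketch of a trainer along these lines (placeholder model, S3 path, and hyperparameters; this is not the actual script from the report):

```python
import lightning as L
import torch
from torch import nn
from litdata import StreamingDataset, StreamingDataLoader

class LitClassifier(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Placeholder model that assumes 224x224 RGB inputs and 1000 classes.
        self.model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # Dataset created inside the dataloader hook; path is a placeholder.
        dataset = StreamingDataset("s3://my-bucket/optimized-imagenet", shuffle=True)
        return StreamingDataLoader(dataset, batch_size=64, num_workers=8)

if __name__ == "__main__":
    # Run with e.g. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
    trainer = L.Trainer(accelerator="gpu", devices=-1, strategy="ddp", max_epochs=1)
    trainer.fit(LitClassifier())
```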
Expected behavior
Training should finish.
Additional context
Due to some regulations here, we cannot put our data or training scripts on Lightning Studio. I'm not sure if something's wrong with our S3 bucket or our network configuration.
One thing I noticed is that even when training gets stuck at some iteration (<50), we can still observe high network throughput on our machine (around 100 MB/s), but the local chunk directory (~/.lightning/chunks) stops growing.
Current environment