-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ddp backend fix and documentation changes #68
base: main
Are you sure you want to change the base?
Conversation
|
||
gathered = [] | ||
n_chunks = n_samples // self.gather_frequency + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krishansubudhi For my understanding , why did we have to chunk before? I assumed it was to avoid exceeding GPU memory limit but it looks like we only move tensors to GPU in this loop and never out of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding @aminsaied who also initially created the DDP Trainer backend and chunking logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop first moves the tensors to GPU, then does all gather op, then moves the gathered tensors back to CPU. I believe this was at the request of @gshruti95 at the time for a specific workload that was being tested (keep me honest Shruti).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to introduce chunking in case of potential memory or timeout issues when trying to all gather for pretraining workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this logic will have to change back then, chunking needs to be implemented correctly and not hard coded to 1, but can be for the time being (just will be slow I think).
No description provided.