data shuffling #635
I understand that the current version of the code doesn't shuffle the data during training, i.e. examples are consumed in order on each rank (in fact, there's a note to that effect here). I'm fairly new to large-scale LLM training, so I was wondering whether this is common practice. It seems potentially suboptimal, since consecutive gradients will likely be more correlated than they would be under random shuffling.

If I wanted to randomly shuffle the data during training, how could I go about doing that? I thought about calling ds.shuffle() before splitting the dataset by node here, but that would (pseudo-)shuffle the data rows, which doesn't seem quite right: I think we really want to shuffle concatenated seq_len-long chunks of text instead.
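As a rough illustration (not code from this repository), the "ds.shuffle() before splitting by node" idea described above could look something like the sketch below, assuming the Hugging Face datasets streaming API; the dataset name, seed, buffer size, and rank/world_size values are placeholders.

```python
# Sketch: row-level shuffling before sharding across data-parallel ranks,
# using the Hugging Face `datasets` streaming API. Dataset name, seed,
# buffer size, and rank/world_size below are illustrative placeholders.
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("my_text_corpus", split="train", streaming=True)  # hypothetical dataset

# For a streaming dataset, shuffle() reorders shards and draws from an in-memory
# buffer, so it only (pseudo-)shuffles data rows, not seq_len-long training chunks.
ds = ds.shuffle(seed=42, buffer_size=10_000)

# Shard the row-shuffled stream across ranks.
ds = split_dataset_by_node(ds, rank=0, world_size=8)  # normally taken from the launcher env

for example in ds:
    pass  # rows are then tokenized and packed into seq_len-long samples downstream
```

The limitation raised in the question still applies here: the shuffle happens before tokenization and packing, so it is the data rows, not the concatenated seq_len-long samples, that get permuted.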
Comments

The question seems related to #128. My understanding is that it depends on how non-random your data is and what your data format is.

This shouldn't matter too much. I think the underlying assumption is that most data rows are shorter than the sequence length, i.e. a training sample would consist of multiple data rows, so shuffling at the data-row level suffices.

Thanks a lot for the pointers. I suppose this would also become less of an issue with a bigger … Re: …

Hmm, interesting... I think what you really want in this case is long contexts, which should be solved by putting all the context in the same row and allowing for a longer seq_len.

Closing this, as I think the decision of whether to shuffle the data, and how to shuffle it, will depend quite a bit on the specifics of the data and how it's structured (as well as other details like …).
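For contrast with the row-level approach above, a chunk-level shuffle (the alternative the question describes) would permute indices over an already-concatenated, pre-tokenized token stream. This is a minimal, repo-agnostic sketch with hypothetical file names and parameters, not the project's actual data loader.

```python
import numpy as np

def shuffled_chunk_order(num_tokens: int, seq_len: int, seed: int = 0) -> np.ndarray:
    """Return a permutation over the seq_len-long chunks of a concatenated,
    pre-tokenized token stream (e.g. a memory-mapped file of token ids)."""
    num_chunks = num_tokens // seq_len  # drop the ragged tail
    rng = np.random.default_rng(seed)
    return rng.permutation(num_chunks)

# Hypothetical usage with a memory-mapped token file and simple rank-strided sharding:
# tokens = np.memmap("train.bin", dtype=np.uint16, mode="r")
# for chunk_idx in shuffled_chunk_order(len(tokens), seq_len=2048, seed=epoch)[rank::world_size]:
#     sample = tokens[chunk_idx * 2048 : (chunk_idx + 1) * 2048]
```

Reseeding per epoch would give a fresh permutation on each pass, while each rank still reads a disjoint set of chunks.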