data shuffling #635

Closed
eminorhan opened this issue Oct 20, 2024 · 4 comments
Labels
question Further information is requested

Comments

@eminorhan

I understand that the current version of the code doesn't shuffle the data during training, i.e. examples are consumed in order on each rank (in fact, there's a note to that effect here). I'm fairly new to large-scale LLM training, so I was wondering whether this is common practice. It seems potentially suboptimal, since consecutive gradients will likely be more correlated than they would be under random shuffling.

If I wanted to randomly shuffle the data during training, how could I go about doing that? I thought about using ds.shuffle() before splitting the dataset by node here, but that would (pseudo-)shuffle the data rows, which doesn't seem quite right, since I think we really want to shuffle concatenated seq_len-long chunks of text instead. A sketch of what I mean is below.
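For concreteness, here is roughly what I had in mind (a sketch assuming the Hugging Face streaming setup used for c4; the seed, buffer_size, rank, and world_size values are just placeholders):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank, world_size = 0, 8  # would come from torch.distributed in practice

# Stream c4 and pseudo-shuffle rows within a bounded buffer,
# then shard the stream across data-parallel ranks.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=10_000)  # shuffles data rows, not seq_len chunks
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```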

@tianyu-l tianyu-l added the question Further information is requested label Oct 21, 2024
@tianyu-l
Contributor

The question seems related to #128.

My understanding is that it depends on how non-random your data is and what your data format is.

  • If the dataset is not huge and comes in a map-style format, it is easy to shuffle globally.
  • If the dataset is huge and comes as a stream, you can only shuffle within some buffer, rather than across the entire dataset. If the non-randomness is at the top level, e.g. your data is composed of two unrelated topics stored one after another, then there's no easy way to shuffle between them during training. Instead, one should consider preprocessing the dataset before training, or running two processes that load the two topics in parallel. (See the sketch below.)
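A rough sketch of the two cases with Hugging Face datasets (the dataset name, split, and buffer size are illustrative, not a recommendation):

```python
from datasets import load_dataset

# Map-style: the whole index fits in memory, so a global shuffle is one call.
ds_map = load_dataset("allenai/c4", "en", split="train[:1%]")
ds_map = ds_map.shuffle(seed=42)  # full permutation of row indices

# Streaming: rows arrive in storage order; shuffling happens within a buffer.
ds_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds_stream = ds_stream.shuffle(seed=42, buffer_size=10_000)
# Rows farther apart than the buffer (e.g. two topics stored back to back)
# will never mix; that kind of non-randomness has to be fixed by preprocessing.
```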

> I thought about using ds.shuffle() before splitting the dataset by node here, but that would (pseudo-)shuffle the data rows, which doesn't seem quite right, since I think we really want to shuffle concatenated seq_len-long chunks of text instead.

This shouldn't matter too much. I think the underlying assumption is that most data rows are shorter than the sequence length, i.e. a training sample consists of multiple data rows, so shuffling at the data-row level suffices.
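To illustrate why row-level shuffling suffices under that assumption, here is a simplified packing sketch (not the actual dataloader in this repo):

```python
def pack_rows(token_streams, seq_len):
    """Concatenate tokenized rows and emit fixed-length training samples."""
    buffer = []
    for tokens in token_streams:  # rows arrive in (row-level shuffled) order
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Each seq_len-long sample mixes several short rows, so shuffling rows
# already decorrelates consecutive training samples.
print(list(pack_rows([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4)))
# [[1, 2, 3, 4], [5, 6, 7, 8]]
```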

@eminorhan
Author

Thanks a lot for the pointers. I suppose this would also become less of an issue with a bigger dp degree (dp_replicate * dp_shard), since each dp rank would be assigned a smaller number of data shards, and the shards assigned to each rank would have less variance, so to speak.

Re: ds.shuffle(), I was thinking that it could be problematic when consecutive data rows in the dataset are related to each other. Shuffling at the level of data rows would break the coherence of a sample, e.g. a chapter from a crime story might get concatenated with a dish recipe in the same sample. I think this isn't the case for c4 (where data rows come from randomized URLs), but it might be for other large datasets.

@tianyu-l
Contributor

> e.g. a chapter from a crime story might get concatenated with a dish recipe in the same sample.

Hmm, interesting... I think in this case what you really want is long contexts, which should be solved by putting all the context in the same row and allowing a longer seq_len, rather than by refraining from data shuffling.
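For example, something like this at preprocessing time (a sketch; the doc_id field is hypothetical and it assumes related rows are stored consecutively):

```python
from itertools import groupby

def merge_related_rows(rows):
    """Join consecutive rows from the same document into one long row,
    so shuffling cannot split them and a longer seq_len can cover them."""
    for _, group in groupby(rows, key=lambda r: r["doc_id"]):
        yield {"text": "\n".join(r["text"] for r in group)}
```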

@eminorhan
Author

Closing this, as I think the decision of whether and how to shuffle the data will depend quite a bit on the specifics of the dataset and how it's structured (as well as other details like the dp degree), so it's probably outside the scope of this repo. I think the current choice is not unreasonable for demonstration purposes.
