data shuffling #635

Closed
eminorhan opened this issue Oct 20, 2024 · 4 comments
Labels
question Further information is requested

Comments

@eminorhan

I understand that the current version of the code doesn't shuffle the data during training, i.e. examples are consumed in order on each rank (in fact, there's a note to that effect here). I'm fairly new to large-scale LLM training, so I was wondering whether this is common practice. It seems potentially suboptimal, since consecutive gradients will likely be more correlated than they would be under random shuffling.

If I wanted to randomly shuffle the data during training, how could I go about doing that? I thought about using ds.shuffle() before splitting the dataset by node here, but that would (pseudo-)shuffle the data rows, which doesn't seem quite right, since I think we really want to shuffle concatenated seq_len-long chunks of text instead. A sketch of what I mean is below.
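For concreteness, here is roughly what I had in mind (a sketch assuming the Hugging Face streaming setup used for c4; the seed, buffer_size, rank, and world_size values are just placeholders):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank, world_size = 0, 8  # would come from torch.distributed in practice

# Stream c4 and pseudo-shuffle rows within a bounded buffer,
# then shard the stream across data-parallel ranks.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=10_000)  # shuffles data rows, not seq_len chunks
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```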

@tianyu-l tianyu-l added the question Further information is requested label Oct 21, 2024
@tianyu-l
Contributor

The question seems related to #128.

My understanding is that it depends on how non-random your data is and what your data format is.

  • If the dataset is not huge and comes in a map-style format, it is easy to shuffle globally.
  • If the dataset is huge and comes as a stream, you can only shuffle within some buffer, rather than across the entire dataset. If the non-randomness is at the top level, e.g. your data is composed of two unrelated topics stored one after another, then there's no easy way to shuffle between them during training. Instead, one should consider preprocessing the dataset before training, or running two processes that load the two topics in parallel. (See the sketch below.)
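A rough sketch of the two cases with Hugging Face datasets (the dataset name, split, and buffer size are illustrative, not a recommendation):

```python
from datasets import load_dataset

# Map-style: the whole index fits in memory, so a global shuffle is one call.
ds_map = load_dataset("allenai/c4", "en", split="train[:1%]")
ds_map = ds_map.shuffle(seed=42)  # full permutation of row indices

# Streaming: rows arrive in storage order; shuffling happens within a buffer.
ds_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
ds_stream = ds_stream.shuffle(seed=42, buffer_size=10_000)
# Rows farther apart than the buffer (e.g. two topics stored back to back)
# will never mix; that kind of non-randomness has to be fixed by preprocessing.
```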

> I thought about using ds.shuffle() before splitting the dataset by node here, but that would (pseudo-)shuffle the data rows, which doesn't seem quite right, since I think we really want to shuffle concatenated seq_len-long chunks of text instead.

This shouldn't matter too much. I think the underlying assumption is that most data rows are shorter than the sequence length, i.e. a training sample consists of multiple data rows, so shuffling at the data-row level suffices.
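To illustrate why row-level shuffling suffices under that assumption, here is a simplified packing sketch (not the actual dataloader in this repo):

```python
def pack_rows(token_streams, seq_len):
    """Concatenate tokenized rows and emit fixed-length training samples."""
    buffer = []
    for tokens in token_streams:  # rows arrive in (row-level shuffled) order
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Each seq_len-long sample mixes several short rows, so shuffling rows
# already decorrelates consecutive training samples.
print(list(pack_rows([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4)))
# [[1, 2, 3, 4], [5, 6, 7, 8]]
```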

@eminorhan
Author

Thanks a lot for the pointers. I suppose this would also become less of an issue with a bigger dp degree (dp_replicate * dp_shard), since each dp rank would be assigned a smaller number of data shards, and the shards assigned to each rank would have less variance, so to speak.

Re: ds.shuffle(), I was thinking that it could be problematic when consecutive data rows in the dataset are related to each other. Shuffling at the level of data rows would break the coherence of a sample, e.g. a chapter from a crime story might get concatenated with a dish recipe in the same sample. I think this isn't the case for c4 (where data rows come from randomized URLs), but it might be for other large datasets.

@tianyu-l
Contributor

> e.g. a chapter from a crime story might get concatenated with a dish recipe in the same sample.

Hmm, interesting... I think in this case what you really want is long contexts, which should be solved by putting all the context in the same row and allowing a longer seq_len, rather than by refraining from data shuffling.
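For example, something like this at preprocessing time (a sketch; the doc_id field is hypothetical and it assumes related rows are stored consecutively):

```python
from itertools import groupby

def merge_related_rows(rows):
    """Join consecutive rows from the same document into one long row,
    so shuffling cannot split them and a longer seq_len can cover them."""
    for _, group in groupby(rows, key=lambda r: r["doc_id"]):
        yield {"text": "\n".join(r["text"] for r in group)}
```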

@eminorhan
Author

Closing this, as I think the decision of whether and how to shuffle the data will depend quite a bit on the specifics of the dataset and how it's structured (as well as other details like the dp degree), so it's probably outside the scope of this repo. I think the current choice is not unreasonable for demonstration purposes.
