
Loss curve spikes on amalgamated datasets - need full-scale shuffler in dataloader #128

Closed
lessw2020 opened this issue Mar 12, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@lessw2020
Contributor

As part of e2e training, encountered wild loss curve spikes:

[screenshot: loss curve with large spikes, 2024-03-07]

After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially: the model sees data type A, learns and improves, then hits data type B, is surprised (loss spikes), then learns and improves again, and so on.

By training with a single-source dataset, in this case openwebtext, we see a very smooth loss curve on e2e training, showing that the issue is the lack of shuffling:
[screenshot: smooth loss curve on openwebtext, 2024-03-12]

@tianyu-l tianyu-l added the enhancement New feature or request label May 3, 2024
@XinDongol

@tianyu-l @lessw2020 FYI, I am using this trick.

  import time

  hf_ds = HuggingFaceDataset(
      dataset_name, dataset_path, tokenizer, seq_len, world_size, rank, infinite
  )
  if shuffle:
      # per-rank, time-varying seed
      hf_ds._data = hf_ds._data.shuffle(seed=rank * 10007 + int(time.time()))

@TJ-Solergibert

@XinDongol Why would you shuffle the dataset with that seed? Once Stateful DataLoaders merge (which will be soon), you won't be able to resume training properly after a crash, because you won't know how the dataset was shuffled.

Random seeds are used to ensure that results are reproducible; a time-based seed achieves exactly the opposite.

@tianyu-l
Contributor

  import time

  hf_ds = HuggingFaceDataset(
      dataset_name, dataset_path, tokenizer, seq_len, world_size, rank, infinite
  )
  if shuffle:
      # per-rank, time-varying seed
      hf_ds._data = hf_ds._data.shuffle(seed=rank * 10007 + int(time.time()))

@XinDongol For a map-style dataset, this works as expected. However, an IterableDataset uses a fixed-size buffer to apply randomness within. The issue won't be fixed if the buffer size is not / cannot be large enough to cover the different amalgamated datasets.
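To make the buffer limitation concrete, here is a minimal pure-Python sketch (not the actual `datasets` implementation) of how a fixed-size shuffle buffer behaves on two concatenated data sources:

```python
import random

def buffer_shuffle(stream, buffer_size, seed=0):
    """Approximate shuffling via a fixed-size buffer, similar in spirit
    to IterableDataset.shuffle(buffer_size=...)."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # yield a random buffered item and replace it with the new one
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # flush the remaining buffer in random order
    rng.shuffle(buffer)
    yield from buffer

# Two amalgamated "datasets": 1000 samples of type A, then 1000 of type B.
stream = ["A"] * 1000 + ["B"] * 1000
small = list(buffer_shuffle(stream, buffer_size=10, seed=0))
```

With a buffer of 10 over 2000 samples, roughly the first thousand outputs are still all type A: the buffer never holds a B sample until the stream actually reaches them, so a small buffer cannot mix amalgamated datasets.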

@TJ-Solergibert Checkpointing the random seeds used to shuffle the dataset would solve the problem. FYI, it is on our roadmap.
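For reference, the seed-checkpointing idea can be sketched in a few lines (hypothetical names, not the torchtitan API): save the seed and progress in the checkpoint, then re-derive the identical shuffle order on resume.

```python
import random

def make_epoch_order(num_samples, seed):
    """Deterministically shuffle sample indices for one epoch."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order

# On checkpoint: save only the seed and progress, not the shuffled data.
state = {"shuffle_seed": 10007, "samples_seen": 5}

# On resume: re-derive the identical order from the saved seed.
order = make_epoch_order(10, state["shuffle_seed"])
remaining = order[state["samples_seen"]:]
```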

@TJ-Solergibert

Thanks for your answer @tianyu-l , it makes sense 😅

I was wondering, any idea how to avoid using .skip() when resuming training? In my setup (a Colab), skipping 10,000,000 samples took approximately 90 s.

from datasets import load_dataset

# in streaming mode, .skip() iterates over and discards the skipped samples
ds = load_dataset("allenai/c4", name="en", split="train", streaming=True)
ds = ds.skip(10000000)
ds = iter(ds)
next(ds)  # the first sample only arrives after the whole skip completes

@tianyu-l
Contributor

I was wondering, any idea how to avoid using .skip() when resuming training? In my setup (a Colab), skipping 10,000,000 samples took approximately 90 s.

@TJ-Solergibert

  1. We should use .skip() when resuming training. In fact, it has been put into "Use stateful dataloader to checkpoint data iteration order and token buffer" (#279).
  2. That doesn't mean it is the ideal solution. E.g., the C4 en section has more than 300M entries, which, according to your example, means over 45 minutes of skipping if we stop somewhere toward the end of the dataset. Ideally, even for a streaming=True IterableDataset, .skip() should be able to seek directly to the right file position. As far as we know, this is something HF is working on.
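A toy sketch of the "seek instead of skip" idea, mirroring the `state_dict`/`load_state_dict` interface that stateful dataloaders expose (not the torchdata or HF implementation):

```python
class SeekableShardedDataset:
    """Toy iterable over sharded data that resumes from saved state by
    seeking to a (shard, offset) position, instead of replaying .skip()."""

    def __init__(self, shards):
        self.shards = shards      # list of lists, standing in for data files
        self.shard_idx = 0
        self.offset = 0

    def __iter__(self):
        while self.shard_idx < len(self.shards):
            shard = self.shards[self.shard_idx]
            while self.offset < len(shard):
                item = shard[self.offset]
                self.offset += 1
                yield item
            self.shard_idx += 1
            self.offset = 0

    def state_dict(self):
        return {"shard_idx": self.shard_idx, "offset": self.offset}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.offset = state["offset"]

shards = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ds = SeekableShardedDataset(shards)
it = iter(ds)
seen = [next(it) for _ in range(4)]   # consume 4 samples
state = ds.state_dict()               # checkpoint the position

resumed = SeekableShardedDataset(shards)
resumed.load_state_dict(state)        # O(1) seek, no samples re-read
rest = list(iter(resumed))
```

Resuming costs one dictionary assignment regardless of how far into the dataset the crash happened, whereas replaying .skip() costs time proportional to the number of skipped samples.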

@tianyu-l
Contributor

Shuffling at the entire dataset level should be part of data preprocessing, not data loading. So closing this task for now.
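For anyone landing here, a preprocessing-time global shuffle can be sketched as follows (pure-Python illustration only; in practice you would shuffle and rewrite the tokenized shards once, offline):

```python
import random

def preprocess_global_shuffle(sources, seed):
    """One-time global shuffle across all amalgamated sources, done
    offline so the training dataloader can then read sequentially."""
    samples = [s for src in sources for s in src]
    random.Random(seed).shuffle(samples)
    return samples

# Two homogeneous sources that would cause loss spikes if read back to back.
web = ["web"] * 1000
code = ["code"] * 1000
mixed = preprocess_global_shuffle([web, code], seed=0)
# Both data types are now interleaved throughout, so training never hits
# one long homogeneous run followed by another.
```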
