Loss curve spikes on amalgamated datasets - need full scale shuffler in dataloader #128
Comments
@tianyu-l @lessw2020 FYI, I am using this trick:
    hf_ds = HuggingFaceDataset(
        dataset_name, dataset_path, tokenizer, seq_len, world_size, rank, infinite
    )
    if shuffle:
        hf_ds._data = hf_ds._data.shuffle(seed=int(rank * 10007 + int(time.time())))
@XinDongol Why would you shuffle the dataset with that seed? Once Stateful DataLoaders are merged, which should happen soon, you won't be able to resume training properly after a crash, because you won't know how the dataset was shuffled. Random seeds exist to make results reproducible; a time-based seed does exactly the opposite.
@XinDongol For map-style datasets, this works as expected. However, for [...]
@TJ-Solergibert Checkpointing the random seeds used to shuffle the dataset would solve the problem. FYI, it is on our roadmap.
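To illustrate the point about checkpointing the shuffle seed, here is a minimal sketch in plain Python. `SeededShuffler` and its methods are hypothetical names, not part of this project or of any dataloader library; the only assumption is that the seed and epoch are saved alongside the rest of the training state.

```python
import random


class SeededShuffler:
    """Hypothetical helper: a shuffle order that can be rebuilt from a checkpoint."""

    def __init__(self, base_seed, epoch=0):
        self.base_seed = base_seed
        self.epoch = epoch

    def permutation(self, n):
        # A private Random instance keeps the order independent of global RNG state.
        rng = random.Random(self.base_seed * 1_000_003 + self.epoch)
        order = list(range(n))
        rng.shuffle(order)
        return order

    def shard(self, n, rank, world_size):
        # Every rank computes the same permutation, then takes a disjoint slice of it.
        return self.permutation(n)[rank::world_size]

    def state_dict(self):
        # Saving these two integers is enough to reproduce the order after a crash.
        return {"base_seed": self.base_seed, "epoch": self.epoch}

    def load_state_dict(self, state):
        self.base_seed = state["base_seed"]
        self.epoch = state["epoch"]
```

Because the order depends only on `base_seed` and `epoch`, a resumed run that restores the state dict sees exactly the same sample order as the run that crashed, which a `time.time()`-based seed cannot guarantee.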
Thanks for your answer @tianyu-l, it makes sense 😅 I was wondering, any idea to not use [...]
Shuffling at the entire dataset level should be part of data preprocessing, not data loading. So closing this task for now.
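As a rough sketch of what shuffling at the preprocessing stage could look like, the function below merges several line-delimited source files and writes them back out once in a fixed random order. The function name, the one-sample-per-line file layout, and the assumption that the corpus fits in memory are all illustrative, not something this project prescribes.

```python
import random


def write_shuffled_corpus(input_paths, output_path, seed=0):
    """Merge line-delimited source files and write them in one fixed random order."""
    lines = []
    for path in input_paths:
        with open(path, encoding="utf-8") as f:
            # One sample per line; blank lines are skipped.
            lines.extend(line.rstrip("\n") for line in f if line.strip())

    # Assumption: the merged corpus fits in memory. A larger corpus would need
    # an external (on-disk) shuffle instead.
    random.Random(seed).shuffle(lines)

    with open(output_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)
```

With the order fixed on disk, the dataloader can keep reading sequentially, and resuming from a checkpoint only requires remembering the read position.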
As part of e2e training, we encountered wild loss curve spikes:
After additional hyperparameter tuning and further investigation, the root cause turned out to be that we read the dataset sequentially. The model sees data type A, learns and improves, then hits data type B, is surprised (the loss spikes), learns and improves again, and the cycle repeats.
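A toy illustration of that effect, using made-up source labels rather than the real dataset: when sources are stored back to back, the model sees one long uninterrupted run per source, whereas a global shuffle breaks those runs into short stretches.

```python
import random
from itertools import groupby


def longest_source_run(labels):
    """Length of the longest stretch of consecutive samples from a single source."""
    return max(len(list(group)) for _, group in groupby(labels))


# Hypothetical amalgamated dataset: three sources of 1,000 samples each, stored back to back.
sequential = ["A"] * 1000 + ["B"] * 1000 + ["C"] * 1000

# The same samples after a full-dataset shuffle with a fixed seed.
shuffled = sequential.copy()
random.Random(0).shuffle(shuffled)

print(longest_source_run(sequential))  # 1000: a whole source is read before the next begins
print(longest_source_run(shuffled))    # far shorter: sources are interleaved
```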
By training with a 'single data source' dataset, in this case openwebtext, we see a very nice loss curve on e2e training, showcasing that the issue is the lack of shuffling: