Add a config not to shuffle merged dataset #1394
Resolves #1393
if cfg.not_shuffle_merged_datasets:
    LOG.info("NOT shuffling merged pretraining datasets")
else:
    dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
I am unsure whether this is intended to shuffle the pre-training dataset (which is a single dataset) within the buffer size.
I am a bit curious: this only disables shuffling when merging datasets. Is shuffling within each dataset still intended?
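For readers following the question about shuffling "within the buffer size", here is a minimal pure-Python sketch of what a buffer-based shuffle over a stream generally does. It is an illustration only, not the code from this PR or from the `datasets` library; the function name `buffered_shuffle` is hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=42))
in_order = list(buffered_shuffle(range(5), buffer_size=1, seed=0))
```

With a small buffer an element can only move a limited distance towards the front of the stream, so the result is a local shuffle rather than a global one; with `buffer_size=1` the original order is preserved.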
Does this PR require any more changes before it can be merged? PTAL
…kip ci]
* Add a config not to shuffle merged dataset
* Update README.md
* Update src/axolotl/utils/config/models/input/v0_4_1/__init__.py
  Co-authored-by: Wing Lian
* invert the condition name
* update README
* info -> debug

Co-authored-by: Wing Lian
Description
Added a config named not_shuffle_merged_datasets, which I have been using in my fork for a long time :)
Motivation and Context
When training a model to expand its vocab with non-English tokens, I usually start with parallel corpora and then train it on web-crawled data or something else suitable for pre-training.
In that setup the order of the datasets matters, so it is better to give users an option not to shuffle the merged datasets.
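The effect of such an option can be sketched in plain Python as follows. This is an illustration under assumptions, not the axolotl implementation: `merge_datasets`, its `shuffle` parameter, and the sample rows are all hypothetical, and plain lists stand in for real datasets.

```python
import random
from typing import List, Sequence


def merge_datasets(
    datasets: Sequence[Sequence[str]], shuffle: bool, seed: int = 42
) -> List[str]:
    """Concatenate datasets in the given order, optionally shuffling the result."""
    merged = [row for ds in datasets for row in ds]
    if shuffle:
        # Shuffling mixes rows across dataset boundaries and loses the curriculum order.
        random.Random(seed).shuffle(merged)
    return merged


# Hypothetical stand-ins for the two training stages described above.
parallel = ["parallel pair 1", "parallel pair 2"]
web = ["web doc 1", "web doc 2", "web doc 3"]

ordered = merge_datasets([parallel, web], shuffle=False)
mixed = merge_datasets([parallel, web], shuffle=True)
```

With `shuffle=False`, every parallel-corpus row is seen before any web-crawled row, which is the ordering the motivation relies on.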
How has this been tested?
This config has been used in my fork for a long time, and I verified that it works by inspecting the loss graph.
Screenshots (if appropriate)
N/A
Types of changes
New feature (non-breaking change which adds functionality)