
fix: upsample the phase10 knowledge dataset #377

Merged 1 commit into instructlab:main on Nov 15, 2024

Conversation

RobotSail (Member)
When we mix the knowledge dataset with skills today, we do not account for the potential discrepancy
in size between the generated knowledge data and the skills data. As a result, the model may forget
the data it was trained on during the knowledge phase. As a simple workaround, we upsample the
knowledge samples before mixing them in with the generated skills dataset.

Signed-off-by: Oleg S [email protected]
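The workaround described above can be sketched roughly as follows. This is an illustrative sketch, not the actual datamixing.py implementation; the helper name and sample shapes are assumptions.

```python
import random


def upsample(knowledge, skills_len, seed=42):
    """Repeat knowledge samples until their count matches the skills dataset.

    Hypothetical sketch of upsampling before mixing; the real logic lives in
    src/instructlab/sdg/datamixing.py.
    """
    if not knowledge or len(knowledge) >= skills_len:
        return list(knowledge)
    factor = skills_len // len(knowledge)            # whole repetitions
    remainder = skills_len - factor * len(knowledge)  # partial top-up
    rng = random.Random(seed)
    upsampled = knowledge * factor + rng.sample(knowledge, remainder)
    rng.shuffle(upsampled)
    return upsampled


knowledge = [{"id": i} for i in range(3)]
mixed = upsample(knowledge, 10)
print(len(mixed))  # 10
```

With the knowledge samples scaled up to roughly the size of the skills data, the subsequent mix no longer drowns out the knowledge phase.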

@mergify mergify bot added the ci-failure label Nov 13, 2024
@bbrowning (Contributor)
Is upsampling a special case here? Or do we just need to adjust the mixing recipe in use for these knowledge leaf node(s) to have a fixed sampling size or a sampling ratio larger than the default of 1.0? See _sample_ds and _adjust_train_sample_size in datamixing.py for examples of what we already do today. You'll also see that _gen_leaf_node_data takes an optional fourth parameter, the sampling size, which means we could pass in the desired fixed number of samples (or ratio of samples) when generating the knowledge leaf node data; that value would then get written out to the recipe.yaml file and used in the final data mixing to scale up the knowledge samples.
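The ratio-or-fixed-size mechanism described here could look something like the following. This is a hypothetical sketch of the idea, not the actual _sample_ds code in datamixing.py; the function name and signature here are assumptions.

```python
import random


def sample_ds(ds, sampling_size, seed=42):
    """Sample a dataset by a float ratio or a fixed integer count.

    A sampling_size of 2.0 doubles the dataset, 0.5 halves it, and an int
    requests exactly that many samples. Sketch only; see _sample_ds in
    datamixing.py for what the project actually does.
    """
    rng = random.Random(seed)
    if isinstance(sampling_size, float):
        target = int(len(ds) * sampling_size)
    else:
        target = sampling_size
    if target <= len(ds):
        # downsample without replacement
        return rng.sample(ds, target)
    # upsample with replacement beyond the dataset size
    return [rng.choice(ds) for _ in range(target)]
```

Passing a ratio above 1.0 for the knowledge leaf node would then scale its samples up at mix time, with the value recorded in recipe.yaml rather than hard-coded.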

@bbrowning (Contributor) left a comment
This looks like a very reasonable solution to the upscaling problem in such a short time. I actually don't think it's quite as hacky as the code comments imply, but agree it's not an ideal solution to this general problem.

I'm running the full data generation pipeline against a sample taxonomy with a skill leaf node, a knowledge leaf node, and a precomputed skills dataset getting mixed in via a customized default skills recipe (i.e., using https://github.com/instructlab/sdg/blob/main/docs/data_mixing.md#using-instructlab-community-pre-generated-dataset). However, I don't think this will finish on my available hardware before I head out for the night. If for some reason it errors out overnight because of these changes, I'll leave a note tomorrow.

Other than the one nit about replacing the stdout print with a logger, this looks ready to go. Since I'll be scarce tomorrow, going ahead and giving this one approval.

Thanks for the detailed PR, taking a couple of iterations on this to make it far less hacky than originally proposed, and the attention to detail with code comments and type hints!

Review thread on src/instructlab/sdg/datamixing.py (outdated, resolved)
@mergify mergify bot removed the ci-failure label Nov 15, 2024
@RobotSail RobotSail requested review from aakankshaduggal and removed request for aakankshaduggal November 15, 2024 13:52
@khaledsulayman (Member) left a comment

This is a good approach, thanks for working to get this in!

@mergify mergify bot removed the one-approval label Nov 15, 2024
@aakankshaduggal (Member) left a comment

Thanks @RobotSail 🚢

@mergify mergify bot merged commit f42ea19 into instructlab:main Nov 15, 2024
22 checks passed
4 participants