
fix: upsample the phase10 knowledge dataset #377

Merged 1 commit into instructlab:main on Nov 15, 2024

Conversation

RobotSail (Member)
When we mix the knowledge dataset with skills today, we do not account for the potential discrepancy
in size between the generated knowledge data and the skills data. As a result, the model may forget
the data it was trained on during the knowledge phase. As a simple workaround, we upsample the
knowledge samples before mixing them in with the generated skills dataset.

Signed-off-by: Oleg S [email protected]
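The workaround described above can be sketched roughly as follows. This is an illustrative sketch, not the actual datamixing.py implementation; the helper name and sample shapes are assumptions.

```python
import random


def upsample(knowledge, skills_len, seed=42):
    """Repeat knowledge samples until their count matches the skills dataset.

    Hypothetical sketch of upsampling before mixing; the real logic lives in
    src/instructlab/sdg/datamixing.py.
    """
    if not knowledge or len(knowledge) >= skills_len:
        return list(knowledge)
    factor = skills_len // len(knowledge)            # whole repetitions
    remainder = skills_len - factor * len(knowledge)  # partial top-up
    rng = random.Random(seed)
    upsampled = knowledge * factor + rng.sample(knowledge, remainder)
    rng.shuffle(upsampled)
    return upsampled


knowledge = [{"id": i} for i in range(3)]
mixed = upsample(knowledge, 10)
print(len(mixed))  # 10
```

With the knowledge samples scaled up to roughly the size of the skills data, the subsequent mix no longer drowns out the knowledge phase.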

@mergify mergify bot added the ci-failure label Nov 13, 2024
@bbrowning (Contributor)
Is upsampling a special case here? Or do we just need to adjust the mixing recipe in use for these knowledge leaf node(s) to have a fixed sampling size or a sampling ratio larger than the default of 1.0? See _sample_ds and _adjust_train_sample_size in datamixing.py for examples of what we already do today. You'll also see that _gen_leaf_node_data takes an optional fourth parameter, the sampling size, which means we could pass in the desired fixed number of samples (or ratio of samples) when generating the knowledge leaf node data; that value would then get written out to the recipe.yaml file and used in the final data mixing to scale up the knowledge samples.
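The ratio-or-fixed-size mechanism described here could look something like the following. This is a hypothetical sketch of the idea, not the actual _sample_ds code in datamixing.py; the function name and signature here are assumptions.

```python
import random


def sample_ds(ds, sampling_size, seed=42):
    """Sample a dataset by a float ratio or a fixed integer count.

    A sampling_size of 2.0 doubles the dataset, 0.5 halves it, and an int
    requests exactly that many samples. Sketch only; see _sample_ds in
    datamixing.py for what the project actually does.
    """
    rng = random.Random(seed)
    if isinstance(sampling_size, float):
        target = int(len(ds) * sampling_size)
    else:
        target = sampling_size
    if target <= len(ds):
        # downsample without replacement
        return rng.sample(ds, target)
    # upsample with replacement beyond the dataset size
    return [rng.choice(ds) for _ in range(target)]
```

Passing a ratio above 1.0 for the knowledge leaf node would then scale its samples up at mix time, with the value recorded in recipe.yaml rather than hard-coded.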

@bbrowning (Contributor) left a comment
This looks like a very reasonable solution to the upscaling problem in such a short time. I actually don't think it's quite as hacky as the code comments imply, but agree it's not an ideal solution to this general problem.

I'm running the full data generation pipeline against a sample taxonomy with a skill leaf node, a knowledge leaf node, and a precomputed skills dataset getting mixed in via a customized default skills recipe (i.e., using https://github.com/instructlab/sdg/blob/main/docs/data_mixing.md#using-instructlab-community-pre-generated-dataset). However, I don't think this will finish on my available hardware before I head out for the night. If for some reason it errors out overnight because of these changes, I'll leave a note tomorrow.

Other than the one nit about replacing the stdout print with a logger, this looks ready to go. Since I'll be scarce tomorrow, going ahead and giving this one approval.

Thanks for the detailed PR, taking a couple of iterations on this to make it far less hacky than originally proposed, and the attention to detail with code comments and type hints!

Review thread on src/instructlab/sdg/datamixing.py (outdated, resolved)
@mergify mergify bot removed the ci-failure label Nov 15, 2024
@RobotSail RobotSail requested review from aakankshaduggal and removed request for aakankshaduggal November 15, 2024 13:52
@khaledsulayman (Member) left a comment

This is a good approach, thanks for working to get this in!

@mergify mergify bot removed the one-approval label Nov 15, 2024
@aakankshaduggal (Member) left a comment

Thanks @RobotSail 🚢

@mergify mergify bot merged commit f42ea19 into instructlab:main Nov 15, 2024
22 checks passed
4 participants