Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mix all the user's generated skills instead of 30 per leaf node #421

Closed

Conversation

bbrowning
Copy link
Contributor

Users have provided feedback that they expect all of their generated skills to get mixed in the final output dataset by default instead of truncating this to only 30. The choice of 30 was not a great default for every use case anyway, and mostly tailored towards very large taxonomies with large numbers of skills. Most of our users are using fewer numbers of skills, and expect a larger amount of generated skill data to make it into the results to increase the overall number of skill samples in the mixed output that matches their custom skills.

Fixes #420

Users have provided feedback that they expect all of their generated
skills to get mixed in the final output dataset by default instead of
truncating this to only 30. The choice of 30 was not a great default
for every use case anyway, and mostly tailored towards very large
taxonomies with large numbers of skills. Most of our users are using
fewer numbers of skills, and expect a larger amount of generated skill
data to make it into the results to increase the overall number of
skill samples in the mixed output that matches their custom skills.

Signed-off-by: Ben Browning <[email protected]>
Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning Thanks for the PR. I like the idea of mixing all the samples that are being generated, but we need to control/ensure how much we are generated vs how much we are mixing. Should we continue to default to maybe 30/50 and let the users override given special cases. Having said that, should we make this a parameter that we can expose?

Copy link

@jwm4 jwm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an improvement to me. FWIW, I agree it would be even better to make it a parameter as @aakankshaduggal suggests above.

@bbrowning
Copy link
Contributor Author

I agree that users need the ability to control how many skills get mixed. However, do we want to introduce a new parameter for that here? Or instead tackle that once we expose data mixing directly via the CLI? I'm personally inclined to have everything mix in the default recipes at a sampling size of 1.0 as the starting point, and then show advanced users how to re-mix their mixed dataset using Recipes, which gives them fine-grained control over the sampling size of every individual leaf node in the mixed dataset.

@bbrowning
Copy link
Contributor Author

We had some discussion outside GitHub about this, and perhaps it's not worth doing until we expose knobs to users to control the number of samples generated (which --sdg-scale-factor plays a role in, only sometimes) and the number of samples mixed (which has no exposed knob today).

@bbrowning
Copy link
Contributor Author

Closing this for now - we can revisit later if needed.

@bbrowning bbrowning closed this Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change default skills sampling size during datamixing to "1.0" from "30"
3 participants