New repository proposal: sdg #63
Thanks!
@russellb Can we add a bit about creating a new SDG Triager/Maintainer team, in line with the scheme we have for the CLI teams? This team will be responsible for triaging, maintenance, releases, etc. Alternatively this can be a proposal to the |
I wonder if anyone can come up with a less cryptic name than SDG, but I don't have a good suggestion (data-gen or synth-data... etc don't thrill me) |
make it so
Dumb question - why don't we just call it synthetic-data-generation? It does what it says on the tin. |
Honestly I'm not opposed to @lhawthorn's suggestion or something like |
I would prefer a rough mapping to the Python package included in the repo. This proposal was instructlab/sdg for the instructlab-sdg package. For the other name suggestions, a PyPI check to see if it can also be the Python package name would be good. It would also be helpful to know if the name suggestions are brainstorming or a “-1” for the proposed name. Naming is hard. |
Nice proposal, thanks @russellb.
One thing: should we also move the basic implementation currently in the CLI (a modified version of self-instruct, https://arxiv.org/abs/2212.10560) to this repository? That way we extract all SDG implementation from the CLI and have the community-supported implementations in one repository.
Yes, I envisioned that we would move the current implementation in |
We should also note that the sdg code is also a consumer of the schema, since it must be able to understand taxonomy qna.yaml files. I wonder if it is time to have the schema repo publish |
big +1 to moving away from git submodules for the schema - a package on pypi makes sense to me |
+1 to this as well |
OK, I'll start some work towards this end. |
This document includes a proposal for a new repository which contains a Python library focused on Synthetic Data Generation (SDG). This is a discrete area of functionality that can evolve on its own with its own group of contributors and maintainers. Signed-off-by: Russell Bryant <[email protected]>