New repository proposal: sdg #63

Merged 1 commit on Jun 1, 2024

russellb
Member

This document includes a proposal for a new repository which contains
a Python library focused on Synthetic Data Generation (SDG). This is a
discrete area of functionality that can evolve on its own with its own
group of contributors and maintainers.

Signed-off-by: Russell Bryant [email protected]

Contributor

@oindrillac oindrillac left a comment

Thanks!

@nathan-weinberg
Member

@russellb Can we add a bit about creating a new SDG Triager/Maintainer team, in line with the scheme we have for the CLI teams? This team will be responsible for triaging, maintenance, releases, etc.

Alternatively this can be a proposal to the community repo CONTRIBUTORING_ROLES.md doc once the new repo is created.

@markstur
Member

I wonder if anyone can come up with a less cryptic name than SDG, but I don't have a good suggestion (data-gen or synth-data... etc don't thrill me)

Member

@markstur markstur left a comment

make it so

@lhawthorn
Member

Dumb question - why don't we just call it synthetic-data-generation? It does what it says on the tin.

@nathan-weinberg
Member

Honestly I'm not opposed to @lhawthorn's suggestion or something like synthetic-datagen

@russellb
Member Author

I would prefer a rough mapping to the Python package included in the repo. This proposal was instructlab/sdg for the instructlab-sdg package. For the other name suggestions, a PyPI check to see if it can also be the Python package name would be good. It would also be helpful to know if the name suggestions are brainstorming or a “-1” for the proposed name.

Naming is hard.
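
The PyPI check mentioned above could be sketched roughly as follows. This is a minimal illustration, not part of the proposal: the helper names (`normalize`, `pypi_url`, `http_status`, `is_name_taken`) are hypothetical, and it assumes PyPI's JSON endpoint `https://pypi.org/pypi/<name>/json` returns 200 for an existing project and 404 otherwise. A 404 only suggests a name is free; PyPI may still reject names that are reserved or too similar to existing ones.

```python
import re
import urllib.error
import urllib.request


def normalize(name: str) -> str:
    """Normalize a project name per PEP 503 (runs of -, _, . become one -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pypi_url(name: str) -> str:
    """Build the PyPI JSON API URL for a candidate project name."""
    return f"https://pypi.org/pypi/{normalize(name)}/json"


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def is_name_taken(name: str, fetch_status=http_status) -> bool:
    """True if a project with this (normalized) name already exists on PyPI.

    fetch_status is injectable so the logic can be exercised offline.
    """
    return fetch_status(pypi_url(name)) == 200
```

Calling `is_name_taken("synthetic-data-generation")` or `is_name_taken("synthetic-datagen")` would then give a first indication of whether a suggested repository name could double as the package name.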

Member

@hickeyma hickeyma left a comment

Nice proposal, thanks @russellb.

One thing: should we also move the basic implementation currently in the CLI (a modified version of self-instruct (https://arxiv.org/abs/2212.10560)) to the new repository? That way we extract all SDG implementations from the CLI and have the community-supported implementations in one repository.

@russellb
Member Author

> Nice proposal, thanks @russellb.
>
> One thing: should we also move the basic implementation currently in the CLI (a modified version of self-instruct (https://arxiv.org/abs/2212.10560)) to the new repository? That way we extract all SDG implementations from the CLI and have the community-supported implementations in one repository.

Yes, I envisioned that we would move the current implementation in instructlab/instructlab repo over there. It's probably the fastest way to bootstrap the new repo, actually. Then we can enhance it from there.

@bjhargrave
Contributor

We should also note that the sdg code is a consumer of the schema, since it must be able to understand taxonomy qna.yaml files.

I wonder if it is time to have the schema repo publish instructlab-schema to PyPI so that ilab, sdg, and taxonomy can all have a Python dependency on the schema rather than more repos submoduling the schema repo. Using submodules was useful before we started publishing to PyPI, but perhaps now is the time to move away from them.
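
As a rough sketch of what depending on a published package rather than a submodule could look like from a consumer's side, the snippet below checks whether the schema distribution is installed and reports its version. The distribution name `instructlab-schema` comes from the comment above; the helper `installed_version` is hypothetical, and nothing here reflects the package's actual API.

```python
from importlib import metadata
from typing import Optional

# Distribution name as suggested in the comment above; whether and how it
# is published is an assumption of this sketch.
SCHEMA_DIST = "instructlab-schema"


def installed_version(dist_name: str = SCHEMA_DIST) -> Optional[str]:
    """Return the installed version of dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

With a submodule, a consumer has to locate schema files by relative path; with a package dependency, the installer resolves the version and a check like this is enough to confirm it is present.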

@russellb
Member Author

> We should also note that the sdg code is a consumer of the schema, since it must be able to understand taxonomy qna.yaml files.
>
> I wonder if it is time to have the schema repo publish instructlab-schema to PyPI so that ilab, sdg, and taxonomy can all have a Python dependency on the schema rather than more repos submoduling the schema repo. Using submodules was useful before we started publishing to PyPI, but perhaps now is the time to move away from them.

big +1 to moving away from git submodules for the schema - a package on pypi makes sense to me

@nathan-weinberg
Member

> > We should also note that the sdg code is a consumer of the schema, since it must be able to understand taxonomy qna.yaml files.
> >
> > I wonder if it is time to have the schema repo publish instructlab-schema to PyPI so that ilab, sdg, and taxonomy can all have a Python dependency on the schema rather than more repos submoduling the schema repo. Using submodules was useful before we started publishing to PyPI, but perhaps now is the time to move away from them.
>
> big +1 to moving away from git submodules for the schema - a package on pypi makes sense to me

+1 to this as well

@bjhargrave
Contributor

> > big +1 to moving away from git submodules for the schema - a package on pypi makes sense to me
>
> +1 to this as well

OK, I'll start some work towards this end.

@bjhargrave bjhargrave added the backend InstructLab Backend Services label May 31, 2024
@russellb russellb merged commit cfc1606 into instructlab:main Jun 1, 2024
4 checks passed
8 participants