This document includes a proposal for a new repository which contains a Python library focused on Synthetic Data Generation (SDG). This is a discrete area of functionality that can evolve on its own with its own group of contributors and maintainers.

Signed-off-by: Russell Bryant <[email protected]>
Showing 2 changed files with 37 additions and 0 deletions.
@@ -89,6 +89,7 @@ RX
safetensors
Salawu
SDG
sdg
sexualized
SHA
Shivchander

@@ -0,0 +1,36 @@
# New Repository Proposal: sdg

## Summary

This document proposes a new repository under the `instructlab` GitHub organization:

- `instructlab/sdg`

## Background

The `instructlab/instructlab` repository includes a basic implementation of
Synthetic Data Generation (SDG). This implementation does not cover the full
approach described in the [LAB paper](https://arxiv.org/abs/2403.01081).

We want to build a more complete implementation of SDG that is more in
line with the LAB methodology. We propose a new repository to house this code
and to publish it as a new Python library called `instructlab-sdg`. The reasoning
for a new repository and library includes:

- We expect multiple consumers of this code. The `ilab` CLI is one, but we also
  envision building a REST API around it to help support scaling out this
  functionality on a cluster.
- We expect broader community interest in an open-source library and
  service for synthetic data generation. We envision that this library could
  support other data generation techniques over time.

||
## Alternatives Considered | ||
|
||
### Add to `instructlab/instructlab` | ||
|
||
We could add this code to the existing `instructlab/instructlab` repository. | ||
|
||
The primary argument against this approach is that we expect the scope of an | ||
`instructlab-sdg` library to expand beyond the scope of what would be run by the | ||
`ilab` CLI. We instead envision a different community of contributors organizing | ||
around SDG specifically. |