New repository proposal: sdg
This document includes a proposal for a new repository which contains
a Python library focused on Synthetic Data Generation (SDG). This is a
discrete area of functionality that can evolve on its own with its own
group of contributors and maintainers.

Signed-off-by: Russell Bryant <[email protected]>
russellb committed May 22, 2024
1 parent 26d7e07 commit 44aa5fb
Showing 2 changed files with 37 additions and 0 deletions.
1 change: 1 addition & 0 deletions .spellcheck-en-custom.txt
@@ -89,6 +89,7 @@ RX
safetensors
Salawu
SDG
sdg
sexualized
SHA
Shivchander
36 changes: 36 additions & 0 deletions docs/sdg-repo.md
@@ -0,0 +1,36 @@
# New Repository Proposal: sdg

## Summary

This document proposes a new repository under the `instructlab` GitHub organization:

- `instructlab/sdg`

## Background

The `instructlab/instructlab` repository includes a basic implementation of
Synthetic Data Generation (SDG). This implementation does not cover the full
approach described in the [LAB paper](https://arxiv.org/abs/2403.01081).

We want to build a more complete implementation of SDG that follows the LAB
methodology more closely. We propose a new repository to house this code and
publish it as a new Python library called `instructlab-sdg`. The reasons for
a new repository and library include:

- We expect multiple consumers of this code. The `ilab` CLI is one, but we also
  envision building a REST API around it to help support scaling out this
  functionality on a cluster.
- We expect there is broader community interest in an open-source library and
  service for synthetic data generation. We envision this library could support
  other data generation techniques over time.

## Alternatives Considered

### Add to `instructlab/instructlab`

We could add this code to the existing `instructlab/instructlab` repository.

The primary argument against this approach is that we expect the scope of an
`instructlab-sdg` library to expand beyond what would be run by the `ilab` CLI.
We also envision a distinct community of contributors organizing around SDG
specifically.
