From 44aa5fbd43f640fa688662e258351c1c3acfd7a8 Mon Sep 17 00:00:00 2001
From: Russell Bryant <rbryant@redhat.com>
Date: Wed, 22 May 2024 16:14:26 -0400
Subject: [PATCH] New repository proposal: sdg

This document includes a proposal for a new repository which contains
a Python library focused on Synthetic Data Generation (SDG). This is a
discrete area of functionality that can evolve on its own with its own
group of contributors and maintainers.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
---
 .spellcheck-en-custom.txt |  1 +
 docs/sdg-repo.md          | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)
 create mode 100644 docs/sdg-repo.md

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 3f52ef6f..c362d6da 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -89,6 +89,7 @@ RX
 safetensors
 Salawu
 SDG
+sdg
 sexualized
 SHA
 Shivchander
diff --git a/docs/sdg-repo.md b/docs/sdg-repo.md
new file mode 100644
index 00000000..4895c2ed
--- /dev/null
+++ b/docs/sdg-repo.md
@@ -0,0 +1,36 @@
+# New Repository Proposal: sdg
+
+## Summary
+
+This document proposes a new repository under the `instructlab` GitHub organization:
+
+- `instructlab/sdg`
+
+## Background
+
+The `instructlab/instructlab` repository includes a basic implementation of
+Synthetic Data Generation (SDG). This implementation does not implement the full
+approach as described by the [LAB paper](https://arxiv.org/abs/2403.01081).
+
+We desire to build out a more complete implementation of SDG that is more in
+line with the LAB methodology. We propose a new repository to house this code
+that publishes a new Python library called `instructlab-sdg`. The reasoning for
+a new repository and library includes:
+
+- We expect multiple consumers of this code. The `ilab` CLI is one, but we also
+  envision building a REST API around it to help support scaling out this
+  functionality on a cluster.
+- We expect there is broader community interest in an open-source library and
+  service for synthetic data generation. We envision this library could support
+  other data generation techniques over time.
+
+## Alternatives Considered
+
+### Add to `instructlab/instructlab`
+
+We could add this code to the existing `instructlab/instructlab` repository.
+
+The primary argument against this approach is that we expect the scope of an
+`instructlab-sdg` library to expand beyond the scope of what would be run by the
+`ilab` CLI. We instead envision a different community of contributors organizing
+around SDG specifically.