From 44aa5fbd43f640fa688662e258351c1c3acfd7a8 Mon Sep 17 00:00:00 2001
From: Russell Bryant <rbryant@redhat.com>
Date: Wed, 22 May 2024 16:14:26 -0400
Subject: [PATCH] New repository proposal: sdg

This document includes a proposal for a new repository which contains
a Python library focused on Synthetic Data Generation (SDG). This is a
discrete area of functionality that can evolve on its own with its own
group of contributors and maintainers.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
---
 .spellcheck-en-custom.txt |  1 +
 docs/sdg-repo.md          | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)
 create mode 100644 docs/sdg-repo.md

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 3f52ef6f..c362d6da 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -89,6 +89,7 @@ RX
 safetensors
 Salawu
 SDG
+sdg
 sexualized
 SHA
 Shivchander
diff --git a/docs/sdg-repo.md b/docs/sdg-repo.md
new file mode 100644
index 00000000..4895c2ed
--- /dev/null
+++ b/docs/sdg-repo.md
@@ -0,0 +1,36 @@
+# New Repository Proposal: sdg
+
+## Summary
+
+This document proposes a new repository under the `instructlab` GitHub organization:
+
+- `instructlab/sdg`
+
+## Background
+
+The `instructlab/instructlab` repository includes a basic implementation of
+Synthetic Data Generation (SDG). This implementation does not cover the full
+approach described in the [LAB paper](https://arxiv.org/abs/2403.01081).
+
+We want to build a more complete implementation of SDG that is more closely
+aligned with the LAB methodology. We propose a new repository to house this
+code and publish it as a new Python library called `instructlab-sdg`. The
+reasons for a new repository and library include:
+
+- We expect multiple consumers of this code. The `ilab` CLI is one, but we also
+  envision building a REST API around it to help support scaling out this
+  functionality on a cluster.
+- We expect there is broader community interest in an open-source library and
+  service for synthetic data generation. We envision this library could support
+  other data generation techniques over time.
+
+## Alternatives Considered
+
+### Add to `instructlab/instructlab`
+
+We could add this code to the existing `instructlab/instructlab` repository.
+
+The primary argument against this approach is that we expect the scope of the
+`instructlab-sdg` library to expand beyond what the `ilab` CLI would run. We
+instead envision a different community of contributors organizing around SDG
+specifically.