From 1b40cb038e3b799bc8aafb6c0970c429ceb83925 Mon Sep 17 00:00:00 2001
From: Nathan Weinberg
Date: Mon, 10 Jun 2024 15:51:51 -0400
Subject: [PATCH 1/2] Proposal for new Evaluation repo

Signed-off-by: Nathan Weinberg
---
 .spellcheck-en-custom.txt    |  9 ++++---
 docs/evaluation/eval-repo.md | 67 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 docs/evaluation/eval-repo.md

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 4ebd275a..0154a3b5 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -34,6 +34,7 @@ Dropdown
 env
 EP
 Eval
+eval
 Excalidraw
 exfiltrate
 exfiltrating
@@ -52,6 +53,7 @@ Inferencing
 instructlab
 ISA
 JIT
+JSON
 Jupyter
 KAGGLE
 Kaggle
@@ -63,8 +65,8 @@ LLM
 llms
 LLVM
 lora
-md
 Markdownlint
+md
 Mergify
 Merlinite
 mimimum
@@ -72,10 +74,11 @@ Miniforge
 Mixtral
 MLX
 mlx
+MMLU
 NVidia
 Nvidia
-ollama
 Ollama
+ollama
 orchestrator
 ots
 Pareja
@@ -104,12 +107,12 @@ RX
 safetensors
 Salawu
 SDG
-Sigstore
 sdg
 sexualized
 SHA
 Shivchander
 Signoff
+Sigstore
 Srivastava
 subdirectory
 Sudalairaj
diff --git a/docs/evaluation/eval-repo.md b/docs/evaluation/eval-repo.md
new file mode 100644
index 00000000..acacff5f
--- /dev/null
+++ b/docs/evaluation/eval-repo.md
@@ -0,0 +1,67 @@
+# New Repository Proposal: eval
+
+## Summary
+
+This document proposes a new repository under the `instructlab` GitHub organization:
+
+- `instructlab/eval`
+
+## Background
+
+The `instructlab/instructlab` repository currently includes no real implementation
+of Evaluation as described by the [LAB paper](https://arxiv.org/abs/2403.01081). The
+closest implementation in `instructlab/instructlab` is the `ilab test` command.
+
+`ilab test` as of this writing is only implemented for macOS with M-series chips. It uses
+a JSON Lines file and a LoRA adapter to compare output of a given model before and after
+LoRA training with MLX, thus the macOS M-series dependency.
+
+We desire to build out an implementation closer to the described evaluation in the paper,
+using more high-level evaluation schemes such as
+[Multi-turn Benchmark](https://arxiv.org/abs/2306.05685) for skills and
+[Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (MMLU) for
+knowledge. We propose a new repository to house this code that publishes a new Python
+library called `instructlab-eval`. The reasoning for a new repository and library
+includes the following, with a hypothetical usage sketch after the list:
+
+- We expect multiple consumers of this code. The `ilab` CLI is one, but we also envision
+building a REST API around it to help support scaling out this functionality on a cluster.
+- We expect there is broader community interest in an open-source library and service for
+evaluation. We envision this library could support other evaluation techniques over time.
+- We also realize that much of model evaluation is generally useful outside the context of
+InstructLab. Other libraries may emerge in the broader ecosystem that handle parts of what
+we need, while this library will always remain to handle the InstructLab-specific details
+of how evaluation works in our workflow.
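+
+To make this concrete, here is a hypothetical sketch of how a consumer such as the
+`ilab` CLI might call such a library. Every name below (the module, the evaluator
+classes, and their arguments) is an illustrative assumption, not a committed API:
+
+```python
+# Hypothetical usage sketch -- none of these names are a committed API.
+from instructlab_eval import MTBenchEvaluator, MMLUEvaluator
+
+# Skills: judge multi-turn answers from a candidate model (MT-Bench style).
+skills = MTBenchEvaluator(model="models/candidate.gguf", judge="models/judge.gguf")
+print(skills.run())
+
+# Knowledge: score the candidate model on MMLU-style multiple-choice questions.
+knowledge = MMLUEvaluator(model="models/candidate.gguf", few_shot=5)
+print(knowledge.run())
+```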
+
+## Maintainers
+
+The initial team of maintainers for this repository will be a copy of the
+`Backend Maintainers` GitHub team.
+
+## Alternatives Considered
+
+### Add to `instructlab/instructlab`
+
+We could add this code to the existing `instructlab/instructlab` repository.
+
+The primary argument against this approach is that we expect the scope of an
+`instructlab-eval` library to expand beyond what would be run by the `ilab` CLI.
+We instead envision a different community of contributors organizing around
+Evaluation specifically.

From 686870edb181a5b36424d263f3bf851c1af02028 Mon Sep 17 00:00:00 2001
From: Nathan Weinberg
Date: Mon, 10 Jun 2024 21:37:04 -0400
Subject: [PATCH 2/2] Wording suggestion from Ali

Co-authored-by: Ali Maredia
Signed-off-by: Nathan Weinberg
---
 docs/evaluation/eval-repo.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/evaluation/eval-repo.md b/docs/evaluation/eval-repo.md
index acacff5f..1529567c 100644
--- a/docs/evaluation/eval-repo.md
+++ b/docs/evaluation/eval-repo.md
@@ -16,8 +16,8 @@ closest implementation in `instructlab/instructlab` is the `ilab test` command.
 a JSON Lines file and a LoRA adapter to compare output of a given model before and after
 LoRA training with MLX, thus the macOS M-series dependency.
 
-We desire to build out an implementation closer to the described evaluation in the paper,
-using more high-level evaluation schemes such as
+We desire to build out a library for methods that satisfy the evaluation described in the
+paper, using more high-level evaluation schemes such as
 [Multi-turn Benchmark](https://arxiv.org/abs/2306.05685) for skills and
 [Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (MMLU) for
 knowledge. We propose a new repository to house this code that publishes a new Python