From 1b40cb038e3b799bc8aafb6c0970c429ceb83925 Mon Sep 17 00:00:00 2001
From: Nathan Weinberg
Date: Mon, 10 Jun 2024 15:51:51 -0400
Subject: [PATCH 1/2] Proposal for new Evaluation repo

Signed-off-by: Nathan Weinberg
---
 .spellcheck-en-custom.txt    |  9 ++++---
 docs/evaluation/eval-repo.md | 67 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 docs/evaluation/eval-repo.md

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 4ebd275a..0154a3b5 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -34,6 +34,7 @@ Dropdown
 env
 EP
 Eval
+eval
 Excalidraw
 exfiltrate
 exfiltrating
@@ -52,6 +53,7 @@ Inferencing
 instructlab
 ISA
 JIT
+JSON
 Jupyter
 KAGGLE
 Kaggle
@@ -63,8 +65,8 @@ LLM
 llms
 LLVM
 lora
-md
 Markdownlint
+md
 Mergify
 Merlinite
 mimimum
@@ -72,10 +74,11 @@ Miniforge
 Mixtral
 MLX
 mlx
+MMLU
 NVidia
 Nvidia
-ollama
 Ollama
+ollama
 orchestrator
 ots
 Pareja
@@ -104,12 +107,12 @@ RX
 safetensors
 Salawu
 SDG
-Sigstore
 sdg
 sexualized
 SHA
 Shivchander
 Signoff
+Sigstore
 Srivastava
 subdirectory
 Sudalairaj
diff --git a/docs/evaluation/eval-repo.md b/docs/evaluation/eval-repo.md
new file mode 100644
index 00000000..acacff5f
--- /dev/null
+++ b/docs/evaluation/eval-repo.md
@@ -0,0 +1,67 @@
+# New Repository Proposal: eval
+
+## Summary
+
+This document proposes a new repository under the `instructlab` GitHub organization:
+
+- `instructlab/eval`
+
+## Background
+
+The `instructlab/instructlab` repository currently includes no real implementation
+of Evaluation as described by the [LAB paper](https://arxiv.org/abs/2403.01081). The
+closest implementation in `instructlab/instructlab` is the `ilab test` command.
+
+`ilab test` as of this writing is only implemented for macOS with M-series chips. It uses
+a JSON Lines file and a LoRA adapter to compare output of a given model before and after
+LoRA training with MLX, thus the macOS M-series dependency.
+
+We desire to build out an implementation closer to the described evaluation in the paper,
+using more high-level evaluation schemes such as
+[Multi-turn Benchmark](https://arxiv.org/abs/2306.05685) for skills and
+[Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (MMLU) for
+knowledge. We propose a new repository to house this code that publishes a new Python
+library called `instructlab-eval`. The reasoning for a new repository and library
+includes the following, with a hypothetical usage sketch after the list:
+
+- We expect multiple consumers of this code. The `ilab` CLI is one, but we also envision
+building a REST API around it to help support scaling out this functionality on a cluster.
+- We expect there is broader community interest in an open-source library and service for
+evaluation. We envision this library could support other evaluation techniques over time.
+- We also realize that much of model evaluation is generally useful outside the context of
+InstructLab. Other libraries may emerge in the broader ecosystem that handle parts of what
+we need, while this library will always remain to handle the InstructLab-specific details
+of how evaluation works in our workflow.
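+
+To make this concrete, here is a hypothetical sketch of how a consumer such as the
+`ilab` CLI might call such a library. Every name below (the module, the evaluator
+classes, and their arguments) is an illustrative assumption, not a committed API:
+
+```python
+# Hypothetical usage sketch -- none of these names are a committed API.
+from instructlab_eval import MTBenchEvaluator, MMLUEvaluator
+
+# Skills: judge multi-turn answers from a candidate model (MT-Bench style).
+skills = MTBenchEvaluator(model="models/candidate.gguf", judge="models/judge.gguf")
+print(skills.run())
+
+# Knowledge: score the candidate model on MMLU-style multiple-choice questions.
+knowledge = MMLUEvaluator(model="models/candidate.gguf", few_shot=5)
+print(knowledge.run())
+```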
+
+## Maintainers
+
+The initial team of maintainers for this repository will be a copy of the
+`Backend Maintainers` GitHub team.
+
+## Alternatives Considered
+
+### Add to `instructlab/instructlab`
+
+We could add this code to the existing `instructlab/instructlab` repository.
+
+The primary argument against this approach is that we expect the scope of an
+`instructlab-eval` library to expand beyond what would be run by the `ilab` CLI.
+We instead envision a different community of contributors organizing around
+Evaluation specifically.

From 686870edb181a5b36424d263f3bf851c1af02028 Mon Sep 17 00:00:00 2001
From: Nathan Weinberg
Date: Mon, 10 Jun 2024 21:37:04 -0400
Subject: [PATCH 2/2] Wording suggestion from Ali

Co-authored-by: Ali Maredia
Signed-off-by: Nathan Weinberg
---
 docs/evaluation/eval-repo.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/evaluation/eval-repo.md b/docs/evaluation/eval-repo.md
index acacff5f..1529567c 100644
--- a/docs/evaluation/eval-repo.md
+++ b/docs/evaluation/eval-repo.md
@@ -16,8 +16,8 @@ closest implementation in `instructlab/instructlab` is the `ilab test` command.
 a JSON Lines file and a LoRA adapter to compare output of a given model before and after
 LoRA training with MLX, thus the macOS M-series dependency.
 
-We desire to build out an implementation closer to the described evaluation in the paper,
-using more high-level evaluation schemes such as
+We desire to build out a library for methods that satisfy the evaluation described in the
+paper, using more high-level evaluation schemes such as
 [Multi-turn Benchmark](https://arxiv.org/abs/2306.05685) for skills and
 [Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (MMLU) for
 knowledge. We propose a new repository to house this code that publishes a new Python