instructlab · RobotSail · Jun 24, 2024 · Jun 25, 2024 · Jun 25, 2024 · russellb
diff --git a/docs/cli/ilab-local-config.md b/docs/cli/ilab-local-config.md
@@ -0,0 +1,173 @@
+# InstructLab CLI Local Storage Proposal
+
+## Why is this needed?
+
+For tools to be useful to the user, they generally need to work on and know about data pertinent to the user. For example, `podman` and `docker` track image blobs, `kubectl` tracks cluster credentials, etc.
+
+The InstructLab CLI manages a number of settings today such as the teacher model, taxonomy location, inference model, etc. using a local file known as `config.yaml`.
+
+While this method allows ilab to avoid additional complexity of managing a dedicated directory, it’s limited by its ability to scale beyond constraining the user to run `ilab` in a single directory.
+Since the new ilab CLI will need to not only manage datasets, downloaded models, and any intermediary files, it'll also have to make it simple for the user to orchestrate these things.
+
+We presently store these paths directly in the config, however this
+isn't sustainable as data is frequently moved around unbeknownst to the program. This leaves the user with two choices: either update
+the config to correctly point to either a relative path in the format
+of `../../uncle-dir/dataset.jsonl`, or have to specify an absolute path. Both options are equally flexible and inconvenient for the user to produce.
+
+
+## What's being proposed?
+
+We propose the following:
+
+1. `ilab` will have a dedicated config directory seated at `~/.config/instructlab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if Windows support is added in the future.
+2. `ilab` will have a dedicated program files directory seated at `~/.local/share/instructlab` which will store model checkpoints, downloaded models, etc. 
+3. Expectation of having config migrations in the future as changes are made to the config schema.
+
+
+## Config Directory
+
+The `ilab` CLI will have a dedicated config directory rooted at
+`~/.config/instructlab`.
+
+This will just house the `config.yaml` as-is, and any other
+configuration files that should live here.
+
+
+## Data Directory
+
+`ilab` needs to generate, store, and manage different pieces of data over the course of its lifecycle. This includes model checkpoints, generated data, and various logs from processes such as evaluation.
+
+Files like the model bases are large and take a significant amount of time to download. Similarly, the generated datasets are expensive to re-generate and should be easier to manage.
+
+For these reasons, we propose that this data will have a permanent home at `~/.local/share/instructlab`. 
+
+This is a rough outline of all of the data pieces consumed and generated by different points in the process:
+
+| phase | consumed data | generated data |
+|-------|---------------|----------------|
+| SDG | QnA files from taxonomy, teacher model checkpoint | knowledge dataset for training, skills dataset for training |
+| Training | Input dataset, base model | Model checkpoints, processed dataset to be consumed during training |
+| Evaluation | Model checkpoint being evaluated | JSON data / logs containing the score |
+| Publishing | Model checkpoints to be published | n/a |
+
+
+Based on this, it's evident that we need a few different subdirectories:
+
+| path | description |
+|------|-------------|
+| `base_models/` | Stores the base models that we download |
+| `checkpoints/` | Houses the checkpoints we generate during training. Can potentially split these out further into `checkpoints/phase00`, `checkpoints/phase05`, and `checkpoints/phase10` |
+| `datasets` | Would contain the generated datasets. Each item here would be a directory consisting of: `train.jsonl` and `test.jsonl` |
+| `logs` | Would contain logs from various `ilab` runs |
+| `eval_data` | Contains scores from various evaluation runs for a given model ref |
+
+
+## How ilab would manage this data
+
+There should be a ***simple and intuitive*** way to view & manage all of this data.
+
+The basic idea is that each piece of information can be discovered through a `list` command in some form. Each item in the listing will have the corresponding associated information, along with a unique reference that the user can use when interacting with subsequent commands. We could also provide a `--show-paths` or `--wide` flag which prints out where all of these data pieces are actually stored on the machine.
+
+
+### Manging model data
+
+
+New model checkpoints would be generated during training and
+saved to the `checkpoints/` directory along with a unique ref. 
+When base models are downloaded, they would be saved in the `base_models` directory with their basename being the unique ref.
+
+
+An `ilab model list` command would be added which allows the user to list out existing models. This would allow users to view all models, base models, trained checkpoints, etc. Additionally, the user could further filter out trained checkpoints based on which training phase they were generated in, i.e. `phase00`, `phase05`, and `phase10`.
+
+We can iterate on the behavior and optimize for UX, but the
+idea is you should be able to run either: `ilab model list` or
+`ilab list models` which produces a full list of all models on the local system.
+
+The output of `ilab model list` would look like the following:
+
+```
+MODEL ID    BASE MODEL                    SIZE    DATA TYPE    DATASET TRAINED WITH
+abcde       ibm-granite/granite-7b-base   28Gi    FP32         phase00-dataset-id
+cbdea       abcde                         28Gi    FP32         phase05-dataset-id
+```
+
+
+
+Finally, we would add a new `ilab model delete <ref>`  command 
+which would enable us to pass a reference to the model we want
+to delete.
+
+
+### Managing datasets
+
+New datasets would be generated through `ilab generate` (or whatever the current equivalent is) and the corresponding dataset would be saved under the `datasets/` directory. Each model would also have a unique ref associated with it.
+
+Following this, we could list out the datasets with `ilab data list` and further filter them out with either `knowledge` and `skills`, or `phase00`, `phase05`, and `phase10` depending on which SDG scheme is being used.
+
+Here's an example output from `ilab data list`
+
+```
+DATASET ID            PHASE TYPE        SIZE    NUM SAMPLES    TEACHER MODEL
+phase00-dataset-id    knowledge_base    500MB   650k           mistralai/mixtral-8x7B
+phase05-dataset-id    knowledge         230MB   330k           mistralai/mixtral-8x7B
+phase10-dataset-id    skils             500MB   550k           mistralai/mixtral-8x7B
+```
+
+An `ilab data delete <ref>` command would be added as well
+so that the user can easily delete generated datasets.
+
+Additionally, the `ilab model train` command would be able
+to receive a ref to a dataset for easy access.
+
+
+### Managing Eval Runs
+
+When we run evaluations, we expect to receive a certain value which is tied to a model checkpoint.
+
+Under this new framework, the evaluation outputs could be saved under the `eval_data` subdirectory, where `ilab` could later use to get the score for a particular model.
+This has already been partly implemented in [PR#1369](https://github.com/instructlab/instructlab/pull/1369).
+
+When we evaluate a particular model with `ilab eval <model-ref>`, we would then save the corresponding score in this new directory and link it back to the model that was evaluated and the metric.
+
+By linking a given model with an evaluation score, we could provide an `ilab eval list` command that can output scores
+for particular models.
+
+Here's a sample output of the `ilab eval list` command:
+
+```
+MODEL REF   BASE MODEL                    EVAL SCORE  METRIC     EVAL DATE 
+abcde      	ibm-granite/granite-7b-base   84.2%       MMLU       2024-06-24
+cbdea      	abcde                         78.9%       MMLU       2024-06-23
+fghij       ibm-granite/granite-7b-base   7.89        MT-Bench   2024-06-22
+```
+
+Since this would be JSON data, we don't expect it to be very expensive or consume a lot of space, so clearing it should not be a priority. 
+
+
+We could provide an `ilab eval clear` function that erases all of the evaluation data. We could also generate refs for the eval runs as well, and clear them using `ilab eval clear <eval-ref>`.
+
+
+## Scope of this proposal
+
+
+This proposal provides a long-term goal of numerous components and how we expect them to interact with each other.
+
+The main point here is to move the config to live in a single place so that the `ilab` command can be executed from any directory with confidence. In addition, we require the use of refs and CRUD-style commands to simplify the management and interaction of data generated by `ilab`.
+
+The steps outlined to implement this are:
+
+1. Implement logic for the `~/.config/instructlab` and `~/.local/share/instructlab` directories, and expect the config to live and be read from there unless another config is manually specified. For simplicity, if the config we read in is located at `~/.config/instructlab` - the default location - then we'll assume relative paths should be resolved with respect to the data directory `~/.local/share/instructlab`.
+1. Implement logic to the existing commands so that datasets, models, and eval scores can be read from `ilab list`
+1. Implement ref logic and update all existing commands to be able to resolve objects from refs
+
+
+Based on these steps, #1 would be the trickiest because we assume that the paths referenced in the config file are relative to the calling process. 
+
+One approach here could be removing these paths from the default and forcing the user to specify them relatively.
+
+Another approach would be implementing the list commands at the same time as #1 and setting it up so that the relative paths passed to `ilab` are relative ***to the `~/.local/share/instructlab` directory***
+
+
+## Follow-up work
+
+In the future, we may want to have config migration and evolution so that we can be forward-compatible.