-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc for dedicated local storage in ilab cli #104
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
# InstructLab CLI Local Storage Proposal | ||
|
||
## Why is this needed? | ||
|
||
For tools to be useful to the user, they generally need to work on and know about data pertinent to the user. For example, `podman` and `docker` track image blobs, `kubectl` tracks cluster credentials, etc. | ||
|
||
The InstructLab CLI manages a number of settings today such as the teacher model, taxonomy location, inference model, etc. using a local file known as `config.yaml`. | ||
|
||
While this method allows ilab to avoid additional complexity of managing a dedicated directory, it’s limited by its ability to scale beyond constraining the user to run `ilab` in a single directory. | ||
Since the new ilab CLI will need to not only manage datasets, downloaded models, and any intermediary files, it'll also have to make it simple for the user to orchestrate these things. | ||
nathan-weinberg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
We presently store these paths directly in the config, however this | ||
isn't sustainable as data is frequently moved around unbeknownst to the program. This leaves the user with two choices: either update | ||
the config to correctly point to either a relative path in the format | ||
of `../../uncle-dir/dataset.jsonl`, or have to specify an absolute path. Both options are equally flexible and inconvenient for the user to produce. | ||
|
||
|
||
Check failure on line 17 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
## What's being proposed? | ||
|
||
We propose the following: | ||
|
||
1. `ilab` will have a dedicated config directory seated at `~/.config/instructlab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if Windows support is added in the future. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is your intent to support XDG?: Not arguing that you should necessarily but if you intended to I would change the text to be seated at $XDG_CONFIG_HOME/instructlab (.local/share/instructlab) (And XDG_DATA_HOME later on) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes that's what I was thinking of. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I strongly recommend to use https://pypi.org/project/platformdirs/ . It has the correct defaults for all important platforms. I also like to point out that the use of XDG directories does not work well with systemd services. @mairin just mentioned that she wants systemd service support for InstructLab. XDG directories are problematic, because they are designed for human end-users with a home directory. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tiran Thanks for pointing this out! I had initially suggested XDG thinking of the laptop-targeted experience. But if we want a systemd service we'll need to use system directories. Ideally we'd use XDG for the laptop-target user experience and system-wide directories for the big mama GPU hw experience. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ^ to this end, @tiran suggests |
||
2. `ilab` will have a dedicated program files directory seated at `~/.local/share/instructlab` which will store model checkpoints, downloaded models, etc. | ||
Check failure on line 23 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintTrailing spaces
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To carry through on the Windows spec I would say %LOCALAPPDATA% here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the clarification! |
||
3. Expectation of having config migrations in the future as changes are made to the config schema. | ||
|
||
|
||
Check failure on line 26 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
## Config Directory | ||
|
||
The `ilab` CLI will have a dedicated config directory rooted at | ||
`~/.config/instructlab`. | ||
|
||
This will just house the `config.yaml` as-is, and any other | ||
configuration files that should live here. | ||
|
||
|
||
Check failure on line 35 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
## Data Directory | ||
|
||
`ilab` needs to generate, store, and manage different pieces of data over the course of its lifecycle. This includes model checkpoints, generated data, and various logs from processes such as evaluation. | ||
|
||
Files like the model bases are large and take a significant amount of time to download. Similarly, the generated datasets are expensive to re-generate and should be easier to manage. | ||
|
||
For these reasons, we propose that this data will have a permanent home at `~/.local/share/instructlab`. | ||
Check failure on line 42 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintTrailing spaces
|
||
|
||
This is a rough outline of all of the data pieces consumed and generated by different points in the process: | ||
|
||
| phase | consumed data | generated data | | ||
|-------|---------------|----------------| | ||
| SDG | QnA files from taxonomy, teacher model checkpoint | knowledge dataset for training, skills dataset for training | | ||
| Training | Input dataset, base model | Model checkpoints, processed dataset to be consumed during training | | ||
| Evaluation | Model checkpoint being evaluated | JSON data / logs containing the score | | ||
| Publishing | Model checkpoints to be published | n/a | | ||
|
||
|
||
Check failure on line 53 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
Based on this, it's evident that we need a few different subdirectories: | ||
|
||
| path | description | | ||
|------|-------------| | ||
| `base_models/` | Stores the base models that we download | | ||
| `checkpoints/` | Houses the checkpoints we generate during training. Can potentially split these out further into `checkpoints/phase00`, `checkpoints/phase05`, and `checkpoints/phase10` | | ||
nathan-weinberg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| `datasets` | Would contain the generated datasets. Each item here would be a directory consisting of: `train.jsonl` and `test.jsonl` | | ||
| `logs` | Would contain logs from various `ilab` runs | | ||
| `eval_data` | Contains scores from various evaluation runs for a given model ref | | ||
|
||
|
||
Check failure on line 64 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
## How ilab would manage this data | ||
|
||
There should be a ***simple and intuitive*** way to view & manage all of this data. | ||
|
||
The basic idea is that each piece of information can be discovered through a `list` command in some form. Each item in the listing will have the corresponding associated information, along with a unique reference that the user can use when interacting with subsequent commands. We could also provide a `--show-paths` or `--wide` flag which prints out where all of these data pieces are actually stored on the machine. | ||
|
||
|
||
Check failure on line 71 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
### Manging model data | ||
|
||
|
||
Check failure on line 74 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintMultiple consecutive blank lines
|
||
New model checkpoints would be generated during training and | ||
saved to the `checkpoints/` directory along with a unique ref. | ||
Check failure on line 76 in docs/cli/ilab-local-config.md GitHub Actions / markdown-lintTrailing spaces
|
||
When base models are downloaded, they would be saved in the `base_models` directory with their basename being the unique ref. | ||
|
||
|
||
An `ilab model list` command would be added which allows the user to list out existing models. This would allow users to view all models, base models, trained checkpoints, etc. Additionally, the user could further filter out trained checkpoints based on which training phase they were generated in, i.e. `phase00`, `phase05`, and `phase10`. | ||
|
||
We can iterate on the behavior and optimize for UX, but the | ||
idea is you should be able to run either: `ilab model list` or | ||
`ilab list models` which produces a full list of all models on the local system. | ||
|
||
The output of `ilab model list` would look like the following: | ||
|
||
``` | ||
MODEL ID BASE MODEL SIZE DATA TYPE DATASET TRAINED WITH | ||
abcde ibm-granite/granite-7b-base 28Gi FP32 phase00-dataset-id | ||
cbdea abcde 28Gi FP32 phase05-dataset-id | ||
``` | ||
|
||
|
||
|
||
Finally, we would add a new `ilab model delete <ref>` command | ||
which would enable us to pass a reference to the model we want | ||
to delete. | ||
|
||
|
||
### Managing datasets | ||
|
||
New datasets would be generated through `ilab generate` (or whatever the current equivalent is) and the corresponding dataset would be saved under the `datasets/` directory. Each model would also have a unique ref associated with it. | ||
|
||
Following this, we could list out the datasets with `ilab data list` and further filter them out with either `knowledge` and `skills`, or `phase00`, `phase05`, and `phase10` depending on which SDG scheme is being used. | ||
|
||
Here's an example output from `ilab data list` | ||
|
||
``` | ||
DATASET ID PHASE TYPE SIZE NUM SAMPLES TEACHER MODEL | ||
phase00-dataset-id knowledge_base 500MB 650k mistralai/mixtral-8x7B | ||
phase05-dataset-id knowledge 230MB 330k mistralai/mixtral-8x7B | ||
phase10-dataset-id skils 500MB 550k mistralai/mixtral-8x7B | ||
``` | ||
|
||
An `ilab data delete <ref>` command would be added as well | ||
so that the user can easily delete generated datasets. | ||
|
||
Additionally, the `ilab model train` command would be able | ||
to receive a ref to a dataset for easy access. | ||
|
||
|
||
### Managing Eval Runs | ||
|
||
When we run evaluations, we expect to receive a certain value which is tied to a model checkpoint. | ||
|
||
Under this new framework, the evaluation outputs could be saved under the `eval_data` subdirectory, where `ilab` could later use to get the score for a particular model. | ||
RobotSail marked this conversation as resolved.
Show resolved
Hide resolved
|
||
This has already been partly implemented in [PR#1369](https://github.com/instructlab/instructlab/pull/1369). | ||
|
||
When we evaluate a particular model with `ilab eval <model-ref>`, we would then save the corresponding score in this new directory and link it back to the model that was evaluated and the metric. | ||
|
||
By linking a given model with an evaluation score, we could provide an `ilab eval list` command that can output scores | ||
for particular models. | ||
|
||
Here's a sample output of the `ilab eval list` command: | ||
|
||
``` | ||
MODEL REF BASE MODEL EVAL SCORE METRIC EVAL DATE | ||
abcde ibm-granite/granite-7b-base 84.2% MMLU 2024-06-24 | ||
cbdea abcde 78.9% MMLU 2024-06-23 | ||
fghij ibm-granite/granite-7b-base 7.89 MT-Bench 2024-06-22 | ||
``` | ||
|
||
Since this would be JSON data, we don't expect it to be very expensive or consume a lot of space, so clearing it should not be a priority. | ||
|
||
|
||
We could provide an `ilab eval clear` function that erases all of the evaluation data. We could also generate refs for the eval runs as well, and clear them using `ilab eval clear <eval-ref>`. | ||
|
||
|
||
## Scope of this proposal | ||
|
||
|
||
This proposal provides a long-term goal of numerous components and how we expect them to interact with each other. | ||
|
||
The main point here is to move the config to live in a single place so that the `ilab` command can be executed from any directory with confidence. In addition, we require the use of refs and CRUD-style commands to simplify the management and interaction of data generated by `ilab`. | ||
|
||
The steps outlined to implement this are: | ||
|
||
1. Implement logic for the `~/.config/instructlab` and `~/.local/share/instructlab` directories, and expect the config to live and be read from there unless another config is manually specified. For simplicity, if the config we read in is located at `~/.config/instructlab` - the default location - then we'll assume relative paths should be resolved with respect to the data directory `~/.local/share/instructlab`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should it be that relative paths are resolved with respect to the config directory? |
||
1. Implement logic to the existing commands so that datasets, models, and eval scores can be read from `ilab list` | ||
1. Implement ref logic and update all existing commands to be able to resolve objects from refs | ||
|
||
|
||
Based on these steps, #1 would be the trickiest because we assume that the paths referenced in the config file are relative to the calling process. | ||
|
||
One approach here could be removing these paths from the default and forcing the user to specify them relatively. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am having trouble understanding what you mean on this sentence There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @n1hility My wording here is bad but the idea is simple:
|
||
|
||
Another approach would be implementing the list commands at the same time as #1 and setting it up so that the relative paths passed to `ilab` are relative ***to the `~/.local/share/instructlab` directory*** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any compatibility with existing configs? For example it looks in the current dir and if there is a config.yml expects abs paths, or is this an intentional major change There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What would the experience be like without #3 (implementing refs) in your list? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would list tell you were a file was in that case? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be a breaking change then but we'd need to make it. We can maintain compatibility by allowing users to provide a |
||
|
||
|
||
## Follow-up work | ||
|
||
In the future, we may want to have config migration and evolution so that we can be forward-compatible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CLI itself doesn't manage any of this. It's user config. It generates an initial file for you, but never updates it from there. I only bring this up because user config isn't the same as data that is never touched directly by a user. (config vs data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, I would say it makes a stronger point for the proposed structure of the
~/.local/share
directory.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be so @RobotSail but as @russellb mentioned the CLI doesn't manage this.