Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc for dedicated local storage in ilab cli #104

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
173 changes: 173 additions & 0 deletions docs/cli/ilab-local-config.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
# InstructLab CLI Local Storage Proposal

## Why is this needed?

For tools to be useful to the user, they generally need to work on and know about data pertinent to the user. For example, `podman` and `docker` track image blobs, `kubectl` tracks cluster credentials, etc.

The InstructLab CLI manages a number of settings today such as the teacher model, taxonomy location, inference model, etc. using a local file known as `config.yaml`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI itself doesn't manage any of this. It's user config. It generates an initial file for you, but never updates it from there. I only bring this up because user config isn't the same as data that is never touched directly by a user. (config vs data)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, I would say it makes a stronger point for the proposed structure of the ~/.local/share directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be so @RobotSail but as @russellb mentioned the CLI doesn't manage this.


While this method allows ilab to avoid additional complexity of managing a dedicated directory, it’s limited by its ability to scale beyond constraining the user to run `ilab` in a single directory.
Since the new ilab CLI will need to not only manage datasets, downloaded models, and any intermediary files, it'll also have to make it simple for the user to orchestrate these things.
nathan-weinberg marked this conversation as resolved.
Show resolved Hide resolved

We presently store these paths directly in the config, however this
isn't sustainable as data is frequently moved around unbeknownst to the program. This leaves the user with two choices: either update
the config to correctly point to either a relative path in the format
of `../../uncle-dir/dataset.jsonl`, or have to specify an absolute path. Both options are equally flexible and inconvenient for the user to produce.


Check failure on line 17 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:17 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
## What's being proposed?

We propose the following:

1. `ilab` will have a dedicated config directory seated at `~/.config/instructlab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if Windows support is added in the future.
Copy link
Member

@n1hility n1hility Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your intent to support XDG?:
https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html#introduction

Not arguing that you should necessarily but if you intended to I would change the text to be seated at $XDG_CONFIG_HOME/instructlab (.local/share/instructlab) (And XDG_DATA_HOME later on)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I was thinking of.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly recommend to use https://pypi.org/project/platformdirs/ . It has the correct defaults for all important platforms.

I also like to point out that the use of XDG directories does not work well with systemd services. @mairin just mentioned that she wants systemd service support for InstructLab. XDG directories are problematic, because they are designed for human end-users with a home directory. systemd --system need an entirely different directory structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiran Thanks for pointing this out! I had initially suggested XDG thinking of the laptop-targeted experience. But if we want a systemd service we'll need to use system directories. Ideally we'd use XDG for the laptop-target user experience and system-wide directories for the big mama GPU hw experience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ to this end, @tiran suggests I would store the paths in config.yaml and have a option to either use XDG or a common base directory for all paths.

2. `ilab` will have a dedicated program files directory seated at `~/.local/share/instructlab` which will store model checkpoints, downloaded models, etc.

Check failure on line 23 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Trailing spaces

docs/cli/ilab-local-config.md:23:155 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md009.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To carry through on the Windows spec I would say %LOCALAPPDATA% here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

3. Expectation of having config migrations in the future as changes are made to the config schema.


Check failure on line 26 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:26 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
## Config Directory

The `ilab` CLI will have a dedicated config directory rooted at
`~/.config/instructlab`.

This will just house the `config.yaml` as-is, and any other
configuration files that should live here.


Check failure on line 35 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:35 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
## Data Directory

`ilab` needs to generate, store, and manage different pieces of data over the course of its lifecycle. This includes model checkpoints, generated data, and various logs from processes such as evaluation.

Files like the model bases are large and take a significant amount of time to download. Similarly, the generated datasets are expensive to re-generate and should be easier to manage.

For these reasons, we propose that this data will have a permanent home at `~/.local/share/instructlab`.

Check failure on line 42 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Trailing spaces

docs/cli/ilab-local-config.md:42:105 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md009.md

This is a rough outline of all of the data pieces consumed and generated by different points in the process:

| phase | consumed data | generated data |
|-------|---------------|----------------|
| SDG | QnA files from taxonomy, teacher model checkpoint | knowledge dataset for training, skills dataset for training |
| Training | Input dataset, base model | Model checkpoints, processed dataset to be consumed during training |
| Evaluation | Model checkpoint being evaluated | JSON data / logs containing the score |
| Publishing | Model checkpoints to be published | n/a |


Check failure on line 53 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:53 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
Based on this, it's evident that we need a few different subdirectories:

| path | description |
|------|-------------|
| `base_models/` | Stores the base models that we download |
| `checkpoints/` | Houses the checkpoints we generate during training. Can potentially split these out further into `checkpoints/phase00`, `checkpoints/phase05`, and `checkpoints/phase10` |
nathan-weinberg marked this conversation as resolved.
Show resolved Hide resolved
| `datasets` | Would contain the generated datasets. Each item here would be a directory consisting of: `train.jsonl` and `test.jsonl` |
| `logs` | Would contain logs from various `ilab` runs |
| `eval_data` | Contains scores from various evaluation runs for a given model ref |


Check failure on line 64 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:64 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
## How ilab would manage this data

There should be a ***simple and intuitive*** way to view & manage all of this data.

The basic idea is that each piece of information can be discovered through a `list` command in some form. Each item in the listing will have the corresponding associated information, along with a unique reference that the user can use when interacting with subsequent commands. We could also provide a `--show-paths` or `--wide` flag which prints out where all of these data pieces are actually stored on the machine.


Check failure on line 71 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:71 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
### Manging model data


Check failure on line 74 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Multiple consecutive blank lines

docs/cli/ilab-local-config.md:74 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md012.md
New model checkpoints would be generated during training and
saved to the `checkpoints/` directory along with a unique ref.

Check failure on line 76 in docs/cli/ilab-local-config.md

View workflow job for this annotation

GitHub Actions / markdown-lint

Trailing spaces

docs/cli/ilab-local-config.md:76:63 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md009.md
When base models are downloaded, they would be saved in the `base_models` directory with their basename being the unique ref.


An `ilab model list` command would be added which allows the user to list out existing models. This would allow users to view all models, base models, trained checkpoints, etc. Additionally, the user could further filter out trained checkpoints based on which training phase they were generated in, i.e. `phase00`, `phase05`, and `phase10`.

We can iterate on the behavior and optimize for UX, but the
idea is you should be able to run either: `ilab model list` or
`ilab list models` which produces a full list of all models on the local system.

The output of `ilab model list` would look like the following:

```
MODEL ID BASE MODEL SIZE DATA TYPE DATASET TRAINED WITH
abcde ibm-granite/granite-7b-base 28Gi FP32 phase00-dataset-id
cbdea abcde 28Gi FP32 phase05-dataset-id
```



Finally, we would add a new `ilab model delete <ref>` command
which would enable us to pass a reference to the model we want
to delete.


### Managing datasets

New datasets would be generated through `ilab generate` (or whatever the current equivalent is) and the corresponding dataset would be saved under the `datasets/` directory. Each model would also have a unique ref associated with it.

Following this, we could list out the datasets with `ilab data list` and further filter them out with either `knowledge` and `skills`, or `phase00`, `phase05`, and `phase10` depending on which SDG scheme is being used.

Here's an example output from `ilab data list`

```
DATASET ID PHASE TYPE SIZE NUM SAMPLES TEACHER MODEL
phase00-dataset-id knowledge_base 500MB 650k mistralai/mixtral-8x7B
phase05-dataset-id knowledge 230MB 330k mistralai/mixtral-8x7B
phase10-dataset-id skils 500MB 550k mistralai/mixtral-8x7B
```

An `ilab data delete <ref>` command would be added as well
so that the user can easily delete generated datasets.

Additionally, the `ilab model train` command would be able
to receive a ref to a dataset for easy access.


### Managing Eval Runs

When we run evaluations, we expect to receive a certain value which is tied to a model checkpoint.

Under this new framework, the evaluation outputs could be saved under the `eval_data` subdirectory, where `ilab` could later use to get the score for a particular model.
RobotSail marked this conversation as resolved.
Show resolved Hide resolved
This has already been partly implemented in [PR#1369](https://github.com/instructlab/instructlab/pull/1369).

When we evaluate a particular model with `ilab eval <model-ref>`, we would then save the corresponding score in this new directory and link it back to the model that was evaluated and the metric.

By linking a given model with an evaluation score, we could provide an `ilab eval list` command that can output scores
for particular models.

Here's a sample output of the `ilab eval list` command:

```
MODEL REF BASE MODEL EVAL SCORE METRIC EVAL DATE
abcde ibm-granite/granite-7b-base 84.2% MMLU 2024-06-24
cbdea abcde 78.9% MMLU 2024-06-23
fghij ibm-granite/granite-7b-base 7.89 MT-Bench 2024-06-22
```

Since this would be JSON data, we don't expect it to be very expensive or consume a lot of space, so clearing it should not be a priority.


We could provide an `ilab eval clear` function that erases all of the evaluation data. We could also generate refs for the eval runs as well, and clear them using `ilab eval clear <eval-ref>`.


## Scope of this proposal


This proposal provides a long-term goal of numerous components and how we expect them to interact with each other.

The main point here is to move the config to live in a single place so that the `ilab` command can be executed from any directory with confidence. In addition, we require the use of refs and CRUD-style commands to simplify the management and interaction of data generated by `ilab`.

The steps outlined to implement this are:

1. Implement logic for the `~/.config/instructlab` and `~/.local/share/instructlab` directories, and expect the config to live and be read from there unless another config is manually specified. For simplicity, if the config we read in is located at `~/.config/instructlab` - the default location - then we'll assume relative paths should be resolved with respect to the data directory `~/.local/share/instructlab`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be that relative paths are resolved with respect to the config directory?

1. Implement logic to the existing commands so that datasets, models, and eval scores can be read from `ilab list`
1. Implement ref logic and update all existing commands to be able to resolve objects from refs


Based on these steps, #1 would be the trickiest because we assume that the paths referenced in the config file are relative to the calling process.

One approach here could be removing these paths from the default and forcing the user to specify them relatively.
Copy link
Member

@n1hility n1hility Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding what you mean on this sentence

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n1hility My wording here is bad but the idea is simple:

  • When the config read is the default one ~/.config/ilab/config.yaml then parse paths relative to ~/.local/share/ilab.
  • When that's NOT the config, we would parse relative paths with respect to the calling process. E.g. if ilab is invoked in /foo and the config.model_path references a file in bar/granite_7b.gguf then it would resolve to /foo/bar/granite_7b.gguf.
  • All absolute paths will be parsed as-is


Another approach would be implementing the list commands at the same time as #1 and setting it up so that the relative paths passed to `ilab` are relative ***to the `~/.local/share/instructlab` directory***
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any compatibility with existing configs? For example it looks in the current dir and if there is a config.yml expects abs paths, or is this an intentional major change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the experience be like without #3 (implementing refs) in your list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would list tell you were a file was in that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a breaking change then but we'd need to make it. We can maintain compatibility by allowing users to provide a --config flag which should read in the config as we do today. Not sure how much harder this would be versus what we're doing today.



## Follow-up work

In the future, we may want to have config migration and evolution so that we can be forward-compatible.
Loading