Commit 010da7c

introduce ilab train peft, ilab train phased, and other commands for increased model fidelity

This enhancement discusses "more intensive" training and data generation techniques as well as a new Data Mixing command. This is all built off of the command redesign. The goal here is to produce higher fidelity models using the CLI.

Signed-off-by: Charlie Doern <[email protected]>

cdoern committed Jun 11, 2024, 1 parent 3f447c4, commit 010da7c
Showing 1 changed file with 210 additions and 0 deletions: docs/lofi-hifi-backends.md

---
# Introduce InstructLab profiles managed via `ilab profile ...` to run key commands at different fidelity levels

This document describes adding different data generation, mixing, and training backends to ilab to enable higher fidelity training using the backend code.

Rather than augmenting the existing commands with conflicting and confusing flags that the team will be forced to maintain, deprecate, and add onto, I am proposing we open the door to these new workflows via InstructLab "profiles" and the `ilab profile` command group.

Currently, all training is done via QLoRA or similar techniques. Adding the following commands will enable higher fidelity training and introduce new capabilities such as data mixing.

This document focuses primarily on training, specifically different configuration types or "profiles" for `ilab model train`.

## Key Component

### Building off of the InstructLab Structure Redesign

After github.com/instructlab/instructlab/pull/990 is merged, ilab will use a parent -> child command structure. This proposal operates under that new structure.

`ilab model train` will still be a single command offered to users. However, new "profiles" offered by the CLI will enable users to toggle which mode they are running InstructLab in. These profiles will initially just have global options for training. Rather than adding dozens of flags to the CLI, storing them in "pre-baked" configs is a good first step.

Eventually, we will want to enable an `ilab profile init` command that will allow users to initialize a custom profile based off of one of the pre-baked ones. This is not targeted for an upcoming release, though.

### Immediate Goals

For the near future, the goal is a single top-level profile that can be selected via `ilab profile set <profile_name>`.

A profile can be either a JSON or YAML object that would look something like:

```yaml
profile:
  train:
    gpus: 0-4
    epochs: 15
    accelerator: deepspeed
    ds_config: foobar.json
```

Eventually this profile would have settings for generation, eval, etc. But for immediate goals, hardcoded training settings are the MVP. Rather than having a `--config` option at the `ilab model train` level, storing the profile at the global level allows us to expand this idea to other ilab commands in the future. We need to be careful about how we introduce new concepts like this.

For immediate releases I would introduce the idea of a profile, a command to set a specific profile, and hardcoded profiles that plug into key ilab commands, namely training.
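As a sketch of that immediate user experience (the profile name below is a hypothetical placeholder, not a committed pre-baked name):

```shell
# Select one of the pre-baked profiles shipped with ilab (name is illustrative)
ilab profile set single-gpu-deepspeed

# Key commands then read their defaults from the active profile;
# existing flags continue to act as overrides
ilab model train --device=cuda
```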

### Reasoning

The underlying training, eval, and generation libraries will handle the specifics based off of the config provided via the profile. For example, if a user specifies CPU training with no DeepSpeed/FSDP, etc., the training library will run the equivalent of the "linux_train" path that currently exists, outputting a model ready to be used. If the user has 4 GPUs, DeepSpeed enabled, and 15 epochs, the training library might give you a series of checkpoints.

`ilab checkpoint evaluate` will be used in conjunction with `ilab model train` when the user is running multi-phase training. This command will run full-scale inter-checkpoint evaluation on the given directory. An output dir will then hold the best checkpoint and all necessary data to run another `ilab train phased` command on.

Plugging into hardware acceleration and multi-phase training is the logical next step for ilab. Ensuring we do this in a clean way that does not overload our current commands is also crucial. Many of the processes in the backend are confusing so we want to abstract some of the steps away from users while also giving them a reasonable amount of choice in configuring these new processes. However, maintaining the current laptop story is important to users without hardware access. Splitting these two paths into separate profiles clearly defined for users maintains the integrity of each.

We should not only maintain the current laptop story but enhance it. QLoRA by itself is great, but LoRA-based training with the ability to run ZeRO phases 1-3 or full FSDP acceleration, given the right hardware, will improve the experience for people on 1 or 2 GPU systems (which are quite common). The user profiles, and the ability to eventually initialize a custom profile, will allow users to mix and match these techniques.
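To make that split concrete, the two pre-baked paths might look roughly like the following (profile names, keys, and values here are illustrative only, not final pre-baked contents):

```yaml
# Hypothetical "laptop" profile: preserves the current QLoRA-style story
profile:
  train:
    device: cpu
    epochs: 10
---
# Hypothetical "dual-gpu" profile: LoRA/QLoRA with DeepSpeed on 2 GPUs
profile:
  train:
    gpus: 0-1
    epochs: 15
    accelerator: deepspeed
    ds_config: ds_zero2.json
```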

### Implementation specifics for `ilab model train`

`ilab model train` will maintain all of its current flags, which will act as overrides for the values stored in the training section of a user profile. This means all of the existing flags will be ported to a default config with sane values set for all that are applicable. Some flags that change frequently, like `--data-dir` and `--model-dir` for phased training, will be added to the training command.

The existing code which runs the HF trainer will be ported to another library which can also run multi-phase training. This library will be shelled out to using torchrun.
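For illustration only, the shell-out might look roughly like the following; the training entrypoint name is a hypothetical placeholder, and the process count would be derived from the profile's `gpus` entry:

```shell
# Hypothetical shell-out: two processes for gpus: 0-1; the module name
# "instructlab_training" and its flags are placeholders, not a settled interface
torchrun --nnodes=1 --nproc_per_node=2 -m instructlab_training \
  --model-dir ./phase00/model \
  --data-dir ./phase00/data \
  --ds-config ds_config.json
```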

Some new entries in the config file that will not be exposed as user-level flags are:

- `accelerator=str` (deepspeed, fsdp): describes the acceleration framework to use during training.
- `gpus=str`: describes the number of GPUs (of what is available) to use for this process. This comes in the form of: 0-1, 8, etc.
- `ds_config=str`: describes a path to a `.json` file configuring DeepSpeed.

Keep in mind, though, this is the community CLI! I feel as though we should try to find a middle ground between server use cases and community use cases. Exposing an eventual way to mix and match different training techniques makes sense for the following use cases:

1. Developer with a gaming PC:
   - Transformers + PyTorch support QLoRA and FSDP. While DeepSpeed might be more of a "server-rack" use case, having multi-phase training in the CLI for anyone with a consumer GPU makes sense.

2. Someone interested in ML, with a homelab, or *anything with 2 GPUs*:
   - Transformers + PyTorch support DeepSpeed on a single system, spreading the training over the GPUs. Any professional or hobbyist with 2 GPUs will be looking for this option.

3. The laptop use case:
   - Maintaining QLoRA as the performant training mode for the laptop is crucial, as most people cannot handle the full models. However, unlocking better results by using FSDP + QLoRA could improve local results and get people more interested in InstructLab.

The transformers library supports the following combinations:

1. LoRA + bitsandbytes quantized model + DeepSpeed (QLoRA + DeepSpeed)
   - Someone with a consumer GPU could get much better training results from this combo as opposed to torch without DeepSpeed and inter-checkpoint eval.
2. LoRA + FSDP (untested by me)
3. LoRA + DeepSpeed
   - This enables native multi-GPU support out of the box. Model loading, trainer setup, and DeepSpeed initialization are all handled by transformers. I have tested this and it works great on something like 2x4090 (3090 too) or 2xA10.

The DeepSpeed config used for PEFT and phased training will be pretty similar but differ in some key ways. Both of these will be hardcoded for the initial version.

transformers lets you put "auto" for any field that has a corresponding `TrainingArguments` option. For example, the dtype of the model can be passed into the DeepSpeed config using `bf16: auto` and `fp16: auto`.

For community use cases, a DeepSpeed config would look something like this (a concrete sketch follows this list):

- use ZeRO phase 2
- offload to CPU, since NVMe offloading is not guaranteed.
- use a train batch size of 48
- set FP16 and BF16 to auto; whichever one the model's dtype defaults to will be auto-populated into the config.
- optimizer is AdamW with the learning rate and all other params set to auto; these are most commonly set by transformers, especially if you pass in one of PyTorch's optimizers to the training arguments.
- scheduler is WarmupLR with all params set to auto and handled by transformers.
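Putting those bullets together, the community-oriented DeepSpeed config might look roughly like the following sketch (exact values beyond the ones listed above are illustrative, not the final hardcoded file):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "train_batch_size": 48,
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto"
}
```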

Using phase 3 is unnecessary here, as it targets larger systems with more data spread over more GPUs; it is also typically slower than phase 2 due to the additional sharding.

The key pieces of information here are the CPU offloading, the use of phase 2, and defaulting to "auto" for most other settings that are not directly related to performance uplifts for single or dual GPU support.

**NOTE**: Changing the way we use the transformers library is not in scope for upcoming releases. This document serves as a proposal for upcoming and future enhancements to training, generation, and evaluation.

The DeepSpeed config used for higher fidelity model training will have more hardcoded values until profile options are added in later versions, allowing users to configure things like the training batch size, optimizer offloading, etc. With PEFT, most of that is handled using the `auto` keyword that transformers can funnel defaults into.

However, if `--device=cpu` is set, the offload optimizer and other parts of the DeepSpeed config will not be relevant.


### Changes to `ilab data generate`

`ilab data generate` configuration should also be placed into top-level profiles. Whether this will be scoped for an upcoming release is unclear, though the end result should look something like:

```yaml
profile:
  train:
    gpus: 0-4
    epochs: 15
    accelerator: deepspeed
  generate:
    gpus: 1-2
    num_instructions: 1000
    taxonomy_path: /path/to/large/taxonomy
    num_grounded_questions: 10
    num_samples: 10
    ...
```

A lot of this exists in the current `config.yaml` structure. We could place this profile into config.yaml with these entries embedded into it. Though separating performance-related entries out of config.yaml and into these profiles makes more sense in the long run.

Some new entries like `num_grounded_questions`, `num_samples`, and `gpus` will only exist in the profile, but existing flags will remain as overrides for the profile values.
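For example, overriding a single profile value for one run might look like this (assuming the existing `--num-instructions` flag carries over under the new command structure):

```shell
# Profile supplies num_instructions: 1000; the flag overrides it for this run only
ilab data generate --num-instructions 500
```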

### `ilab data mix`

This command would take something like the following arguments:

- `--num-util-proc=int`
- `--output-dir=str` (defaults to generated/mixed)
- `--knowledge-recipes=[]str` (path to yaml)
- `--skill-recipes=[]str` (path to yaml)
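A hypothetical invocation, using only the arguments sketched above (the recipe paths are placeholders):

```shell
ilab data mix \
  --num-util-proc=8 \
  --knowledge-recipes=./recipes/knowledge.yaml \
  --skill-recipes=./recipes/skills.yaml \
  --output-dir=generated/mixed
```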

Similar to data generation and model training, performance related flags should be baked into a profile if any exist.

*Do we need an `ilab recipe` command?*


### `ilab model evaluate full` and `ilab model evaluate checkpoint`

This command will take something like the following flags:

- `--benchmarks=[mmlu, mt, pr-mmlu, pr-bench (mt-pr)]`, default: mmlu
  - You can only run certain benchmarks depending on what type of evaluation you are doing.
- `--output-dir=str`: determines where the best checkpoint is put for the next phase of training.
- `--input-dir=str`: takes the directory of the model/checkpoint to evaluate.

Note: We could have `ilab model evaluate` as a single command and take flags that depend on each other, like `--checkpoint-dir` and `--benchmarks`, but in general, with the new CLI design, we are trying to get out of the habit of flags that depend on each other.

This command will run inter-checkpoint evaluation on the output of an `ilab model train phased` command.
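For illustration, the checkpoint evaluation step of that flow might look like the following (paths are placeholders, flags as sketched above):

```shell
# Evaluate all checkpoints from the previous phase and promote the best one
ilab model evaluate checkpoint \
  --benchmarks=pr-mmlu \
  --input-dir=./phase05/checkpoints \
  --output-dir=./phase10
```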

If implemented into a user profile, evaluation would look something like this:

```yaml
profile:
  train:
    gpus: 0-4
    epochs: 1
  generate:
    gpus: 1-2
    num_instructions: 1000
    taxonomy_path: /path/to/large/taxonomy
    num_grounded_questions: 10
    num_samples: 10
  evaluate:
    benchmarks: mmlu
    ...
```

where (for now) the main entry in this profile is the ability to configure which benchmarks to run during evaluation, with the default simply being MMLU.


## Workflows to be included in the profiles now and in the future

### `ilab model train` using PEFT, DeepSpeed, and CUDA

A user on a desktop with a consumer GPU (assume an RTX 20/30 series) should be able to:

1. load the model in 4-bit quantized form onto the GPU's VRAM
2. set up the transformers trainer with this model and with a hardcoded DeepSpeed config ilab would come with
3. train for 5 epochs using AdamW as the optimizer and DeepSpeed on top of that
4. give you a model in safetensors format (we cannot convert a quantized safetensors model to GGUF)


The big advantage here is faster and higher fidelity training than currently exists in the CLI because of DeepSpeed (or FSDP). The user could even set this up for multi-GPU or multi-system support with future ilab enhancements. With enough VRAM (for example, multiple GPUs), the same flow could instead:

1. load the model in its full form onto the GPUs' VRAM (multiple GPUs in this case)
2. set up the transformers trainer with this model and with a hardcoded DeepSpeed config ilab would come with
3. train for 5 epochs using AdamW as the optimizer and DeepSpeed on top of that
4. give you a model in safetensors format or GGUF format, since this is not a bitsandbytes model


### `ilab model train` using multiple phases and checkpoint evaluation

A user would run something like the following on a GPU-enabled server (assuming phase00 has already run):

- `ilab model train --device=cuda --model-dir=./phase00/model --data-dir=./phase00/data`
- `ilab checkpoint evaluate ./phase05/checkpoints --output-dir=./phase10`
- `ilab model train --device=cuda --model-dir=./phase10/model --data-dir=./phase10/data`
- ...

Basically, they would run phased training with an eval in between. The eval looks at the checkpoints output by the previous phase and places a model directory in the next phase's working directory.

## Alternatives

The other alternative is to keep the same train and generate commands and instead add a `--backend` or `--hifi` flag to trigger the high fidelity code. The issue here is that `ilab train` is already overloaded with PyTorch, MLX, etc. Adding more switches and dials to the main train code will make it hard to maintain.




