# Introduce InstructLab Profiles Managed via `ilab profile...` to Run Key Commands at Different Fidelity Levels

This document describes adding different data generation, mixing, and training backends to `ilab` to enable higher fidelity training using the backend code. By higher fidelity we mean models that perform better because they were trained on better hardware, on larger data sets, and with more intensive training techniques.

Rather than augmenting the existing commands with conflicting and confusing flags that the team will be forced to maintain, deprecate, and add onto, I am proposing we open the door to these new workflows via InstructLab "profiles" and the `ilab profile` command group.

This document focuses primarily on training, specifically different configuration types or "profiles" for `ilab model train`.

## Key Component

### Building off of the InstructLab Structure Redesign

`ilab model train` will still be a single command offered to users. However, new "profiles" offered by the CLI will enable users to toggle what mode they are running InstructLab in. These profiles will initially just hold global options for training. Rather than adding dozens of flags to the CLI, storing them in "pre-baked" configs is a good first step.

Eventually, we will want to enable an `ilab profile init` command that will allow users to initialize a custom profile based on one of the pre-baked ones. This is not targeted for an upcoming release though.

### Immediate Goals

For the near future, there will be a single top-level profile that can be initialized via `ilab profile set <path>`.

A profile can be either a JSON or YAML object that would look something like:

```yaml
profile:
  train:
    gpus: 0-4
    epochs: 15
    accelerator: deepspeed
    ds_config: foobar.json
```

#### Where would this be stored?

Click has the ability to associate objects with the context passed around the command structure. The profile could be:

1. an additional object on the Click context alongside the existing `ctx.obj.config` object; it would be referenced as `ctx.obj.profile`.
2. part of the config object itself, moving some of the existing generation options into a sub-object called `profile`.

#### Is this just for training?

For now, yes. The forthcoming training library is taking an approach where it will have pre-baked "config" objects that feed into the library. These profiles are an attempt to expand this workflow to other commands. Putting the structure in place for training only in upcoming releases allows us to iterate on this for evaluation, generation, chatting, etc.

Eventually this profile would have settings for generation, evaluation, etc. But for immediate goals, hardcoded training settings are the MVP. Rather than having a `--config` option at the `ilab model train` level, storing the profile at the global level allows us to expand this idea to other `ilab` commands in the future. We need to be careful about how we introduce new concepts like this.

**PLEASE NOTE:**

**For immediate releases I would introduce the idea of a profile, a command to set a specific profile, and hardcoded profiles that plug into key `ilab` commands, namely training.**

### Reasoning

The profile settings will be used as arguments for most if not all libraries being introduced to the ilab backend.
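To make this concrete, here is a minimal sketch of option 1 from "Where would this be stored?": the profile is loaded once, attached to the Click context alongside the config object, and its `train` section is handed to a backend library as keyword arguments. This is illustrative only: for brevity it reads the profile path from a flag rather than the persisted state an `ilab profile set` command would manage, and `run_training` is a hypothetical stand-in, not the real library entry point.

```python
import click
import yaml


def load_profile(path: str) -> dict:
    """Parse a profile file (YAML is a superset of JSON, so one loader covers both)."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)["profile"]


def run_training(**settings):
    """Hypothetical stand-in for the forthcoming training library's entry point."""
    print(f"would train with: {settings}")


@click.group()
@click.option("--profile", "profile_path", default=None, help="Path to a profile file.")
@click.pass_context
def ilab(ctx, profile_path):
    ctx.ensure_object(dict)
    # Option 1: the profile lives on the Click context next to the config object.
    ctx.obj["profile"] = load_profile(profile_path) if profile_path else {}


@ilab.command()
@click.option("--epochs", type=int, default=None)
@click.pass_context
def train(ctx, epochs):
    settings = dict(ctx.obj["profile"].get("train", {}))
    if epochs is not None:
        # Existing CLI flags act as overrides for the profile values.
        settings["epochs"] = epochs
    run_training(**settings)


if __name__ == "__main__":
    ilab()
```

With this shape, something like `ilab --profile profiles/multi_gpu_cuda.yaml train` would pull `epochs`, `gpus`, etc. from the profile while still honoring explicitly passed flags.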
Plugging into hardware acceleration and multi-phase training is the logical next step for ilab. Ensuring we do this in a clean way that does not overload our current commands is also crucial. Many of the processes in the backend are confusing, so we want to abstract some of the steps away from users while also giving them a reasonable amount of choice in configuring these new processes. However, maintaining the current laptop story is important to users without hardware access. Splitting these two paths into separate, clearly defined profiles maintains the integrity of each.

Having a single entrypoint in terms of the command is a safe bet, since making different training commands for each technique we introduce will scale poorly. `ilab model train`, after setting a profile, will account for most of the use cases `ilab` targets.

We should not only maintain the current laptop story but enhance it. QLoRA by itself is great, but having LoRA-based training with the ability to run ZeRO stages 1-3 or full FSDP acceleration given the right hardware will improve the experience for people on 1 or 2 GPU systems (which are quite common). The user profiles and the ability to eventually initialize a custom profile will allow users to mix and match these techniques.

### Implementation specifics for `ilab model train`

`ilab model train` will maintain all of its current flags, which will act as overrides for the values stored in the training section of a user profile. This means all of the existing flags will be ported to a default config with sane values set for all that are applicable. Some flags that change frequently, like `--data-dir` and `--model-dir` for phased training, will be added to the training command.

The existing code which runs the HF trainer will be ported to another library which can also run multi-phase training. This library will be shelled out to using torchrun.

Some new entries in the config file that will not be exposed as user-level flags are:

- accelerator=str (deepspeed, fsdp) describes the acceleration framework to use during training
- gpus=str describes the number of GPUs (of those available) to use for this process. This comes in the form of: 0-1, 8, etc.
- ds_config=str describes a path to a .json file configuring deepspeed.
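As a rough sketch of how these entries could drive the torchrun shell-out described above — the module path, flag names, and `gpus` parsing here are all assumptions, not a settled interface:

```python
import subprocess


def launch_training(train_profile: dict, model_dir: str, data_dir: str) -> None:
    """Shell out to the (forthcoming) training library under torchrun,
    driven by the profile's train section."""
    gpus = str(train_profile.get("gpus", "1"))
    if "-" in gpus:
        # Range form, e.g. "0-4" -> device indices 0 through 4.
        start, end = gpus.split("-")
        nproc = int(end) - int(start) + 1
    else:
        # Plain count form, e.g. "8".
        nproc = int(gpus)

    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc}",
        "-m", "instructlab.training",  # hypothetical module path for the new library
        f"--model-dir={model_dir}",
        f"--data-dir={data_dir}",
    ]
    if train_profile.get("accelerator") == "deepspeed":
        cmd.append(f"--deepspeed-config={train_profile['ds_config']}")

    subprocess.run(cmd, check=True)
```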
Trying to find a middle ground between the server use cases and the laptop use case is important. Exposing an eventual way to mix and match different training techniques makes sense for the following use cases:

1. Developer with a gaming PC:
   * Transformers + PyTorch support QLoRA and FSDP. While DeepSpeed might be more of a "server rack" use case, having multi-phase training in the CLI for anyone with a consumer GPU makes sense.
2. Someone interested in ML, has a homelab, or *anything with 2 GPUs*:
   * Transformers + PyTorch supports DeepSpeed on a single system, spreading the training over the GPUs. Any professional or hobbyist with 2 GPUs will be looking for this option.
3. The laptop use case:
   * Maintaining QLoRA as the performant training mode for the laptop is crucial, as most laptops cannot handle the full models. However, unlocking some better results by using FSDP+QLoRA could improve local results and get people more interested in InstructLab.

The transformers library supports the following combinations:

1. LoRA + BitsAndBytes quantized model + DeepSpeed (QLoRA + DeepSpeed)
   - Someone with a consumer GPU could get much better training results from using this combo, as opposed to torch without DeepSpeed and inter-checkpoint eval.
2. LoRA + FSDP
3. LoRA + DeepSpeed
   - This enables native multi-GPU support out of the box. Model loading, trainer setup, and DeepSpeed initialization are all handled by transformers. I have tested this and it works great on something like 2x4090 (3090 too) or 2xA10.

**NOTE**: Changing the way we use the transformers library is not in scope for upcoming releases. This document serves as a proposal for upcoming and future enhancements to training, generation, and evaluation. The notes above regarding combining LoRA, DeepSpeed, FSDP, etc. are meant to guide us in future training enhancements and to describe how there are too many options to add as conflicting flags.

### Changes to `ilab data generate`

`ilab data generate` configuration should also be placed into top-level profiles. Whether this will be scoped for an upcoming release is unclear, though the end result should look something like:

NOTE: **GENERATION PROFILES ARE NOT IN SCOPE FOR UPCOMING RELEASES BUT ARE BEING DOCUMENTED HERE FOR CLARITY**

```yaml
profile:
  train:
    gpus: 0-4
    epochs: 15
    accelerator: deepspeed
  generate:
    gpus: 1-2
    num_instructions: 1000
    taxonomy_path: /path/to/large/taxonomy
    num_grounded_questions: 10
    num_samples: 10
    ...
```

A lot of this exists in the current `config.yaml` structure. We could place this profile into config.yaml with these entries embedded into it. Though, separating performance-related entries out of config.yaml and into these profiles makes more sense in the long run.

Some new entries, like `num_grounded_questions`, `num_samples`, and `gpus`, will only exist in the profile, but existing flags will remain as overrides for the profile values.

### `ilab data mix`

This command would take something like the following arguments:

- --num-util-proc=int
- --output-dir=str (defaults to generated/mixed)
- --knowledge-recipes=[]str (path to yaml)
- --skill-recipes=[]str (path to yaml)

Similar to data generation and model training, performance-related flags should be baked into a profile if any exist; a sketch of what that could look like follows.
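If such performance-related flags do emerge, a `mix` section of the profile might look like the following. This is purely illustrative — every entry and value mirrors a flag above rather than a settled schema:

```yaml
profile:
  mix:
    num_util_proc: 8
    output_dir: generated/mixed
    knowledge_recipes:
      - /path/to/knowledge_recipe.yaml
    skill_recipes:
      - /path/to/skills_recipe.yaml
```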
This is what an example user profile would look like: + +NOTE: **EVALUATION PROFILES ARE NOT IN SCOPE FOR UPCOMING RELEASES BUT ARE BEING DOCUMENTED HERE FOR CLARITY** + +```yaml +profile: + train: + gpus: 0-4 + epochs: 1 + generate: + gpus: 1-2 + num_instructions: 1000 + taxonomy_path: /path/to/large/taxonomy + num_grounded_questions: 10 + num_samples: 10 + evaluate: + mmlu: + batch-size: 48 + few-shots: 1 + model: granite + ... +``` + +## Workflows using proifiles + +1. `ilab profile set profiles/cpu_only.yaml` +2. `ilab model generate` +3. `ilab model train` + +to use multi-gpu: + +1. `ilab prodile set profiles/multi_gpu_cuda.yaml` +2. `ilab model generate` +3. `ilab model train --model-dir=./phase00/model --data-dir=./phase00/data` +4. `ilab checkpoint evaluate --model-dir=./phase10/model --data-dir=./phase10/data` + +the above multi-gpu is a good example of where the profiles come into play. If the user had not set their profile, the user would need to specify flags like: + +`ilab model generate --num-instructions=1000 --num-cpus=15` +`ilab model train --devide=cuda --gpus=0:4 --epochs=15 --config=training_config.json --accelerator=deepspeed --deepspeed_config=ds_config.json --model-dir=./phase00/model --data-dir=./phase00/data --model-repo=instructlab/granite-7b-lab` +... + +These are just GPU enabled flags I thought of, there will be more that the defaults will not cover, and multi-phase training is bound to have more specifics once the library is created. + +## Alternatives + +The other alternative is to add these profile options as flags including specific training level yaml profiles. Adding profiles and configs for specific commands is not good UX and will confuse most users. + + + + +