[Examples] Finetune Falcon 7B and 40B Example (#2242)
* Adding Falcon Example

* Update llm/falcon/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Training Script

* New YAML Script

* Adding time and cost

* Adding Image

* Small change to README

* Update README.md

* Small Updates

* Small changes to pricing

* wip

* wip

* add gpt3

* edits

* lint

* lint

* updates

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
xzrderek and romilbhardwaj authored Sep 11, 2023
1 parent ee5928f commit 9e115c9
Showing 3 changed files with 350 additions and 0 deletions.
73 changes: 73 additions & 0 deletions llm/falcon/README.md
@@ -0,0 +1,73 @@
# Finetuning Falcon with SkyPilot

This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, open-source LLMs that rival many current closed-source models, including ChatGPT.

* [Blog post](https://huggingface.co/blog/falcon)
* [Repo](https://huggingface.co/tiiuae/falcon-40b)
* [Training code](https://gist.github.com/pacman100/1731b41f7a90a87b457e8c5415ff1c14)


## Prerequisites
Install the latest SkyPilot and check your setup of the cloud credentials:
```bash
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
```
See the Falcon SkyPilot YAML for [training](falcon.yaml). Serving is currently a work in progress, and a YAML for it will be provided soon! We are also working on adding an evaluation step to compare your finetuned model against the base model.

## Running Falcon on SkyPilot
Finetuning `Falcon-7B` and `Falcon-40B` requires GPUs with 80GB of memory,
but the sharded `Falcon-7B` variant needs only 40GB. Thus,
* If your GPU has 40GB of memory or less (e.g., Nvidia A100): use `ybelkada/falcon-7b-sharded-bf16`.
* If your GPU has 80GB of memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`.

Run `sky show-gpus --all` to list the GPUs supported across clouds.
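For example, to check which clouds offer the exact accelerator used in this example (the column layout of the output may differ slightly across SkyPilot versions):

```bash
# Show clouds, instance types, and hourly prices for A100-80GB GPUs.
sky show-gpus A100-80GB

# Show the full catalog of supported accelerators across clouds.
sky show-gpus --all
```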

We can start finetuning the Falcon model on Open Assistant's [Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset **with a single command**. SkyPilot will automatically find the cheapest available VM across your enabled clouds.

**To finetune on different data**, simply replace `timdettmers/openassistant-guanaco` with any other Hugging Face dataset, as sketched below.
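For instance, the `run` section of [falcon.yaml](falcon.yaml) could be edited as in this sketch. The dataset name is a placeholder; whichever dataset you pick should expose a `text` column, since `train.py` passes `dataset_text_field="text"` to the trainer:

```bash
# In the `run` section of falcon.yaml, point --dataset_name at your dataset.
python train.py \
  --model_name $MODEL_NAME \
  --max_seq_len 2048 \
  --bf16 \
  --group_by_length \
  --bnb_4bit_compute_dtype bfloat16 \
  --max_steps 500 \
  --dataset_name <your-hf-username>/<your-dataset> \
  --output_dir /results
```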

Steps for training on your cloud(s):

1. In [falcon.yaml](falcon.yaml), set the following variables in `envs` (alternatively, these values can be overridden at launch time; see the sketch after these steps):

   - Replace `OUTPUT_BUCKET_NAME` with a globally unique name. SkyPilot will create this bucket for you to store the model weights.
   - Replace `WANDB_API_KEY` with your own key.
   - Replace `MODEL_NAME` with your desired base model.

2. **Training the Falcon model using spot instances**:

```bash
sky spot launch -n falcon falcon.yaml
```

Currently, `A100-80GB:1` spot instances are only available on AWS and GCP.

[Optional] **To use on-demand `A100-80GB:1` instances**, which are currently available on Lambda Cloud, Azure, and GCP:
```bash
sky launch -c falcon -s falcon.yaml --no-use-spot
```
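Putting these steps together, a complete launch might look like the sketch below. The bucket name and API key are placeholders, and the `--env` overrides plus the `sky spot queue` / `sky spot logs` commands assume a recent SkyPilot version. Managed spot jobs are automatically relaunched if the spot instance is preempted, with outputs preserved in the mounted bucket.

```bash
# Launch a managed spot job, overriding the envs from the command line
# instead of editing falcon.yaml.
sky spot launch -n falcon falcon.yaml \
  --env MODEL_NAME=tiiuae/falcon-7b \
  --env WANDB_API_KEY=<your-wandb-key> \
  --env OUTPUT_BUCKET_NAME=<your-unique-bucket-name>

# Check the status of the managed spot job.
sky spot queue

# Stream its logs, using the job ID shown by `sky spot queue`.
sky spot logs <job-id>
```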

For reference, below is a loss graph you may expect to see, along with the approximate time and cost of finetuning each of the models for 500 steps (assuming a spot A100 rate of $1.1/hour and a spot A100-80GB rate of $1.61/hour):

<img width="524" alt="image" src="https://imgur.com/BDlHink.png">

1. `ybelkada/falcon-7b-sharded-bf16`: 2.5 to 3 hours using 1 A100 spot GPU; total cost ≈ $3.30.

2. `tiiuae/falcon-7b`: 2.5 to 3 hours using 1 A100 spot GPU; total cost ≈ $3.30.

3. `tiiuae/falcon-40b`: 10 hours using 1 A100-80GB spot GPU; total cost ≈ $16.10.
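These totals follow directly from multiplying the assumed spot rates by the training time; a quick sanity check of the arithmetic (same rate assumptions as above, which vary by cloud and region):

```bash
# Back-of-the-envelope check of the cost figures above.
python3 - <<'EOF'
print(f"Falcon-7B (either variant), ~3 h on a spot A100 at $1.10/h: ~${3 * 1.10:.2f}")
print(f"Falcon-40B, 10 h on a spot A100-80GB at $1.61/h:            ${10 * 1.61:.2f}")
EOF
```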


## Q&A

Q: I see some bucket permission errors `sky.exceptions.StorageBucketGetError` when running the above:
```
...
sky.exceptions.StorageBucketGetError: Failed to connect to an existing bucket 'YOUR_OWN_BUCKET_NAME'.
Please check if:
1. the bucket name is taken and/or
2. the bucket permissions are not setup correctly. To debug, consider using gsutil ls gs://YOUR_OWN_BUCKET_NAME.
```

A: You need to replace the bucket name with your own globally unique name, and rerun the commands. New private buckets will be automatically created under your cloud account.
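If the error persists, it can help to check whether the bucket name is already taken and whether the bucket is visible to your credentials. A debugging sketch (the bucket name is a placeholder; `gsutil` applies when the bucket is backed by GCS):

```bash
# List the buckets SkyPilot is managing.
sky storage ls

# For a GCS-backed bucket, check that the name resolves and is accessible.
gsutil ls gs://<your-unique-bucket-name>
```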
42 changes: 42 additions & 0 deletions llm/falcon/falcon.yaml
@@ -0,0 +1,42 @@
resources:
  accelerators: A100-80GB:1
  disk_size: 1000
  disk_tier: high

workdir: .

envs:
  MODEL_NAME: tiiuae/falcon-7b  # [ybelkada/falcon-7b-sharded-bf16, tiiuae/falcon-7b, tiiuae/falcon-40b]
  WANDB_API_KEY: $WANDB_KEY  # Change to your own wandb key
  OUTPUT_BUCKET_NAME:  # Set a unique name for the bucket which will store model weights

file_mounts:
  /results:  # Change if the output_dir parameter is changed below
    name: $OUTPUT_BUCKET_NAME
    mode: MOUNT

setup: |
  # Set up the environment
  conda activate falcon
  if [ $? -ne 0 ]; then
    conda create -n falcon python=3.10 -y
    conda activate falcon
  fi
  # Install dependencies
  pip install -q -U transformers accelerate peft
  pip install -q trl==0.4.6 datasets bitsandbytes einops wandb scipy torch

run: |
  conda activate falcon
  wandb login $WANDB_API_KEY
  echo "Starting training..."
  python train.py \
    --model_name $MODEL_NAME \
    --max_seq_len 2048 \
    --bf16 \
    --group_by_length \
    --bnb_4bit_compute_dtype bfloat16 \
    --max_steps 500 \
    --dataset_name timdettmers/openassistant-guanaco \
    --output_dir /results
235 changes: 235 additions & 0 deletions llm/falcon/train.py
@@ -0,0 +1,235 @@
# Adapted from https://gist.github.com/pacman100/1731b41f7a90a87b457e8c5415ff1c14
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from dataclasses import field
from typing import Optional

from datasets import load_dataset
from peft import LoraConfig
from peft.tuners.lora import LoraLayer
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import BitsAndBytesConfig
from transformers import HfArgumentParser
from transformers import TrainingArguments
from trl import SFTTrainer

########################################################################
# This is a fully working simple example of using trl's SFTTrainer.
#
# This example finetunes a causal language model (e.g., Falcon, GPT-2,
# GPT-Neo) with trl's SFTTrainer, leveraging the PEFT library to train
# LoRA adapters on top of an (optionally 4-bit quantized) base model.
########################################################################

# Define and parse arguments.


@dataclass
class ScriptArguments:
"""
These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
"""

local_rank: Optional[int] = field(default=-1,
metadata={"help": "Used for multi-gpu"})

per_device_train_batch_size: Optional[int] = field(default=4)
per_device_eval_batch_size: Optional[int] = field(default=1)
gradient_accumulation_steps: Optional[int] = field(default=4)
learning_rate: Optional[float] = field(default=2e-4)
max_grad_norm: Optional[float] = field(default=0.3)
weight_decay: Optional[int] = field(default=0.001)
lora_alpha: Optional[int] = field(default=16)
lora_dropout: Optional[float] = field(default=0.1)
lora_r: Optional[int] = field(default=64)
max_seq_length: Optional[int] = field(default=512)
model_name: Optional[str] = field(
default="tiiuae/falcon-7b",
metadata={
"help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
},
)
dataset_name: Optional[str] = field(
default="timdettmers/openassistant-guanaco",
metadata={"help": "The preference dataset to use."},
)
use_4bit: Optional[bool] = field(
default=True,
metadata={"help": "Activate 4bit precision base model loading"},
)
use_nested_quant: Optional[bool] = field(
default=False,
metadata={"help": "Activate nested quantization for 4bit base models"},
)
bnb_4bit_compute_dtype: Optional[str] = field(
default="float16",
metadata={"help": "Compute dtype for 4bit base models"},
)
bnb_4bit_quant_type: Optional[str] = field(
default="nf4",
metadata={"help": "Quantization type fp4 or nf4"},
)
num_train_epochs: Optional[int] = field(
default=1,
metadata={
"help": "The number of training epochs for the reward model."
},
)
fp16: Optional[bool] = field(
default=False,
metadata={"help": "Enables fp16 training."},
)
bf16: Optional[bool] = field(
default=False,
metadata={"help": "Enables bf16 training."},
)
packing: Optional[bool] = field(
default=False,
metadata={"help": "Use packing dataset creating."},
)
gradient_checkpointing: Optional[bool] = field(
default=True,
metadata={"help": "Enables gradient checkpointing."},
)
optim: Optional[str] = field(
default="paged_adamw_32bit",
metadata={"help": "The optimizer to use."},
)
lr_scheduler_type: str = field(
default="constant",
metadata={
"help": "Learning rate schedule. Constant a bit better than cosine, and has advantage for analysis"
},
)
max_steps: int = field(
default=10000,
metadata={"help": "How many optimizer update steps to take"})
warmup_ratio: float = field(
default=0.03, metadata={"help": "Fraction of steps to do a warmup for"})
group_by_length: bool = field(
default=True,
metadata={
"help": "Group sequences into batches with same length. Saves memory and speeds up training considerably."
},
)
save_steps: int = field(
default=10, metadata={"help": "Save checkpoint every X updates steps."})
logging_steps: int = field(default=10,
metadata={"help": "Log every X updates steps."})
output_dir: Optional[str] = field(
default="/results",
metadata={"help": "Directory where model checkpoints will be stored."},
)


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]


def create_and_prepare_model(args):
    compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    )

    if compute_dtype == torch.float16 and args.use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print(
                "Your GPU supports bfloat16, you can accelerate training with the argument --bf16"
            )
            print("=" * 80)

    device_map = {"": 0}

    model = AutoModelForCausalLM.from_pretrained(args.model_name,
                                                 quantization_config=bnb_config,
                                                 device_map=device_map,
                                                 trust_remote_code=True)

    peft_config = LoraConfig(
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        r=script_args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
        ],  # , "word_embeddings", "lm_head"],
    )

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name,
                                              trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, peft_config, tokenizer


training_arguments = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    optim=script_args.optim,
    save_steps=script_args.save_steps,
    logging_steps=script_args.logging_steps,
    learning_rate=script_args.learning_rate,
    fp16=script_args.fp16,
    bf16=script_args.bf16,
    max_grad_norm=script_args.max_grad_norm,
    max_steps=script_args.max_steps,
    warmup_ratio=script_args.warmup_ratio,
    group_by_length=script_args.group_by_length,
    lr_scheduler_type=script_args.lr_scheduler_type,
)

model, peft_config, tokenizer = create_and_prepare_model(script_args)
model.config.use_cache = False
dataset = load_dataset(script_args.dataset_name, split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=script_args.packing,
)

for name, module in trainer.model.named_modules():
    if isinstance(module, LoraLayer):
        if script_args.bf16:
            module = module.to(torch.bfloat16)
    if "norm" in name:
        module = module.to(torch.float32)
    if "lm_head" in name or "embed_tokens" in name:
        if hasattr(module, "weight"):
            if script_args.bf16 and module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)

trainer.train()
