Fine-tuning and Inference Deployment of Yuan 2.0 with FastChat

FastChat is an open platform for training, serving, and evaluating LLM-based chatbots. Through Hugging Face Transformers it supports multi-node, multi-GPU fine-tuning of LLMs with DeepSpeed/FSDP. The following describes the workflow for fine-tuning the Yuan 2.0 model with FastChat.

Preparing the fine-tuning environment

  • docker pull nvcr.io/nvidia/pytorch:23.08-py3
  • docker run -v HOST_WORK_PATH:/workspace/ --ipc=host --gpus all -p host-port:container-port --shm-size='64g' -it nvcr.io/nvidia/pytorch:23.08-py3 /bin/bash
  • git clone https://github.com/lm-sys/FastChat.git
  • cd FastChat
  • pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
  • pip install -e ".[model_worker,webui,train]"
  • pip install deepspeed "bitsandbytes>=0.39.0" "transformers==4.31.0" plotly openai
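
After installation, a quick sanity check (a minimal sketch; it only imports the key packages and prints GPU visibility) can confirm that the container environment is usable:

# quick sanity check after installation: package versions and CUDA visibility
import torch
import transformers
import fastchat  # confirms the editable FastChat install is importable

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available(),
      "| GPUs:", torch.cuda.device_count())
print("transformers", transformers.__version__)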

Preparing the model and data

  • Obtain the Yuan 2.0 Hugging Face model files:
  • Prepare the data: FastChat is built for chatbot training and serving, so the standard datasets it expects are multi-turn and single-turn conversation datasets.
    (1) For a custom dataset, use the data format required by FastChat, e.g. define single-turn or multi-turn conversation data with JSON files in the formats shown below.
    (2) Convert an existing instruction dataset into single-turn conversations, e.g. the English or Chinese alpaca-data datasets can be converted into the corresponding format (see the conversion sketch after the examples below).
    (3) Use an open-source multi-turn conversation dataset, such as belle-0.8M, the user-assistant multi-turn dataset released by the BELLE project.
# multi-turn example
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "Have a nice day!"
      },
      {
        "from": "gpt",
        "value": "You too!"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  }
]
# single-turn example
[
  {
    "id": "1",
    "conversations": [
      {
        "from": "human",
        "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
      },
      {
        "from": "gpt",
        "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
      }
    ]
  },
  {
    "id": "2",
    "conversations": [
      {
        "from": "human",
        "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:"
      },
      {
        "from": "gpt",
        "value": "The three primary colors are red, blue, and yellow."
      }
    ]
  }
]
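
For point (2) above, the following is a minimal conversion sketch that turns alpaca-style records into the single-turn format shown above (the input file name and the instruction/input/output field names are assumptions about the alpaca-data layout, not part of FastChat):

# convert_alpaca.py - sketch: alpaca-style records -> FastChat single-turn conversations
import json

PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def convert(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as f:
        records = json.load(f)

    converted = []
    for i, rec in enumerate(records):
        instruction = rec["instruction"]
        if rec.get("input"):  # fold the optional input field into the instruction
            instruction = f"{instruction}\n{rec['input']}"
        converted.append({
            "id": str(i + 1),
            "conversations": [
                {"from": "human", "value": PROMPT.format(instruction=instruction)},
                {"from": "gpt", "value": rec["output"]},
            ],
        })

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(converted, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    convert("alpaca_data.json", "alpaca_data_conversion.json")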

Using the Yuan 2.0 training scripts customized in FastChat

In the fastchat/train/train_mem.py script:
- from fastchat.train.train import train
+ from fastchat.train.train_yuan2 import train

In the fastchat/train/train_lora.py script:
-from fastchat.train.train import (
-    DataArguments,
-    ModelArguments,
-    make_supervised_data_module,
-)

+from fastchat.train.train_yuan2 import (
+    DataArguments,
+    ModelArguments,
+    make_supervised_data_module,
+)

Copy the special-token additions from the fastchat/train/train_yuan2.py script into train_lora.py:
+tokenizer.add_tokens(
+    [
+        "<eod>",
+        "<sep>",
+        "<pad>",
+        "<mask>",
+        "<predict>",
+        "<FIM_SUFFIX>",
+        "<FIM_PREFIX>",
+        "<FIM_MIDDLE>",
+        "<commit_before>",
+        "<commit_msg>",
+        "<commit_after>",
+        "<jupyter_start>",
+        "<jupyter_text>",
+        "<jupyter_code>",
+        "<jupyter_output>",
+        "<empty_output>",
+    ],
+    special_tokens=True,
+)
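
As a quick check (a sketch assuming a local Yuan 2.0 checkpoint; the path is a placeholder), you can verify that the added tokens are treated as single special tokens by the tokenizer:

# sketch: confirm the added special tokens each map to a single token id
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(
    "path-to-huggingface-models",  # placeholder path to the Yuan 2.0 checkpoint
    add_eos_token=False,
    add_bos_token=False,
    eos_token="<eod>",
)
tokenizer.add_tokens(["<eod>", "<sep>", "<pad>", "<mask>"], special_tokens=True)

print(tokenizer.convert_tokens_to_ids("<sep>"))  # one id, not a sequence of ids
print(tokenizer.tokenize("hello<sep>world"))     # "<sep>" stays a single token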

Information about the yuan2 template added in FastChat. The following content does not need to be modified; developers with special requirements can adjust or change the template information below.

# yuan template information

The fastchat/conversation.py script contains the conversation template customized for Yuan 2.0 chat:

# Yuan2.0 chat template
# source: https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf/blob/main/tokenizer_config.json#L6
register_conv_template(
    Conversation(
        name="yuan2",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.YUAN2,
        sep="<sep>",
        sep2="\n",
        stop_token_ids=[
            77185,
        ],  # "<eod>"
        stop_str="<eod>",
    )
)
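
To see how this template assembles a prompt, a small sketch using FastChat's get_conv_template (the user message is just an example):

# sketch: build a Yuan 2.0 chat prompt with the registered "yuan2" template
from fastchat.conversation import get_conv_template

conv = get_conv_template("yuan2")
conv.append_message(conv.roles[0], "Write a short poem about spring.")  # user turn
conv.append_message(conv.roles[1], None)  # empty slot for the assistant reply
print(conv.get_prompt())  # turns joined with "<sep>" according to SeparatorStyle.YUAN2
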
The fastchat/model/model_adapter.py script contains the functions used when loading the Yuan 2.0 chat model and tokenizer:

class Yuan2Adapter(BaseModelAdapter):
    """The model adapter for Yuan2.0"""

    def match(self, model_path: str):
        return "yuan2" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        revision = from_pretrained_kwargs.get("revision", "main")
        # from_pretrained_kwargs["torch_dtype"] = torch.bfloat16
        tokenizer = LlamaTokenizer.from_pretrained(
            model_path,
            add_eos_token=False,
            add_bos_token=False,
            eos_token='<eod>',
            eod_token='<eod>',
            sep_token='<sep>',
            revision = revision,
        )
        tokenizer.add_tokens(
            ['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>', '<commit_before>',
             '<commit_msg>', '<commit_after>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>',
             '<jupyter_output>', '<empty_output>'], special_tokens=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            # device_map='auto',
            trust_remote_code=True,
            **from_pretrained_kwargs
        )
        return model, tokenizer

    def get_default_conv_template(self, model_path: str) -> Conversation:
        return get_conv_template("yuan2")

The fastchat/model/model_yuan2.py script contains the default settings used when the Yuan 2.0 chat model generates content.

Full fine-tuning of Yuan 2.0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train_mem.py \
        --model_name_or_path  path-to-huggingface-models \
        --trust_remote_code True \
        --data_path ./data/alpaca_data_zh_conversion.json \
        --bf16 True \
        --output_dir ./test_yuan2b_full \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1200 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 1024 \
        --gradient_checkpointing True \
        --lazy_preprocess True \
        --deepspeed playground/zero2_ds_woloading.json \
        --efficient_loss False \
        --split_example_loss True \
        --last_response_loss False \


--model_max_length specifies the maximum length of a single sample during fine-tuning;

--efficient_loss, --split_example_loss and --last_response_loss select among three different ways of computing the loss for multi-turn conversations. (1) efficient_loss computes the loss over the assistant's responses; (2) last_response_loss computes the loss only over the assistant's response in the final turn; (3) split_example_loss splits a multi-turn conversation into multiple samples and computes the loss over the assistant's response in the last turn of each sample. Exactly one of the three must be True; the other two must be False.
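
To make the difference concrete, here is a conceptual sketch (not the FastChat implementation) of which assistant turns contribute to the loss for a two-turn conversation under each mode:

# conceptual sketch: which assistant turns are supervised under each loss mode
def supervised_targets(turns, mode):
    assistant = [i for i, (role, _) in enumerate(turns) if role == "gpt"]
    if mode == "efficient_loss":      # one sample, loss on every assistant reply
        return [("sample_0", assistant)]
    if mode == "last_response_loss":  # one sample, loss on the final assistant reply only
        return [("sample_0", assistant[-1:])]
    if mode == "split_example_loss":  # one sample per assistant reply, each supervised on its last reply
        return [(f"sample_{k}", [idx]) for k, idx in enumerate(assistant)]
    raise ValueError(mode)

turns = [("human", "Q1"), ("gpt", "A1"), ("human", "Q2"), ("gpt", "A2")]
for mode in ("efficient_loss", "split_example_loss", "last_response_loss"):
    print(mode, supervised_targets(turns, mode))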

The remaining arguments can be understood by referring to the FastChat source code and the Transformers documentation.

  • Reference ZeRO-2 config file:
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true
}

Parameter-efficient fine-tuning with LoRA and QLoRA


Full fine-tuning of a large model is expensive. Instead, we can use parameter-efficient fine-tuning methods that add a small number of extra parameters to the model and train only those new parameters, such as LoRA and QLoRA.

LoRA is essentially a reparameterization method: it adds low-rank branches alongside the weight matrices to adapt the model. By attaching branches to only a subset of the weight matrices it reduces computation, and by updating only the branch parameters it reduces memory usage and the communication volume of parallel training.

QLoRA builds on LoRA by quantizing the model weights to 4 bit and quantizing the scale parameters a second time (double quantization) to further reduce memory usage. Note that, compared with LoRA, QLoRA usually attaches branch matrices to more modules, and it does not speed up computation; in fact it incurs some efficiency loss.
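
A minimal sketch of the LoRA reparameterization described above, in PyTorch (the class and hyperparameters here are illustrative, not the PEFT or FastChat implementation):

# minimal LoRA sketch: a frozen linear layer plus a trainable low-rank branch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (alpha / r) * B(A(x)); the base weight is frozen, only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base.requires_grad_(False)                     # freeze the pretrained weight
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)                         # branch starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(2048, 2048), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A and B matrices are trainable
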
With FastChat, LoRA- and QLoRA-based parameter-efficient fine-tuning of the Yuan 2.0 model can be done very conveniently as follows.

  • Use the train_lora.py script: torchrun --nproc_per_node=8 --master_port=XXXX fastchat/train/train_lora.py .....
  • Use --lora_target_modules to specify which modules receive LoRA adapters; one or more of "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" can be given; the default is "q_proj", "v_proj"
  • Use --lora_r to specify the rank of the LoRA matrices
  • Use --q_lora (True or False) to specify whether to fine-tune with QLoRA
  • For multi-turn conversations, parameter-efficient fine-tuning computes the loss in the same way as full fine-tuning; any one of the three modes defined for Yuan 2.0 can be used
    A reference script for parameter-efficient fine-tuning is shown below:
CUDA_VISIBLE_DEVICES=0 python  fastchat/train/train_lora.py \
        --model_name_or_path  hf-to-yuan-path \
        --trust_remote_code True \
        --data_path ./data/alpaca-data-conversation.json \
        --bf16 True \
        --output_dir ./checkpoints_yuan2_2b_lora \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1200 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --model_max_length 512 \
        --gradient_checkpointing True \
        --lazy_preprocess True \
        --q_lora True \
        --efficient_loss False \
        --split_example_loss True \
        --last_response_loss False \
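
After LoRA fine-tuning, the adapter weights can be merged back into the base model for deployment. A sketch using the peft library (the paths are placeholders, and it assumes the output directory contains the PEFT adapter files):

# sketch: merge LoRA adapter weights into the base Yuan 2.0 model for deployment
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "path-to-huggingface-models", trust_remote_code=True, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./checkpoints_yuan2_2b_lora")
merged = model.merge_and_unload()                 # fold the low-rank branches back into the weights
merged.save_pretrained("./yuan2_2b_lora_merged")  # a plain HF checkpoint usable by FastChat serving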


Measured fine-tuning results for reference

| Fine-tuning scheme | Sequence length | Model | Precision (load/compute) | GPUs | Batch size (micro/global) | Memory per GPU | Time per epoch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ds_zero2_full | 2048 | Yuan-2 2B | bf16/bf16 | 8*L20 | 1/128 | 16G | 1.68h |
| ds_zero3_lora | 2048 | Yuan-2 51B | bf16/bf16 | 8*L20 | 1/128 | 43G | 23h |
| ds_zero3_lora | 2048 | Yuan-2 102B | bf16/bf16 | 8*L20 | 1/128 | 45G | 47h |
| ds_zero2_full | 1024 | Yuan-2 2B | bf16/bf16 | 8*L20 | 1/128 | 15G | 1.3h |
| ds_zero3_lora | 1024 | Yuan-2 51B | bf16/bf16 | 8*L20 | 1/128 | 43G | 18h |
| ds_zero3_lora | 1024 | Yuan-2 102B | bf16/bf16 | 8*L20 | 1/128 | 42G | 40h |
| Qlora | 1024 | Yuan-2 2B | int4/bf16 | 1*L20 | 1/16 | 4.5G | 3.4h |

The above tests used the 52K alpaca samples converted into single-turn conversation data; time per epoch is the time to fine-tune for one epoch.

Deployment and use of the fine-tuned model

A chat model fine-tuned from Yuan 2.0 can be deployed and served very conveniently with FastChat.

  • Command line
Deploy the chat model on N GPUs:
python3 -m fastchat.serve.cli --model-path PATH-TO_CHATMODELS --num-gpus N
  • Web GUI
python3 -m fastchat.serve.controller  --host 0.0.0.0 &
python3 -m fastchat.serve.model_worker --model-path PATH-TO_CHATMODELS --host 0.0.0.0 &
# --gpus 0,1,2,3 --num-gpus 4 loads the model on 4 GPUs for inference
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port MAPPED-HOST-PORT
  • OpenAI-Compatible RESTful APIs

python3 -m fastchat.serve.controller --host 0.0.0.0 &
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --model-path PATH-TO_CHATMODELS --host 0.0.0.0 --port 31000 --worker-address http://0.0.0.0:31000 &
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 1234

# client call
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://0.0.0.0:1234/v1/"

model = "yuan2-2B"
prompt = "Once upon a time"

# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
# print the completion
print(completion.choices[0].message.content)

We can also use these OpenAI-compatible RESTful APIs from LangChain to build LLM-based applications.
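
For example, a minimal sketch pointing LangChain at the local FastChat API server (it assumes the langchain-openai integration package; the model name must match the served checkpoint):

# sketch: call the local FastChat OpenAI-compatible server through LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="yuan2-2B",                   # name of the model served by the worker
    base_url="http://0.0.0.0:1234/v1",
    api_key="EMPTY",                    # FastChat does not validate the key
)
print(llm.invoke("Hello! What is your name?").content)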