From d23f858d103d65e5ca928b26534b29f9c4274c45 Mon Sep 17 00:00:00 2001
From: sallyjunjun <jun_sally@126.com>
Date: Fri, 16 Aug 2024 20:16:22 +0800
Subject: [PATCH] docs: restructure READMEs and add HuggingFace model adaptation guides

---
 README-zh-Hans.md                             |  55 ++----
 README.md                                     |  58 ++----
 huggingface_model/README-zh-Hans.md           |  38 ++++
 .../{README.md => model_adapt_en.md}          |  48 ++++-
 huggingface_model/model_adapt_zh.md           | 173 ++++++++++++++++++
 5 files changed, 282 insertions(+), 90 deletions(-)
 create mode 100644 huggingface_model/README-zh-Hans.md
 rename huggingface_model/{README.md => model_adapt_en.md} (57%)
 create mode 100644 huggingface_model/model_adapt_zh.md

diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 789aca9..cd44553 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -3,59 +3,30 @@
 [English](./README.md) |
 [简体中文](./README-zh-Hans.md)
 
-该文件夹下包含了 huggingface 格式的模型modeling及configuration文件，以及使用InternEvo训练这些模型的样例train.py文件和配置文件。
+## 简介
+该项目中包含了huggingface格式的模型modeling及configuration文件，这些模型可以使用[InternEvo框架](https://github.com/InternLM/InternEvo)进行训练，并且支持加载huggingface中发布的模型checkpoint进行续训。
 
-## 安装InternEvo
+## 快速开始
+该项目的examples路径中，适配了每个模型拉起训练的脚本，在安装InternEvo训练框架之后，可以一键启动，进行训练。
+
+### 安装InternEvo
 参考[InternEvo安装文档](https://github.com/InternLM/InternEvo/blob/develop/doc/install.md)
 
-## 代码下载
+### 代码下载
 将InternEvo-HFModels中的文件下载到本地：
 ```bash
 git clone https://github.com/InternLM/InternEvo-HFModels.git
 ```
 
-## 启动训练
+### 启动训练
 根据需要运行的模型，选择指定的train.sh文件启动训练，如：
 ```bash
 bash examples/internlm/internlm_7b/train.sh 
 ```
 
-## isp并行
-如果需要开启isp并行模式训练，需要在启动训练前，修改config.py文件，将tensor并行模式改为isp，修改如下：
-```bash
-parallel = dict(
-    zero1=dict(size=-1),
-    tensor=dict(size=2, mode="isp"),
-    pipeline=dict(size=1, interleaved_overlap=True),
-    weight=dict(size=2, overlap=False, memory_pool=True),
-)
-```
-其中，tensor中的size为序列并行的大小，weight中的size为isp模式中，权重并行的大小。
-注意：这里weight参数中的overlap需要设置为False。
-
-需要修改模型modeling文件，将head、attention计算以及mlp中涉及的linear初始化函数改为使用InternEvo提供的new_linear()函数。以internlm模型的modeling文件为例，修改如下：
-```bash
-from internlm.model.modules.linear import new_linear
-
-class InternLMMLP(nn.Module):
-         super().__init__()
-         self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
-         self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
-         self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
-         self.act_fn = ACT2FN[hidden_act]
- 
-class InternLMAttention(nn.Module):
-         self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-         self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-         self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-         self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
- 
-class InternLMForCausalLM(InternLMPreTrainedModel):
-     def __init__(self, config):
-         super().__init__(config)
-         self.model = InternLMModel(config)
- 
-         self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
-```
-new_linear()函数的第一个参数标志参数的名称，可接受的名称范围为："head"、"output"、"wqkv"、"wq"、"wk"、"wv"、"wkv"、"w1"、"w3"、"w13"、"wo"、"out_proj"、"w2"，根据实际情况修改。
+## 训练策略
+huggingface模型接入InternEvo训练，支持packed_dataset以及flash_attention，在每个模型的config.py文件中，可以通过use_packed_dataset以及use_flash_attn两个开关控制是否开启。
+支持InternEvo中的isp并行训练策略，isp并行的具体原理请参考[InternEvo文档](https://internevo.readthedocs.io/zh-cn/latest/parallel.html)
 
+## 接入其他huggingface模型适配
+对于huggingface上发布的其他模型，如果需要接入到InternEvo中训练，可以参考[huggingface模型适配](./huggingface_model/model_adapt_zh.md)
diff --git a/README.md b/README.md
index 1c40bdc..7045f16 100644
--- a/README.md
+++ b/README.md
@@ -3,62 +3,32 @@
 [English](./README.md) |
 [简体中文](./README-zh-Hans.md)
 
-This directory contains the model modeling and configuration files in the Hugging Face format, as well as the sample train.py file and configuration files for training these models with InternEvo.
+## Introduction
+This project contains modeling and configuration files for models in the Hugging Face format. These models can be trained with the [InternEvo framework](https://github.com/InternLM/InternEvo), which also supports loading model checkpoints released on Hugging Face to continue training.
 
-## Installation of InternEvo
+## Quick Start
+The `examples` directory of this project provides a training launch script for each adapted model. After installing the InternEvo training framework, you can start training with a single command.
+
+### Installation of InternEvo
 Refer to the [InternEvo Installation Documentation](https://github.com/InternLM/InternEvo/blob/develop/doc/install.md)
 
-## Code Download
+### Code Download
 Download the files from InternEvo-HFModels to your local machine:
 ```bash
 git clone https://github.com/InternLM/InternEvo-HFModels.git
 ```
 
-## Start Training
+### Start Training
 Run the specified train.sh to start training according to the model you need, for example:
 ```bash
 bash examples/internlm/internlm_7b/train.sh                            
 ```
 
-## isp Parallel
-For parallel training in ISP mode, the config.py file needs to be modified before starting the training to change the tensor parallel mode to ISP. The modification is as follows:
-```bash
-parallel = dict(
-    zero1=dict(size=-1),
-    tensor=dict(size=2, mode="isp"),
-    pipeline=dict(size=1, interleaved_overlap=True),
-    weight=dict(size=2, overlap=False, memory_pool=True),
-)
-```
-Here, the size value in tensor is the size of sequence parallelism, and the size value in weight is the size of weight parallelism in ISP mode.
-Note: here overlap in weight parameter should be set to False.
-
-The modeling file of the model needs to be modified to use the new_linear() function provided by InternEvo for the initialization of head, attention calculations, and mlp in the linear function. Taking the modeling file of the InternLM model as an example, the modification is as follows:
-```bash
-from internlm.model.modules.linear import new_linear
+## Training Strategy
+Hugging Face models integrated into InternEvo training support packed_dataset and flash_attention. In each model's config.py file, the use_packed_dataset and use_flash_attn switches control whether these two features are enabled.
+The ISP parallel training strategy of InternEvo is also supported. For details on how ISP parallelism works, refer to the [InternEvo documentation](https://internevo.readthedocs.io/zh-cn/latest/parallel.html).
 
-class InternLMMLP(nn.Module):
-    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
-        super().__init__()
-        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
-        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
-        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
-        self.act_fn = ACT2FN[hidden_act]
-
-class InternLMAttention(nn.Module):
-    def __init__(self, config: InternLMConfig):
-        super().__init__()
-        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
-        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
-
-class InternLMForCausalLM(InternLMPreTrainedModel):
-    def __init__self, config):
-        super().__init__(config)
-        self.model = InternLMModel(config)
-
-        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
-```
-The first parameter of the new_linear() function indicates the name of the parameter and can be one of the following: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Modify according to the actual situation.
+## Adaptation of Other Huggingface Models
+To train other models released on Hugging Face with InternEvo, refer to the [Hugging Face model adaptation guide](./huggingface_model/model_adapt_en.md).
 
+
diff --git a/huggingface_model/README-zh-Hans.md b/huggingface_model/README-zh-Hans.md
new file mode 100644
index 0000000..9d07213
--- /dev/null
+++ b/huggingface_model/README-zh-Hans.md
@@ -0,0 +1,41 @@
+## isp并行
+如果需要开启isp并行模式训练，需要在启动训练前，修改config.py文件，将tensor并行模式改为isp，修改如下：
+```python
+parallel = dict(
+    zero1=dict(size=-1),
+    tensor=dict(size=2, mode="isp"),
+    pipeline=dict(size=1, interleaved_overlap=True),
+    weight=dict(size=2, overlap=False, memory_pool=True),
+)
+```
+其中，tensor中的size为序列并行的大小，weight中的size为isp模式中权重并行的大小。
+注意：这里weight参数中的overlap需要设置为False。
+
+需要修改模型modeling文件，将head、attention计算以及mlp中涉及的linear初始化函数改为使用InternEvo提供的new_linear()函数。以internlm模型的modeling文件为例，修改如下：
+```python
+from internlm.model.modules.linear import new_linear
+
+class InternLMMLP(nn.Module):
+    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
+        super().__init__()
+        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
+        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
+        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
+        self.act_fn = ACT2FN[hidden_act]
+
+class InternLMAttention(nn.Module):
+    def __init__(self, config: InternLMConfig):
+        super().__init__()
+        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
+
+class InternLMForCausalLM(InternLMPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = InternLMModel(config)
+
+        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
+```
+new_linear()函数的第一个参数标志参数的名称，可接受的名称范围为："head"、"output"、"wqkv"、"wq"、"wk"、"wv"、"wkv"、"w1"、"w3"、"w13"、"wo"、"out_proj"、"w2"，根据实际情况修改。
diff --git a/huggingface_model/README.md b/huggingface_model/model_adapt_en.md
similarity index 57%
rename from huggingface_model/README.md
rename to huggingface_model/model_adapt_en.md
index 8a9982a..11a497e 100644
--- a/huggingface_model/README.md
+++ b/huggingface_model/model_adapt_en.md
@@ -1,4 +1,4 @@
-# Adapting HuggingFace Models for InternEvo Packed and ISP Training
+# Adapting HuggingFace Models for InternEvo Packed Dataset and ISP Training
 
 ## Background
 
@@ -46,7 +46,7 @@ Step 3. Pass `cu_seqlens` and `max_seqlen` to flash attention varlen kernel for
 
 ```python
 if use_packed_dataset:
-    attn_output = isp_flash_attn_varlen_func(
+    attn_output = hf_q_k_v_with_cu_seqlens(
         query_states,
         key_states,
         value_states,
@@ -74,7 +74,7 @@ Step 2. Pass `cu_seqlens` and `max_seqlen` to flash attention varlen kernel for
 
 ```python
 if use_packed_dataset:
-    attn_output = isp_flash_attn_varlen_func(
+    attn_output = hf_q_k_v_with_cu_seqlens(
         query_states,
         key_states,
         value_states,
@@ -112,4 +112,44 @@ parallel = dict(
 
 ### Manual code adaption dispatch
 
-T.B.A.
\ No newline at end of file
+## ISP Parallel
+To train in ISP parallel mode, modify the config.py file before starting training and set the tensor parallel mode to `isp`, as follows:
+```python
+parallel = dict(
+    zero1=dict(size=-1),
+    tensor=dict(size=2, mode="isp"),
+    pipeline=dict(size=1, interleaved_overlap=True),
+    weight=dict(size=2, overlap=False, memory_pool=True),
+)
+```
+Here, `size` in `tensor` is the sequence parallel size, and `size` in `weight` is the weight parallel size used in ISP mode.
+Note: `overlap` in the `weight` parameter must be set to False.
+
+The model's modeling file must also be modified so that the linear layers used in the head, the attention computation, and the MLP are created with the new_linear() function provided by InternEvo. Taking the modeling file of the InternLM model as an example, the modification is as follows:
+```python
+from internlm.model.modules.linear import new_linear
+
+class InternLMMLP(nn.Module):
+    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
+        super().__init__()
+        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
+        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
+        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
+        self.act_fn = ACT2FN[hidden_act]
+
+class InternLMAttention(nn.Module):
+    def __init__(self, config: InternLMConfig):
+        super().__init__()
+        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
+        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
+
+class InternLMForCausalLM(InternLMPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = InternLMModel(config)
+
+        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
+```
+The first argument of new_linear() is the name of the parameter and must be one of the following: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Choose the name that matches the layer being replaced.
diff --git a/huggingface_model/model_adapt_zh.md b/huggingface_model/model_adapt_zh.md
new file mode 100644
index 0000000..6ee6388
--- /dev/null
+++ b/huggingface_model/model_adapt_zh.md
@@ -0,0 +1,173 @@
+# 接入其他HuggingFace模型适配指南
+
+## unpack dataset及纯dp模式适配策略
+接入一个新的huggingface模型，如果想快速跑起来验证，使用unpack dataset及仅dp并行策略，需要使用如下步骤接入：
+
+步骤一：下载模型modeling_xxx.py及configuration_xxx.py文件
+在huggingface_model目录下创建模型对应的路径，并将下载的文件放到指定路径中。
+如：
+从huggingface下载google/gemma-2-9b模型，则在huggingface_model中创建google/gemma-2-9b路径，放置modeling和configuration文件。
+并在examples中创建google/gemma-2-9b路径，放置模型启动train.py文件，InternEvo所需config.py文件，以及启动运行train.sh脚本文件。
+
+步骤二：修改modeling文件，设置参数属性
+在模型入口类的__init__函数中，执行self.post_init()之后，为参数设置IS_REPLICA_ZERO_PARALLEL属性，标记dp模式下，参数weight不进行切分。
+如在Gemma2ForCausalLM类的__init__函数中，添加如下代码：
+```python
+from internlm.core.context.parallel_context import IS_REPLICA_ZERO_PARALLEL
+
+        # ......
+
+        for module in self.modules():
+            for param in module.parameters():
+                setattr(param, IS_REPLICA_ZERO_PARALLEL, True)
+```
+
+步骤三：修改train.py文件
+参考其他模型的train.py文件编写，替换其中的模型类及config。如：
+参考internlm中的train.py，将InternLMForCausalLM、InternLMConfig分别替换为新接入的Gemma2ForCausalLM、Gemma2Config。
+
+initialize_model函数，不需要有入参，因为在unpack及纯dp模式下不涉及dispatch设置。
+
+步骤四：修改config.py文件
+使用unpack dataset以及纯dp模式训练，需要修改的参数如下：
+```python
+data = dict(
+    # ......
+    use_packed_dataset=False,
+    # ......
+)
+
+parallel = dict(
+    zero1=dict(size=-1),
+    tensor=dict(size=1, mode="mtp"),
+    pipeline=dict(size=1, interleaved_overlap=True),
+    weight=dict(size=1, overlap=False, memory_pool=True),
+)
+```
+
+如果需要加载huggingface模型的ckpt，则使用如下配置：
+```python
+ckpt = dict(
+    enable_save_ckpt=False,
+    save_ckpt_folder=SAVE_CKPT_FOLDER,
+    load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="hf_model"),
+    auto_resume=False,
+    # ......
+)
+```
+
+如果不需要加载ckpt，从随机初始化的weight开始训练，则将load_ckpt_info设置为None即可。
+
+如果不需要加载ckpt，同时需要在训练的过程中保存ckpt，并在训练被意外中断之后，再次启动能够从保存的ckpt处续训，则设置如下：
+```python
+ckpt = dict(
+    enable_save_ckpt=True,
+    save_ckpt_folder=SAVE_CKPT_FOLDER,
+    load_ckpt_info=None,
+    auto_resume=True,
+    # ......
+)
+```
+
+## pack dataset及isp并行模式适配策略
+### 背景
+当HuggingFace模型与InternEvo框架集成时，我们希望支持打包训练和ISP并行训练，原因是：
+1. 提高GPU计算利用率（减少在无意义的填充token上的计算浪费）
+2. 支持长序列训练（使用InternEvo框架中最新的并行技术）
+
+这需要适配模型以支持：
+1. 打包训练
+2. ISP（Intern序列并行）训练
+
+### 前置工作
+1. 接入新模型需要首先完成上述 unpack dataset及纯dp模式适配策略 中的 步骤一 和 步骤三 。
+注意：适配isp并行策略，不需要执行 步骤二 ，参数属性的设置会在下面的适配步骤中具体说明。
+2. 新模型的modeling文件中，需要支持FlashAttention2的实现。如不支持，需要先新增对FlashAttention2的支持。
+
+### 手动更改modeling文件适配
+手动更改modeling的方案，需要在train.py的initialize_model函数入参中，设置auto_dispatch=False，这里只会检查手动替换的正确性。
+
+#### 适配packed dataset
+
+#### 适配ISP并行训练
+
+注意：手动适配存在潜在的nan问题，因此引出下面的dispatch方式适配；dispatch方式中需要reset_parameters。
+
+### dispatch方式适配
+dispatch适配的方案，需要在train.py的initialize_model函数入参中，设置auto_dispatch=True，启用dispatch_utils中自动替换适配的逻辑。
+
+### 新增split_weights函数，支持isp模式加载ckpt
+加载huggingface模型的checkpoint，在isp模式下会涉及到wqkv、wo、w1、w2、w3、embedding、output等参数的权重切分。因此，需要在类似InternLM2ForCausalLM的类中，新增split_weights函数，处理参数权重切分。
+在InternEvo训练框架中，会在加载huggingface权重处，调用这个函数对权重进行切分。以InternLM2为例，split_weights函数的具体实现如下：
+```python
+    def split_weights(self, first_layer, model_state_dict, state_dict, split_size, local_rank, row_dim):
+        for i in range(0, gpc.config.model.num_layers):
+            model_state_dict[f"model.layers.{i}.attention.wqkv.weight"] = torch.chunk(
+                state_dict.pop(f"model.layers.{i+first_layer}.attention.wqkv.weight"),
+                split_size,
+                dim=0,
+            )[local_rank]
+            model_state_dict[f"model.layers.{i}.attention.wo.weight"] = torch.chunk(
+                state_dict.pop(f"model.layers.{i+first_layer}.attention.wo.weight"),
+                split_size,
+                dim=row_dim,
+            )[local_rank]
+            model_state_dict[f"model.layers.{i}.feed_forward.w1.weight"] = torch.chunk(
+                state_dict.pop(f"model.layers.{i+first_layer}.feed_forward.w1.weight"),
+                split_size,
+                dim=0,
+            )[local_rank]
+            model_state_dict[f"model.layers.{i}.feed_forward.w3.weight"] = torch.chunk(
+                state_dict.pop(f"model.layers.{i+first_layer}.feed_forward.w3.weight"),
+                split_size,
+                dim=0,
+            )[local_rank]
+            model_state_dict[f"model.layers.{i}.feed_forward.w2.weight"] = torch.chunk(
+                state_dict.pop(f"model.layers.{i+first_layer}.feed_forward.w2.weight"),
+                split_size,
+                dim=row_dim,
+            )[local_rank]
+            model_state_dict[f"model.layers.{i}.attention_norm.weight"] = state_dict.pop(
+                f"model.layers.{i+first_layer}.attention_norm.weight"
+            )
+            model_state_dict[f"model.layers.{i}.ffn_norm.weight"] = state_dict.pop(
+                f"model.layers.{i+first_layer}.ffn_norm.weight"
+            )
+
+        if (gpc.get_local_rank(ParallelMode.PIPELINE) - 1 == 0) or (not gpc.is_using_parallel_mode(ParallelMode.PIPELINE)):
+            model_state_dict["model.tok_embeddings.weight"] = torch.chunk(
+                state_dict.pop("model.tok_embeddings.weight"),
+                split_size,
+                dim=1,
+            )[local_rank]
+
+        if gpc.is_last_rank(ParallelMode.PIPELINE):
+            model_state_dict["output.weight"] = torch.chunk(
+                state_dict.pop("output.weight"),
+                split_size,
+                dim=0,
+            )[local_rank]
+            model_state_dict["model.norm.weight"] = state_dict["model.norm.weight"]
+
+        return model_state_dict
+```
+
+### 修改config.py文件
+使用pack dataset以及isp模式训练，需要修改的参数如下：
+```python
+data = dict(
+    # ......
+    use_packed_dataset=True,
+    # ......
+)
+
+parallel = dict(
+    zero1=dict(size=-1),
+    tensor=dict(size=2, mode="isp"),
+    pipeline=dict(size=1, interleaved_overlap=True),
+    weight=dict(size=2, overlap=False, memory_pool=True),
+)
+```
+其中，tensor的size为sequence parallel的大小，weight的size为weight parallel的大小。
+
+关于是否加载ckpt的设置，与 unpack dataset及纯dp模式适配策略 中的 步骤四 相同。