fix doc
sallyjunjun committed Aug 16, 2024
1 parent 42be3cd commit d23f858
Showing 5 changed files with 282 additions and 90 deletions.
55 changes: 13 additions & 42 deletions README-zh-Hans.md
@@ -3,59 +3,30 @@
[English](./README.md) |
[简体中文](./README-zh-Hans.md)

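This folder contains the modeling and configuration files for models in huggingface format, as well as sample train.py and configuration files for training these models with InternEvo.
## Introduction
This project contains modeling and configuration files for models in huggingface format. These models can be trained with the [InternEvo framework](https://github.com/InternLM/InternEvo), and checkpoints released on huggingface can be loaded to continue training.

## Installing InternEvo
## Quick Start
The examples directory of this project provides a launch script for each model. After installing the InternEvo training framework, you can start training with a single command.

### Installing InternEvo
Refer to the [InternEvo installation documentation](https://github.com/InternLM/InternEvo/blob/develop/doc/install.md).

## Code Download
### Code Download
Download the files in InternEvo-HFModels to your local machine: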
```bash
git clone https://github.com/InternLM/InternEvo-HFModels.git
```

## Start Training
### Start Training
Choose the train.sh file for the model you want to run and use it to start training, for example:
```bash
bash examples/internlm/internlm_7b/train.sh
```

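## isp Parallel
To train in isp parallel mode, modify the config.py file before starting training and change the tensor parallel mode to isp, as follows:
```python
parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=2, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=2, overlap=False, memory_pool=True),
)
```
Here, size in tensor is the sequence parallel size, and size in weight is the weight parallel size used in isp mode.
Note: overlap in the weight setting must be set to False.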

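The model's modeling file must also be modified so that the linear layers in the head, the attention computation, and the mlp are created with the new_linear() function provided by InternEvo. Taking the modeling file of the internlm model as an example:
```python
from torch import nn
from transformers.activations import ACT2FN

from internlm.model.modules.linear import new_linear

# InternLMConfig, InternLMModel and InternLMPreTrainedModel are defined
# elsewhere in the modeling file; only the changed lines are shown.


class InternLMMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
        super().__init__()
        # The first argument of new_linear() is the parameter name (see the list below).
        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
        self.act_fn = ACT2FN[hidden_act]


class InternLMAttention(nn.Module):
    def __init__(self, config: InternLMConfig):
        super().__init__()
        # self.hidden_size, self.num_heads and self.head_dim are set from
        # config in the original __init__ (omitted in this excerpt).
        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)


class InternLMForCausalLM(InternLMPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = InternLMModel(config)

        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
```
The first argument of new_linear() identifies the parameter by name. Accepted names are: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Choose the one that matches the layer being replaced.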
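## Training Strategy
huggingface models trained through InternEvo support packed_dataset and flash_attention. In each model's config.py file, the use_packed_dataset and use_flash_attn switches control whether these two features are enabled.
The isp parallel training strategy of InternEvo is also supported. For details on how isp parallelism works, see the [InternEvo documentation](https://internevo.readthedocs.io/zh-cn/latest/parallel.html).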
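
When a packed dataset is used, several samples share one sequence, and the adaptation guide in this repository passes `cu_seqlens` and `max_seqlen` to the flash attention varlen kernel so that attention stays within each sample. The following is a minimal pure-Python sketch of how such boundaries can be derived from sample lengths. The helper name `pack_boundaries` is hypothetical and is not part of InternEvo; how InternEvo itself builds these values is not shown in this document.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_boundaries(sample_lengths: List[int]) -> Tuple[List[int], int]:
    """Return (cu_seqlens, max_seqlen) for samples packed back to back.

    Hypothetical helper for illustration only.
    """
    if not sample_lengths or any(n <= 0 for n in sample_lengths):
        raise ValueError("sample_lengths must be a non-empty list of positive ints")
    # cu_seqlens[i] is the offset where sample i starts; the last entry is
    # the total number of tokens in the packed sequence.
    cu_seqlens = [0] + list(accumulate(sample_lengths))
    # max_seqlen is the length of the longest individual sample.
    max_seqlen = max(sample_lengths)
    return cu_seqlens, max_seqlen


cu_seqlens, max_seqlen = pack_boundaries([3, 5, 2])
print(cu_seqlens, max_seqlen)  # → [0, 3, 8, 10] 5
```

Sample `i` then occupies positions `cu_seqlens[i]` up to, but not including, `cu_seqlens[i + 1]` of the packed sequence.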

## Adapting Other huggingface Models
For other models released on huggingface that you want to train with InternEvo, refer to [huggingface model adaptation](./huggingface_model/model_adapt_zh.md).
58 changes: 14 additions & 44 deletions README.md
@@ -3,62 +3,32 @@
[English](./README.md) |
[简体中文](./README-zh-Hans.md)

This directory contains the model modeling and configuration files in the Hugging Face format, as well as the sample train.py file and configuration files for training these models with InternEvo.
## Introduction
This project contains modeling and configuration files for models in huggingface format. These models can be trained with the [InternEvo framework](https://github.com/InternLM/InternEvo), and checkpoints released on huggingface can be loaded to continue training.

## Installation of InternEvo
## Quick Start
The examples directory of this project provides a launch script for each model. After installing the InternEvo training framework, you can start training with a single command.

### Installation of InternEvo
Refer to the [InternEvo Installation Documentation](https://github.com/InternLM/InternEvo/blob/develop/doc/install.md)

## Code Download
### Code Download
Download the files from InternEvo-HFModels to your local machine:
```bash
git clone https://github.com/InternLM/InternEvo-HFModels.git
```

## Start Training
### Start Training
Run the specified train.sh to start training according to the model you need, for example:
```bash
bash examples/internlm/internlm_7b/train.sh
```

## isp Parallel
To train in ISP mode, modify the config.py file before starting training and change the tensor parallel mode to ISP, as follows:
```python
parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=2, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=2, overlap=False, memory_pool=True),
)
```
Here, size in tensor is the sequence parallel size, and size in weight is the weight parallel size used in ISP mode.
Note: overlap in the weight setting must be set to False.

The model's modeling file must also be modified so that the linear layers in the head, the attention computation, and the mlp are created with the new_linear() function provided by InternEvo. Taking the modeling file of the InternLM model as an example:
```python
from torch import nn
from transformers.activations import ACT2FN

from internlm.model.modules.linear import new_linear

# InternLMConfig, InternLMModel and InternLMPreTrainedModel are defined
# elsewhere in the modeling file; only the changed lines are shown.


class InternLMMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
        super().__init__()
        # The first argument of new_linear() is the parameter name (see the list below).
        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
        self.act_fn = ACT2FN[hidden_act]


class InternLMAttention(nn.Module):
    def __init__(self, config: InternLMConfig):
        super().__init__()
        # self.hidden_size, self.num_heads and self.head_dim are set from
        # config in the original __init__ (omitted in this excerpt).
        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)


class InternLMForCausalLM(InternLMPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = InternLMModel(config)

        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
```
The first argument of new_linear() identifies the parameter by name and can be one of the following: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Choose the one that matches the layer being replaced.
## Training Strategy
huggingface models trained through InternEvo support packed_dataset and flash_attention. In each model's config.py file, the use_packed_dataset and use_flash_attn switches control whether these two features are enabled.
The isp parallel training strategy of InternEvo is also supported. For details on how isp parallelism works, see the [InternEvo documentation](https://internevo.readthedocs.io/zh-cn/latest/parallel.html).

## Adaptation of Other Huggingface Models
For other models released on huggingface that need to be integrated into InternEvo for training, you can refer to the [huggingface model adaptation](./huggingface_model/model_adapt_en.md)

38 changes: 38 additions & 0 deletions huggingface_model/README-zh-Hans.md
@@ -0,0 +1,38 @@
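## isp Parallel
To train in isp parallel mode, modify the config.py file before starting training and change the tensor parallel mode to isp, as follows:
```python
parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=2, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=2, overlap=False, memory_pool=True),
)
```
Here, size in tensor is the sequence parallel size, and size in weight is the weight parallel size used in isp mode.
Note: overlap in the weight setting must be set to False.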

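The model's modeling file must also be modified so that the linear layers in the head, the attention computation, and the mlp are created with the new_linear() function provided by InternEvo. Taking the modeling file of the internlm model as an example:
```python
from torch import nn
from transformers.activations import ACT2FN

from internlm.model.modules.linear import new_linear

# InternLMConfig, InternLMModel and InternLMPreTrainedModel are defined
# elsewhere in the modeling file; only the changed lines are shown.


class InternLMMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
        super().__init__()
        # The first argument of new_linear() is the parameter name (see the list below).
        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
        self.act_fn = ACT2FN[hidden_act]


class InternLMAttention(nn.Module):
    def __init__(self, config: InternLMConfig):
        super().__init__()
        # self.hidden_size, self.num_heads and self.head_dim are set from
        # config in the original __init__ (omitted in this excerpt).
        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)


class InternLMForCausalLM(InternLMPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = InternLMModel(config)

        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
```
The first argument of new_linear() identifies the parameter by name. Accepted names are: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Choose the one that matches the layer being replaced.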
@@ -1,4 +1,4 @@
# Adapting HuggingFace Models for InternEvo Packed and ISP Training
# Adapting HuggingFace Models for InternEvo Packed dataset and ISP Training

## Background

@@ -46,7 +46,7 @@ Step 3. Pass `cu_seqlens` and `max_seqlen` to flash attention varlen kernel for

```python
if use_packed_dataset:
    attn_output = isp_flash_attn_varlen_func(
    attn_output = hf_q_k_v_with_cu_seqlens(
        query_states,
        key_states,
        value_states,
@@ -74,7 +74,7 @@ Step 2. Pass `cu_seqlens` and `max_seqlen` to flash attention varlen kernel for

```python
if use_packed_dataset:
    attn_output = isp_flash_attn_varlen_func(
    attn_output = hf_q_k_v_with_cu_seqlens(
        query_states,
        key_states,
        value_states,
@@ -112,4 +112,44 @@ parallel = dict(

### Manual code adaption dispatch

T.B.A.
## isp Parallel
To train in ISP mode, modify the config.py file before starting training and change the tensor parallel mode to ISP, as follows:
```python
parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=2, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=2, overlap=False, memory_pool=True),
)
```
Here, size in tensor is the sequence parallel size, and size in weight is the weight parallel size used in ISP mode.
Note: overlap in the weight setting must be set to False.
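
A mistake in these settings is cheaper to catch before a job is launched. The sketch below checks a `parallel` dict against the two requirements stated above (tensor mode set to "isp", weight overlap set to False). The helper `check_isp_parallel` is hypothetical and not part of InternEvo, and the divisibility check on the sizes is an assumption about typical parallel layouts rather than a rule taken from this document.

```python
from typing import List


def check_isp_parallel(parallel: dict, world_size: int) -> List[str]:
    """Return a list of problems found in an ISP-style `parallel` dict.

    Hypothetical helper for illustration only.
    """
    problems = []
    tensor = parallel.get("tensor", {})
    weight = parallel.get("weight", {})
    # Stated in the text: the tensor parallel mode must be "isp".
    if tensor.get("mode") != "isp":
        problems.append('tensor.mode must be "isp"')
    # Stated in the text: overlap in the weight setting must be False.
    if weight.get("overlap", False):
        problems.append("weight.overlap must be False")
    # Assumption: each parallel size should evenly divide the number of GPUs.
    for name, cfg in (("tensor", tensor), ("weight", weight)):
        size = cfg.get("size", 1)
        if size < 1 or world_size % size != 0:
            problems.append(f"{name}.size={size} does not divide world_size={world_size}")
    return problems


parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=2, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=2, overlap=False, memory_pool=True),
)
print(check_isp_parallel(parallel, world_size=8))  # → []
```

An empty list means none of the checked conditions failed; it does not guarantee that the configuration is valid for InternEvo in every other respect.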

The model's modeling file must also be modified so that the linear layers in the head, the attention computation, and the mlp are created with the new_linear() function provided by InternEvo. Taking the modeling file of the InternLM model as an example:
```python
from torch import nn
from transformers.activations import ACT2FN

from internlm.model.modules.linear import new_linear

# InternLMConfig, InternLMModel and InternLMPreTrainedModel are defined
# elsewhere in the modeling file; only the changed lines are shown.


class InternLMMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
        super().__init__()
        # The first argument of new_linear() is the parameter name (see the list below).
        self.gate_proj = new_linear("w1", hidden_size, intermediate_size, bias=False)
        self.down_proj = new_linear("w2", intermediate_size, hidden_size, bias=False)
        self.up_proj = new_linear("w3", hidden_size, intermediate_size, bias=False)
        self.act_fn = ACT2FN[hidden_act]


class InternLMAttention(nn.Module):
    def __init__(self, config: InternLMConfig):
        super().__init__()
        # self.hidden_size, self.num_heads and self.head_dim are set from
        # config in the original __init__ (omitted in this excerpt).
        self.q_proj = new_linear("wq", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.k_proj = new_linear("wk", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.v_proj = new_linear("wv", self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
        self.o_proj = new_linear("wo", self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)


class InternLMForCausalLM(InternLMPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = InternLMModel(config)

        self.lm_head = new_linear("head", config.hidden_size, config.vocab_size, bias=False)
```
The first argument of new_linear() identifies the parameter by name and can be one of the following: "head", "output", "wqkv", "wq", "wk", "wv", "wkv", "w1", "w3", "w13", "wo", "out_proj", "w2". Choose the one that matches the layer being replaced.
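
When adapting another model, it can help to write down which name each huggingface projection receives before editing the modeling file. The sketch below records the pairs used in the InternLM example above and checks them against the accepted names. `HF_TO_INTERNEVO` and `internevo_name` are hypothetical names introduced here for illustration; other models may use different attribute names or fused projections such as "wqkv".

```python
# Names accepted by new_linear(), as listed above.
ALLOWED_NAMES = {
    "head", "output", "wqkv", "wq", "wk", "wv", "wkv",
    "w1", "w3", "w13", "wo", "out_proj", "w2",
}

# Attribute-to-name pairs taken from the InternLM example above.
HF_TO_INTERNEVO = {
    "gate_proj": "w1",
    "down_proj": "w2",
    "up_proj": "w3",
    "q_proj": "wq",
    "k_proj": "wk",
    "v_proj": "wv",
    "o_proj": "wo",
    "lm_head": "head",
}


def internevo_name(hf_attr: str) -> str:
    """Look up the new_linear() name for a huggingface attribute (hypothetical helper)."""
    try:
        name = HF_TO_INTERNEVO[hf_attr]
    except KeyError:
        raise ValueError(f"no mapping recorded for {hf_attr!r}") from None
    if name not in ALLOWED_NAMES:
        raise ValueError(f"{name!r} is not an accepted new_linear() name")
    return name


print(internevo_name("gate_proj"))  # → w1
```

Keeping the mapping in one table makes a typo in a name fail early instead of reaching new_linear() at model construction time.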