# NPU Best Practice

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Fine-tuning](#fine-tuning)
- [Inference](#inference)

## Environment Preparation

Experimental environment: 8 * Ascend 910B3

```shell
pip install ms-swift -U
pip install torch-npu
```

Verify that the environment is installed correctly:
```python
from transformers.utils import is_torch_npu_available
import torch

print(is_torch_npu_available())  # True
print(torch.npu.device_count())  # 8
```
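
You can also check the devices from the shell. This assumes the Ascend driver utilities are installed; `npu-smi` ships with the NPU driver and plays a role similar to `nvidia-smi`:

```shell
# Lists each Ascend NPU along with its health, power, and memory usage.
npu-smi info
```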

## Fine-tuning

The following demonstrates LoRA fine-tuning; to run full-parameter fine-tuning instead, set `--sft_type full`.

### Single-Card Training

Start single-card fine-tuning with the following command:

```shell
# Experimental Environment: Ascend 910B3
# NPU Memory Requirement: 25GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output
```
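
A single-card run takes around 8 hours, so it can be convenient to detach the job from the terminal. The sketch below is plain POSIX shell, not a swift-specific feature:

```shell
# Same command as above, kept alive after logout; output captured to train.log.
ASCEND_RT_VISIBLE_DEVICES=0 \
nohup swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output > train.log 2>&1 &

# Follow training progress.
tail -f train.log
```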

### Training with DDP

```shell
# Experimental Environment: 4 * Ascend 910B3
# NPU Memory Requirement: 4 * 30GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output
```
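
The data-parallel world size follows `NPROC_PER_NODE` and the visible devices, so the same command scales to other card counts. For example, on two cards (per-card memory should be similar; the 2-hour runtime above no longer applies):

```shell
NPROC_PER_NODE=2 \
ASCEND_RT_VISIBLE_DEVICES=0,1 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output
```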

### Training with DeepSpeed

ZeRO2:
```shell
# Experimental Environment: 4 * Ascend 910B3
# NPU Memory Requirement: 4 * 28GB
# Runtime: 3 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output \
--deepspeed default-zero2
```

ZeRO3:
```shell
# Experimental Environment: 4 * Ascend 910B3
# NPU Memory Requirement: 4 * 25GB
# Runtime: 8 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output \
--deepspeed default-zero3
```

Note the trade-off visible in the numbers above: ZeRO3 additionally shards the model parameters, which lowers per-card memory (25GB vs. 28GB) at the cost of extra communication, reflected in the longer runtime (8 hours vs. 3).

## Inference

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat
```
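
Generation behavior can be adjusted from the command line as well. A sketch assuming your ms-swift version exposes the usual inference arguments (`--max_new_tokens`, `--temperature`); confirm with `swift infer --help`:

```shell
# Assumed flags from ms-swift's inference arguments; verify against your version.
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--model_type qwen1half-7b-chat \
--max_new_tokens 512 \
--temperature 0.3
```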

After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer --ckpt_dir xxx/checkpoint-xxx --load_dataset_config true
```
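
To deploy the tuned model without loading the adapter at inference time, the LoRA weights can be merged into the base model. A sketch assuming the `swift export` entry point with `--merge_lora`; check the docs of your installed version before relying on it:

```shell
# Assumed invocation: writes a merged checkpoint alongside the original (location assumed).
ASCEND_RT_VISIBLE_DEVICES=0 swift export \
--ckpt_dir xxx/checkpoint-xxx \
--merge_lora true
```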