[KUNLUN] add llama70B case (#470)
* [KUNLUN] add llama70B case

* [KUNLUN] add llama70B case

* Merge branch 'main' of https://github.com/ZLkanyo009/FlagPerf into main

* Update README.md

---------

Co-authored-by: zhangling21 <[email protected]>
ZLkanyo009 and ZhangLing21 authored Mar 5, 2024
1 parent 22efc70 commit 9b0280a
Showing 9 changed files with 147 additions and 2 deletions.
4 changes: 3 additions & 1 deletion training/benchmarks/llama2_70B/megatron/megatron_main.sh
@@ -131,8 +131,10 @@ LOGGING_ARGS="
   --log-interval 1
 "

+CODE_PATH="/workspace/FlagScale/pretrain_llama.py"
+
 source $VENDOR_SHELL
-cmd="torchrun $DISTRIBUTED_ARGS /workspace/FlagScale/pretrain_llama.py \
+cmd="torchrun $DISTRIBUTED_ARGS $CODE_PATH \
 $TRAINING_ARGS \
 $MIXED_PRECISION_ARGS \
 $DATA_ARGS \
6 changes: 6 additions & 0 deletions training/kunlunxin/docker_image/megatron/Dockerfile
@@ -0,0 +1,6 @@
FROM iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.27
RUN /bin/bash -c "pip config set global.index-url https://mirror.baidu.com/pypi/simple"
RUN /bin/bash -c "uname -a"
RUN /bin/bash -c alias python3=python

ENV PATH /root/miniconda/envs/python38_torch201_cuda/bin:$PATH
14 changes: 14 additions & 0 deletions training/kunlunxin/docker_image/megatron/megatron_install.sh
@@ -0,0 +1,14 @@
#!/bin/bash
# using github mirrors to avoid github TTL
#export https_proxy=http://10.1.0.34:7890
git clone https://githubfast.com/FlagOpen/FlagScale
cd FlagScale

git checkout eb0438a5459404e2e4c70b15fa37e9a197ab159d
echo 'export PYTHONPATH=$PYTHONPATH:/home/FlagScale' >> /root/.bashrc
source /root/.bashrc

wget https://bd.bcebos.com/v1/klx-pytorch-work-bd/training/zhangling21_llama70B/xmlir201_5.run
bash xmlir201_5.run
XFLAGS --enable transformer_engine
XFLAGS --enable flagscale
49 changes: 49 additions & 0 deletions training/kunlunxin/llama2_70B-megatron/README.md
@@ -0,0 +1,49 @@
### Kunlunxin XPU Configuration and Run Reference
#### Environment Configuration
- ##### Hardware Environment
  - Machine model: Kunlunxin AI accelerator group R480-X8
  - Accelerator card model: Kunlunxin AI accelerator card R300
  - Multi-node network type and bandwidth: InfiniBand, 200 Gb/s

- ##### Software Environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.4.0-26-generic
  - Accelerator driver version: 4.0.25
  - Docker image and version: iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.27
  - Training framework version: xmlir
  - Training compiler version: xacc
  - Dependency versions: pytorch-2.0.1


### Run Information

* Input batch sizes
  1. local_batchsize (micro_batchsize), abbreviated LBS: the tensor batch size actually fed to the model, set in config_R300x10x8.py; defaults to 1 in this case
  2. seqlength (max_position_embedding), abbreviated MPE: the sequence length actually fed to the model, set in config_R300x10x8.py; defaults to 4096 in this case
  3. gradient_accumulate_steps, abbreviated GAS: the number of gradient accumulation steps, set in config_R300x10x8.py (accumulate_steps); defaults to 44 in this case
  4. global_batchsize is always local_batchsize \* gradient_accumulate_steps \* data_parallel_size, where data_parallel_size = world_size / TPsize / PPsize in this case (see the worked sketch after this list)
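
A worked sketch of the formula in item 4, using this case's defaults and the 10x8 R300 topology from the performance table below (illustrative arithmetic only, not output from the benchmark itself):

```python
# Global batch size for this case, following the formula above.
# Assumed topology: 10 nodes x 8 R300 cards with TP=8, PP=10
# (from config_R300x10x8.py and the parallel column in the table below).
local_batchsize = 1               # LBS (micro batch size)
gradient_accumulate_steps = 44    # GAS
world_size = 10 * 8               # 80 cards in total
tp_size, pp_size = 8, 10
data_parallel_size = world_size // (tp_size * pp_size)   # = 1
global_batchsize = local_batchsize * gradient_accumulate_steps * data_parallel_size
print(data_parallel_size, global_batchsize)               # 1 44
```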

* Common metrics

| Metric | Value | Notes |
| ------------ | -------------------------- | ---------------------------------- |
| Task category | Natural language understanding | |
| Model | llama2_70b | |
| Dataset | pile wikipedia | |
| Data precision | precision, see "Performance metrics" | One of fp32/amp/fp16/bf16 |
| Hyperparameter change | parallel, see "Performance metrics" | Format TPxPPyDPz, e.g. TP2PP1DP4 |
| Hyperparameter change | fix_hp, see "Performance metrics" | Extra hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware | Kunlunxin R300 | |
| Hardware memory usage | mem, see "Performance metrics" | Commonly called "device memory", in GiB |
| Compute utilization | MFU, see "Performance metrics" | As defined in the PaLM paper (see the sketch after this table) |
| **Throughput** | **token/p/s, see "Performance metrics"** | Average tokens processed per card per second |
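
The MFU row follows the PaLM paper's model-FLOPs-utilization definition. The sketch below shows one common way to estimate it for a decoder-only model, assuming the ~6 FLOPs-per-parameter-per-token approximation and the theoryflops value from config_R300x10x8.py; the throughput number is hypothetical and the exact formula FlagPerf applies may differ.

```python
# Rough MFU estimate for a decoder-only model (PaLM-style model FLOPs utilization).
# Assumptions: ~6 FLOPs per parameter per token (forward + backward), attention's
# quadratic term ignored; per-card peak taken from theoryflops in config_R300x10x8.py.
# Illustrative only - not the benchmark's own computation.
params = 70e9                        # llama2-70B parameter count (approximate)
tokens_per_card_per_second = 100.0   # hypothetical throughput (token/p/s)
peak_flops_per_card = 256e12         # theoryflops from config_R300x10x8.py
achieved_flops = 6 * params * tokens_per_card_per_second
mfu = achieved_flops / peak_flops_per_card
print(f"MFU ~= {mfu:.2%}")           # ~16% for this hypothetical throughput
```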

* Performance metrics

Note that the last experiment below uses the same global_batchsize as the original llama2 paper and trains for 100 steps; it also serves as the accuracy-alignment experiment.

| Config | precision | parallel | fix_hp | token/p/s | Accuracy aligned | mem | MFU |
| ------------------- | --------- | --------- | ---------------------------- | --------- | ----- | ----- | --- |
| R300 10 nodes x 8 cards (10x8) | fp32 | TP8PP10DP1 | / | / | / | 21/32 | / |
| R300 10 nodes x 8 cards (10x8) | amp | TP8PP10DP1 | GAS=1024 (GBS=1024 = 4M tokens) | / | doing* | 21/32 | / |

\* Due to the shortage of R300 machines, accuracy is first being verified on a single R300 card against a single GPU. This has been done by reducing the number of model layers; accuracy verification of the full 70B model is still in progress.
10 changes: 10 additions & 0 deletions training/kunlunxin/llama2_70B-megatron/config/config_R300x10x8.py
@@ -0,0 +1,10 @@
seqlength = 4096
batchsize = 1
accumulate_steps = 44
train_tokens = 100000000
theoryflops = 256000000000000.0
epochs = 1
flashattn = False
recompute = False
tensor_parallel = 8
pipeline_parallel = 10
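
A hedged sketch of how these config values translate into the usual Megatron-style training quantities; the assumption that one training sample corresponds to seqlength tokens is mine, and FlagPerf's launcher may compute these slightly differently.

```python
# Rough training-length estimate from config_R300x10x8.py.
# Assumptions: one sample = seqlength tokens, and global batch =
# batchsize * accumulate_steps * DP with DP = 1 for TP8 x PP10 on 80 cards.
seqlength = 4096
train_tokens = 100_000_000
global_batch_size = 1 * 44 * 1                     # LBS * GAS * DP = 44
train_samples = train_tokens // seqlength          # ~24414 samples
train_steps = train_samples // global_batch_size   # ~554 optimizer steps
print(train_samples, train_steps)
```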
@@ -0,0 +1 @@
export PATH=/root/miniconda/envs/python38_torch201_cuda/bin:$PATH
@@ -0,0 +1 @@
sentencepiece
61 changes: 61 additions & 0 deletions training/kunlunxin/llama2_70B-megatron/config/training_adapter.sh
@@ -0,0 +1,61 @@
export PYTHONPATH=$PYTHONPATH:/home/FlagScale

MIXED_PRECISION_ARGS=""

CODE_PATH="/home/FlagScale/pretrain_llama.py"

TRAINING_ARGS="
--train-samples $TRAIN_SAMPLES \
--eval-iters 0 \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--micro-batch-size $M_BATCHSIZE \
--global-batch-size $G_BATCHSIZE \
--disable-bias-linear \
--optimizer adam \
--no-gradient-accumulation-fusion \
--recompute-granularity 'full' \
--recompute-num-layers 1 \
--recompute-method 'uniform' \
--no-async-tensor-model-parallel-allreduce \
--distribute-saved-activations
"
NETWORK_ARGS="
--num-layers 80 \
--hidden-size 8192 \
--num-attention-heads 64 \
--ffn-hidden-size 28672 \
--seq-length $SEQLENGTH \
--max-position-embeddings $SEQLENGTH \
--normalization RMSNorm \
--group-query-attention \
--num-query-groups 8 \
--use-rotary-position-embeddings \
--no-position-embedding \
--swiglu \
--multiple-of 4096 \
--untie-embeddings-and-output-weights
"


export BKCL_CCIX_BUFFER_GM=1
export BKCL_CCIX_RING=1
export BKCL_TREE_THRESHOLD=1

export BKCL_SOCKET_IFNAME=ibs11
export BKCL_USE_RDMA=0

export BKCL_RDMA_FORCE_TREE=1
export BKCL_ENABLE_XDR=0
export BKCL_RING_BUFFER_SIZE=1024000
export BKCL_RDMA_NICS=ibs11
export BKCL_FORCE_ALLREDUCE_IN_MULTINODE=1
worker_num=0

ulimit -c 0
export XMLIR_F_XPU_ENABLED_BOOL=true
export ALLREDUCE_ASYNC=false
export ALLGATHER_ASYNC=false
export ALLREDUCE_FUSION=0
export BKCL_TIMEOUT=1800
export BKCL_FORCE_SYNC=1
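
As a sanity check on NETWORK_ARGS above, the architecture values roughly reproduce llama2-70B's parameter count. The sketch assumes llama2's 32000-token vocabulary (not set in this script) and counts only the dominant weight matrices.

```python
# Approximate parameter count implied by NETWORK_ARGS above.
# Assumptions: vocab size 32000 (llama2 tokenizer, not specified in this script),
# SwiGLU MLP with three weight matrices, grouped-query attention with 8 KV groups,
# untied input/output embeddings; biases and norm weights are ignored.
hidden, layers, heads, ffn, kv_groups, vocab = 8192, 80, 64, 28672, 8, 32000
head_dim = hidden // heads                       # 128
attn = hidden * hidden * 2                       # Q and output projections
attn += hidden * (kv_groups * head_dim) * 2      # K and V projections (GQA)
mlp = 3 * hidden * ffn                           # SwiGLU: gate, up, down
embeddings = 2 * vocab * hidden                  # untied input + output embeddings
total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")         # ~69.0B, i.e. llama2-70B scale
```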
3 changes: 2 additions & 1 deletion training/run_benchmarks/config/test_conf.py
@@ -133,7 +133,8 @@
# "transformer:pytorch:R300:1:8:1": "/raid/dataset/transformer/wmt14_en_de_joined_dict",
# "bigtransfer:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "efficientnet:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",

# "llama2_70B:megatron:R300:10:8:1": "/raid/dataset/llama2_70B_pretrain",

# iluvatar cases
# "bigtransfer:pytorch:BI-V100:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "vit:pytorch:BI-V100:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
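
For reference, the commented-out case key added above appears to encode the run topology. The sketch below splits it under the assumption that the field order is model:framework:hardware:node_count:devices_per_node:repeat, inferred from the other keys in this file rather than from a documented spec.

```python
# Splitting a FlagPerf test_conf.py case key into its fields.
# Assumed field order (inferred from the keys in this file, not a documented spec):
# model : framework : hardware : node_count : devices_per_node : repeat
case_key = "llama2_70B:megatron:R300:10:8:1"
model, framework, hardware, nodes, devices, repeat = case_key.split(":")
world_size = int(nodes) * int(devices)
print(model, framework, hardware, world_size)   # llama2_70B megatron R300 80
```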
