Commit 9b0280a (parent: 22efc70)

[KUNLUN] add llama70B case

* [KUNLUN] add llama70B case
* Merge branch 'main' of https://github.com/ZLkanyo009/FlagPerf into main
* Update README.md

Co-authored-by: zhangling21 <[email protected]>

Showing 9 changed files with 147 additions and 2 deletions.
6 changes: 6 additions & 0 deletions
training/kunlunxin/docker_image/megatron/Dockerfile

```dockerfile
FROM iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.27
# Use the Baidu PyPI mirror for pip installs inside the image
RUN /bin/bash -c "pip config set global.index-url https://mirror.baidu.com/pypi/simple"
RUN /bin/bash -c "uname -a"
# Quoted so the whole command reaches bash (note: an alias set in one
# RUN layer does not persist into later layers or running containers)
RUN /bin/bash -c "alias python3=python"
ENV PATH /root/miniconda/envs/python38_torch201_cuda/bin:$PATH
```
14 changes: 14 additions & 0 deletions
training/kunlunxin/docker_image/megatron/megatron_install.sh
```bash
#!/bin/bash
# Clone FlagScale through a GitHub mirror to avoid GitHub connectivity issues
#export https_proxy=http://10.1.0.34:7890
git clone https://githubfast.com/FlagOpen/FlagScale
cd FlagScale

# Pin FlagScale to a known-good commit and put it on PYTHONPATH
git checkout eb0438a5459404e2e4c70b15fa37e9a197ab159d
echo 'export PYTHONPATH=$PYTHONPATH:/home/FlagScale' >> /root/.bashrc
source /root/.bashrc

# Install the XMLIR package, then enable its Transformer Engine and
# FlagScale integrations via XFLAGS
wget https://bd.bcebos.com/v1/klx-pytorch-work-bd/training/zhangling21_llama70B/xmlir201_5.run
bash xmlir201_5.run
XFLAGS --enable transformer_engine
XFLAGS --enable flagscale
```
49 changes: 49 additions & 0 deletions
training/kunlunxin/llama2_70B-megatron/README.md
### Kunlunxin XPU Configuration and Run Information Reference

#### Environment Setup

- ##### Hardware environment
  - Machine model: Kunlunxin AI accelerator group R480-X8
  - Accelerator card model: Kunlunxin AI accelerator card R300
  - Multi-node network type and bandwidth: InfiniBand, 200Gb/s

- ##### Software environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.4.0-26-generic
  - Accelerator driver version: 4.0.25
  - Docker image and version: iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.27
  - Training framework version: xmlir
  - Training compiler version: xacc
  - Dependency version: pytorch-2.0.1

### Run Information
* Input batch sizes
  1. local_batchsize (micro_batchsize), abbreviated LBS: the tensor batch size actually fed into the model, set in config_R300x10x8.py; defaults to 1 in this case.
  2. seqlength (max_position_embedding), abbreviated MPE: the sequence length actually fed into the model, set in config_R300x10x8.py; defaults to 4096 in this case.
  3. gradient_accumulate_steps, abbreviated GAS: the number of gradient accumulation steps, set via accumulate_steps in config_R300x10x8.py; defaults to 44 in this case.
  4. global_batchsize is always local_batchsize \* gradient_accumulate_steps \* data_parallel_size. In this case, data_parallel_size = world_size / TPsize / PPsize, as sketched below.
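A minimal sketch of this bookkeeping (a hypothetical helper for illustration, not code from the commit):

```python
def global_batchsize(lbs: int, gas: int, world_size: int, tp: int, pp: int) -> int:
    """global_batchsize = LBS * GAS * DP, where DP = world_size / TP / PP."""
    dp = world_size // (tp * pp)
    return lbs * gas * dp

# This case: 80 cards (10 nodes x 8), TP=8, PP=10 -> DP=1, so GBS = 1 * 44 * 1 = 44
assert global_batchsize(lbs=1, gas=44, world_size=80, tp=8, pp=10) == 44
```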
* Common metrics

| Metric | Value | Notes |
| ------------ | -------------------------- | ---------------------------------- |
| Task category | natural language understanding | |
| Model | llama2_70b | |
| Dataset | pile wikipedia | |
| Data precision | precision, see "Performance metrics" | one of fp32/amp/fp16/bf16 |
| Hyperparameter change | parallel, see "Performance metrics" | format TPxPPyDPz, e.g. TP2PP1DP4 |
| Hyperparameter change | fix_hp, see "Performance metrics" | special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware abbreviation | kunlunxin R300 | |
| Device memory usage | mem, see "Performance metrics" | commonly called "device memory", in GiB |
| Compute utilization | MFU, see "Performance metrics" | as defined in the PaLM paper (see the sketch below) |
| **Throughput** | **token/p/s, see "Performance metrics"** | average tokens processed per card per second |
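The PaLM-style MFU is the achieved model FLOPs per second divided by the device's theoretical peak. A minimal sketch (a hypothetical helper; the ~6N FLOPs-per-token approximation and the example throughput are assumptions, not values from this commit):

```python
def mfu(tokens_per_device_per_sec: float, n_params: float, theoryflops: float) -> float:
    """PaLM-style MFU: achieved FLOPs/s over theoretical peak FLOPs/s.

    Uses the common ~6*N FLOPs-per-token approximation for a decoder-only
    model (forward + backward), ignoring attention-score FLOPs.
    """
    achieved_flops_per_sec = tokens_per_device_per_sec * 6 * n_params
    return achieved_flops_per_sec / theoryflops

# Illustrative only: ~70e9 parameters, theoryflops from config_R300x10x8.py,
# and a made-up per-card throughput of 10 token/p/s
print(f"{mfu(10.0, 70e9, 2.56e14):.2%}")
```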
* Performance metrics

Note that the experiment below with GAS=1024 uses the same global_batchsize as the original llama2 paper and trains for 100 steps; it also serves as the accuracy-alignment experiment.

| Config | precision | parallel | fix_hp | token/p/s | accuracy aligned | mem | MFU |
| ------------------- | --------- | --------- | ---------------------------- | --------- | ----- | ----- | --- |
| R300, 10 nodes x 8 cards (10x8) | fp32 | TP8PP10DP1 | / | / | / | 21/32 | / |
| R300, 10 nodes x 8 cards (10x8) | amp | TP8PP10DP1 | GAS=1024 (GBS=1024 = 4M tokens) | / | doing* | 21/32 | / |

\* Due to a shortage of R300 machines, accuracy was preliminarily verified on a single R300 card against a single GPU. So far, accuracy has been verified on a single R300 card and a single GPU with the number of model layers reduced; accuracy verification of the full 70B model is in progress.
10 changes: 10 additions & 0 deletions
training/kunlunxin/llama2_70B-megatron/config/config_R300x10x8.py
@@ -0,0 +1,10 @@ | ||
seqlength = 4096 | ||
batchsize = 1 | ||
accumulate_steps = 44 | ||
train_tokens = 100000000 | ||
theoryflops = 256000000000000.0 | ||
epochs = 1 | ||
flashattn = False | ||
recompute = False | ||
tensor_parallel = 8 | ||
pipeline_parallel = 10 |
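A quick sanity check of what these values imply (illustrative arithmetic, not code from the commit):

```python
# With 80 cards, TP=8 and PP=10 leave DP=1, so every optimizer step
# consumes GBS * seqlength tokens.
seqlength, batchsize, accumulate_steps = 4096, 1, 44
world_size, tp, pp = 80, 8, 10           # 10 nodes x 8 R300 cards

dp = world_size // (tp * pp)             # -> 1
gbs = batchsize * accumulate_steps * dp  # -> 44
tokens_per_step = gbs * seqlength        # -> 180,224
steps = 100000000 // tokens_per_step     # train_tokens -> ~554 steps
print(dp, gbs, tokens_per_step, steps)
```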
1 change: 1 addition & 0 deletions
training/kunlunxin/llama2_70B-megatron/config/environment_variables.sh

```bash
export PATH=/root/miniconda/envs/python38_torch201_cuda/bin:$PATH
```
1 change: 1 addition & 0 deletions
training/kunlunxin/llama2_70B-megatron/config/requirements.txt

```
sentencepiece
```
61 changes: 61 additions & 0 deletions
training/kunlunxin/llama2_70B-megatron/config/training_adapter.sh
@@ -0,0 +1,61 @@ | ||
export PYTHONPATH=$PYTHONPATH:/home/FlagScale | ||
|
||
MIXED_PRECISION_ARGS="" | ||
|
||
CODE_PATH="/home/FlagScale/pretrain_llama.py" | ||
|
||
TRAINING_ARGS=" | ||
--train-samples $TRAIN_SAMPLES \ | ||
--eval-iters 0 \ | ||
--tensor-model-parallel-size $TP \ | ||
--pipeline-model-parallel-size $PP \ | ||
--micro-batch-size $M_BATCHSIZE \ | ||
--global-batch-size $G_BATCHSIZE \ | ||
--disable-bias-linear \ | ||
--optimizer adam \ | ||
--no-gradient-accumulation-fusion \ | ||
--recompute-granularity 'full' \ | ||
--recompute-num-layers 1 \ | ||
--recompute-method 'uniform' \ | ||
--no-async-tensor-model-parallel-allreduce \ | ||
--distribute-saved-activations | ||
" | ||
NETWORK_ARGS=" | ||
--num-layers 80 \ | ||
--hidden-size 8192 \ | ||
--num-attention-heads 64 \ | ||
--ffn-hidden-size 28672 \ | ||
--seq-length $SEQLENGTH \ | ||
--max-position-embeddings $SEQLENGTH \ | ||
--normalization RMSNorm \ | ||
--group-query-attention \ | ||
--num-query-groups 8 \ | ||
--use-rotary-position-embeddings \ | ||
--no-position-embedding \ | ||
--swiglu \ | ||
--multiple-of 4096 \ | ||
--untie-embeddings-and-output-weights | ||
" | ||
```bash
# BKCL (Kunlunxin collective communication library) tuning
export BKCL_CCIX_BUFFER_GM=1
export BKCL_CCIX_RING=1
export BKCL_TREE_THRESHOLD=1

# Multi-node transport: use the ibs11 interface, with RDMA disabled
export BKCL_SOCKET_IFNAME=ibs11
export BKCL_USE_RDMA=0

export BKCL_RDMA_FORCE_TREE=1
export BKCL_ENABLE_XDR=0
export BKCL_RING_BUFFER_SIZE=1024000
export BKCL_RDMA_NICS=ibs11
export BKCL_FORCE_ALLREDUCE_IN_MULTINODE=1
worker_num=0

ulimit -c 0   # disable core dumps
export XMLIR_F_XPU_ENABLED_BOOL=true
export ALLREDUCE_ASYNC=false
export ALLGATHER_ASYNC=false
export ALLREDUCE_FUSION=0
export BKCL_TIMEOUT=1800
export BKCL_FORCE_SYNC=1
```