System Info

With the same total_batch_size, training on a single node (8 GPUs) runs at the same speed as training on two nodes (16 GPUs). This is a blocker for anyone who wants to use this repository to scale up the amount of training data.

Reproduction

The job is launched with torchrun; the script is:
```bash
#!/bin/bash
set -x -e

export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
# export CUDA_LAUNCH_BLOCKING=1

echo "PYTHONPATH: ${PYTHONPATH}"
which_python=$(which python)
echo "which python: ${which_python}"
export PYTHONPATH=${PYTHONPATH}:${which_python}
export PYTHONPATH=${PYTHONPATH}:.
echo "PYTHONPATH: ${PYTHONPATH}"

export NNODES=2
export num_gpus=8
export WANDB_DISABLED=true
export full_batch_size=128
export batch_size=1
export gradient_accumulation_steps=$[$full_batch_size/($batch_size*$num_gpus*$NNODES)]
export CPUS_PER_TASK=20
export MASTER_PORT=$((RANDOM % 101 + 29400))

## slurm
export PARTITION=mllm
export JOB_NAME=rope
export QUOTA_TYPE=spot

export output_dir=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-m_rope-16card-test
export model_name_or_path=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct-with-Qwen2-Language-Backbone

srun -p ${PARTITION} \
    --job-name=${JOB_NAME} \
    --gres=gpu:${num_gpus} \
    --time=2-00:00:00 \
    --nodes=${NNODES} \
    --ntasks-per-node=1 \
    --cpus-per-task=${CPUS_PER_TASK} \
    bash -c 'torchrun \
    --nnodes $NNODES \
    --nproc_per_node ${num_gpus:-1} \
    --node_rank="${SLURM_NODEID}" \
    --master_addr=$(scontrol show hostname $SLURM_NODELIST | head -n1) \
    --master_port=$MASTER_PORT \
    /mnt/petrelfs/weixilin/projects/MLLM/LLaMA-Factory/src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --tokenized_path /mnt/petrelfs/weixilin/cache/training_qwen2vl_pretokenized_data-8k-context/ \
    --model_name_or_path $model_name_or_path \
    --stage sft \
    --do_train true \
    --finetuning_type full \
    --dataset shot2story_caption,textvr_caption,youcook2_caption,videochat_caption,k710_classification,videochat1_conversation,videochat2_conversation,videochatgpt_conversation,clevr_mc,ego_qa,tgif_frame_qa,clevr_qa \
    --template qwen2_vl \
    --cutoff_len 32768 \
    --overwrite_cache true \
    --preprocessing_num_workers 64 \
    --output_dir $output_dir \
    --num_train_epochs 1.0 \
    --logging_steps 1 \
    --save_steps 2500 \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size $batch_size \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --learning_rate 1.0e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 true \
    --ddp_timeout 180000000 \
    --val_size 1 \
    --per_device_eval_batch_size 1 \
    --eval_strategy steps \
    --eval_steps 500 \
    --flash_attn fa2 \
    --report_to none'
```
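Since NCCL_DEBUG=INFO is already set in the script, the stdout of the 16-GPU run also records which transport NCCL picked between the two nodes, which is useful context for the timing comparison. A quick check (the log file name is a placeholder, not an actual file from this run):

```bash
# "via NET/IB" means InfiniBand between nodes; "via NET/Socket" means plain
# Ethernet over eth0 (as forced by NCCL_SOCKET_IFNAME above), which is much
# slower for ZeRO-3 traffic.
LOGFILE=slurm-XXXX.out   # placeholder: the Slurm output file of the 16-GPU job
grep "via NET/" "$LOGFILE" | head -n 20
```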
Training logs: 16-GPU run / 8-GPU run.
Expected behavior

Multi-node training should take roughly half the time of single-node training.
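For reference, the batch arithmetic behind this expectation, using the variables from the script above (a rough sketch that assumes training is compute-bound and ignores inter-node communication):

```bash
# With total_batch_size fixed at 128 and per-device batch size 1, every
# optimizer step processes 128 samples regardless of node count, but the
# 16-GPU job only needs half as many gradient-accumulation micro-steps per
# GPU, so each optimizer step should take roughly half as long.
full_batch_size=128
batch_size=1
num_gpus=8
for NNODES in 1 2; do
  world_size=$((num_gpus * NNODES))
  grad_accum=$((full_batch_size / (batch_size * world_size)))
  echo "nodes=${NNODES} world_size=${world_size} grad_accum=${grad_accum}"
done
# nodes=1 world_size=8  grad_accum=16
# nodes=2 world_size=16 grad_accum=8
```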
Others

An existing issue, #4916, also mentions this problem. I am not sure whether llama-factory properly supports multi-node training.
There is communication latency between the nodes; try zero2.
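ZeRO-2 shards only optimizer states and gradients, not parameters, so it avoids ZeRO-3's per-layer parameter all-gathers across the slow inter-node link. A minimal ZeRO-2 config sketch is below; the file name and bucket sizes are illustrative, and if your LLaMA-Factory checkout already ships a ZeRO-2 file under examples/deepspeed, you can point --deepspeed at that instead.

```bash
# Write an illustrative ZeRO-2 DeepSpeed config ("auto" values are filled in
# by the HF Trainer integration from the training arguments).
cat > ds_z2_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
EOF

# Then swap the flag in the torchrun command:
#   --deepspeed examples/deepspeed/ds_z3_config.json
# becomes
#   --deepspeed ds_z2_config.json
```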