
Multi-node training speed is the same as single-node #6111

Open · 1 task done
Wiselnn570 opened this issue Nov 22, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@Wiselnn570

Reminder

  • I have read the README and searched the existing issues.

System Info

With the same total_batch_size, training on a single node (8 GPUs) is just as fast as training on two nodes (16 GPUs). This is a blocker for anyone who wants to use this repository to scale up to larger datasets.

Reproduction

Launched with torchrun; the script is:

#!/bin/bash
set -x -e
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
# export CUDA_LAUNCH_BLOCKING=1
echo "PYTHONPATH: ${PYTHONPATH}"
which_python=$(which python)
echo "which python: ${which_python}"
export PYTHONPATH=${PYTHONPATH}:${which_python}
export PYTHONPATH=${PYTHONPATH}:.
echo "PYTHONPATH: ${PYTHONPATH}"
export NNODES=2
export num_gpus=8
export WANDB_DISABLED=true
export full_batch_size=128
export batch_size=1
export gradient_accumulation_steps=$[$full_batch_size/($batch_size*$num_gpus*$NNODES)]
export CPUS_PER_TASK=20
export MASTER_PORT=$((RANDOM % 101 + 29400))
## slurm
export PARTITION=mllm
export JOB_NAME=rope
export QUOTA_TYPE=spot
export output_dir=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-m_rope-16card-test
export model_name_or_path=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct-with-Qwen2-Language-Backbone
srun -p ${PARTITION} \
    --job-name=${JOB_NAME} \
    --gres=gpu:${num_gpus} \
    --time=2-00:00:00 \
    --nodes=${NNODES} \
    --ntasks-per-node=1 \
    --cpus-per-task=${CPUS_PER_TASK} \
    bash -c 'torchrun \
    --nnodes $NNODES \
    --nproc_per_node ${num_gpus:-1} \
    --node_rank="${SLURM_NODEID}" \
    --master_addr=$(scontrol show hostname $SLURM_NODELIST | head -n1) \
    --master_port=$MASTER_PORT \
    /mnt/petrelfs/weixilin/projects/MLLM/LLaMA-Factory/src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --tokenized_path /mnt/petrelfs/weixilin/cache/training_qwen2vl_pretokenized_data-8k-context/ \
    --model_name_or_path $model_name_or_path \
    --stage sft \
    --do_train true \
    --finetuning_type full \
    --dataset shot2story_caption,textvr_caption,youcook2_caption,videochat_caption,k710_classification,videochat1_conversation,videochat2_conversation,videochatgpt_conversation,clevr_mc,ego_qa,tgif_frame_qa,clevr_qa \
    --template qwen2_vl \
    --cutoff_len 32768 \
    --overwrite_cache true \
    --preprocessing_num_workers 64 \
    --output_dir $output_dir \
    --num_train_epochs 1.0 \
    --logging_steps 1 \
    --save_steps 2500 \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size $batch_size \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --learning_rate 1.0e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 true \
    --ddp_timeout 180000000 \
    --val_size 1 \
    --per_device_eval_batch_size 1 \
    --eval_strategy steps \
    --eval_steps 500 \
    --flash_attn fa2 \
    --report_to none'
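
For reference, the gradient_accumulation_steps arithmetic in the script above works out as follows. This is a standalone check using the script's own numbers, not part of the original launch script:

#!/bin/bash
# Worked check of the effective-batch arithmetic from the launch script above.
full_batch_size=128
batch_size=1                 # per-device micro batch
num_gpus=8                   # GPUs per node
for nnodes in 1 2; do
    grad_accum=$(( full_batch_size / (batch_size * num_gpus * nnodes) ))
    echo "nodes=${nnodes}: gradient_accumulation_steps=${grad_accum}"
done
# nodes=1: gradient_accumulation_steps=16
# nodes=2: gradient_accumulation_steps=8
# With the same global batch of 128, the 16-GPU run performs half as many
# micro-batches per optimizer step, so each step should take roughly half as
# long unless inter-node communication becomes the bottleneck.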

Training log, 16 GPUs:
[screenshot]
8 GPUs:
[screenshot]

Expected behavior

Multi-node training time should be roughly half of the single-node training time.

Others

An existing issue also mentions this problem: #4916.
I am not sure whether llama-factory properly supports multi-node training.

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 22, 2024
@hiyouga
Owner

hiyouga commented Nov 22, 2024

There is communication latency between nodes; try ZeRO-2.
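
A minimal sketch of switching to ZeRO-2: the field names below are standard DeepSpeed options and the "auto" values defer to the HF Trainer arguments, but the exact config contents are an assumption, not taken from this issue. ZeRO-2 shards only optimizer states and gradients, so it avoids ZeRO-3's per-layer parameter all-gathers over the slower inter-node link.

#!/bin/bash
# Sketch: write a minimal ZeRO-2 DeepSpeed config and point --deepspeed at it.
# LLaMA-Factory also ships example configs under examples/deepspeed/ (e.g. a
# stage-2 config), which can be used directly instead of this file.
cat > ds_z2_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
EOF
# Then change the launch command above to pass:
#   --deepspeed ds_z2_config.json \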
