System Info

With the same total_batch_size, training on a single node (8 GPUs) runs at the same speed as training on two nodes (16 GPUs). This is a blocker for anyone who wants to use this repository to scale up the amount of training data.

Reproduction

The job is launched with torchrun; the script is:
```bash
#!/bin/bash
set -x -e

export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
# export CUDA_LAUNCH_BLOCKING=1

echo "PYTHONPATH: ${PYTHONPATH}"
which_python=$(which python)
echo "which python: ${which_python}"
export PYTHONPATH=${PYTHONPATH}:${which_python}
export PYTHONPATH=${PYTHONPATH}:.
echo "PYTHONPATH: ${PYTHONPATH}"

export NNODES=2
export num_gpus=8
export WANDB_DISABLED=true
export full_batch_size=128
export batch_size=1
export gradient_accumulation_steps=$[$full_batch_size/($batch_size*$num_gpus*$NNODES)]
export CPUS_PER_TASK=20
export MASTER_PORT=$((RANDOM % 101 + 29400))

## slurm
export PARTITION=mllm
export JOB_NAME=rope
export QUOTA_TYPE=spot

export output_dir=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-m_rope-16card-test
export model_name_or_path=/mnt/hwfile/mllm/weixilin/cache/Qwen2-VL-7B-Instruct-with-Qwen2-Language-Backbone

srun -p ${PARTITION} \
    --job-name=${JOB_NAME} \
    --gres=gpu:${num_gpus} \
    --time=2-00:00:00 \
    --nodes=${NNODES} \
    --ntasks-per-node=1 \
    --cpus-per-task=${CPUS_PER_TASK} \
    bash -c 'torchrun \
    --nnodes $NNODES \
    --nproc_per_node ${num_gpus:-1} \
    --node_rank="${SLURM_NODEID}" \
    --master_addr=$(scontrol show hostname $SLURM_NODELIST | head -n1) \
    --master_port=$MASTER_PORT \
    /mnt/petrelfs/weixilin/projects/MLLM/LLaMA-Factory/src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --tokenized_path /mnt/petrelfs/weixilin/cache/training_qwen2vl_pretokenized_data-8k-context/ \
    --model_name_or_path $model_name_or_path \
    --stage sft \
    --do_train true \
    --finetuning_type full \
    --dataset shot2story_caption,textvr_caption,youcook2_caption,videochat_caption,k710_classification,videochat1_conversation,videochat2_conversation,videochatgpt_conversation,clevr_mc,ego_qa,tgif_frame_qa,clevr_qa \
    --template qwen2_vl \
    --cutoff_len 32768 \
    --overwrite_cache true \
    --preprocessing_num_workers 64 \
    --output_dir $output_dir \
    --num_train_epochs 1.0 \
    --logging_steps 1 \
    --save_steps 2500 \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size $batch_size \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --learning_rate 1.0e-5 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 true \
    --ddp_timeout 180000000 \
    --val_size 1 \
    --per_device_eval_batch_size 1 \
    --eval_strategy steps \
    --eval_steps 500 \
    --flash_attn fa2 \
    --report_to none'
```
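Since NCCL_DEBUG=INFO is already set in the script, the stdout of the 16-GPU run also records which transport NCCL picked between the two nodes, which is useful context for the timing comparison. A quick check (the log file name is a placeholder, not an actual file from this run):

```bash
# "via NET/IB" means InfiniBand between nodes; "via NET/Socket" means plain
# Ethernet over eth0 (as forced by NCCL_SOCKET_IFNAME above), which is much
# slower for ZeRO-3 traffic.
LOGFILE=slurm-XXXX.out   # placeholder: the Slurm output file of the 16-GPU job
grep "via NET/" "$LOGFILE" | head -n 20
```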
Training logs: 16-GPU run / 8-GPU run.
Expected behavior

Multi-node training should take roughly half the time of single-node training.
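For reference, the batch arithmetic behind this expectation, using the variables from the script above (a rough sketch that assumes training is compute-bound and ignores inter-node communication):

```bash
# With total_batch_size fixed at 128 and per-device batch size 1, every
# optimizer step processes 128 samples regardless of node count, but the
# 16-GPU job only needs half as many gradient-accumulation micro-steps per
# GPU, so each optimizer step should take roughly half as long.
full_batch_size=128
batch_size=1
num_gpus=8
for NNODES in 1 2; do
  world_size=$((num_gpus * NNODES))
  grad_accum=$((full_batch_size / (batch_size * world_size)))
  echo "nodes=${NNODES} world_size=${world_size} grad_accum=${grad_accum}"
done
# nodes=1 world_size=8  grad_accum=16
# nodes=2 world_size=16 grad_accum=8
```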
Others

An existing issue, #4916, also mentions this problem. I am not sure whether llama-factory properly supports multi-node training.
There is communication latency between the nodes; try zero2.
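ZeRO-2 shards only optimizer states and gradients, not parameters, so it avoids ZeRO-3's per-layer parameter all-gathers across the slow inter-node link. A minimal ZeRO-2 config sketch is below; the file name and bucket sizes are illustrative, and if your LLaMA-Factory checkout already ships a ZeRO-2 file under examples/deepspeed, you can point --deepspeed at that instead.

```bash
# Write an illustrative ZeRO-2 DeepSpeed config ("auto" values are filled in
# by the HF Trainer integration from the training arguments).
cat > ds_z2_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
EOF

# Then swap the flag in the torchrun command:
#   --deepspeed examples/deepspeed/ds_z3_config.json
# becomes
#   --deepspeed ds_z2_config.json
```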