
Qwen2-VL-7B: loss drops to 0 when mixing image-text and text-only fine-tuning data #6159

Open
1 task done
VincentVanNF opened this issue Nov 27, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@VincentVanNF

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.19
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-80GB
  • DeepSpeed version: 0.14.4
  • vLLM version: 0.6.1

Reproduction

Image-text dataset size: 66081
Text-only dataset size: 423
Both datasets use the sharegpt format; their dataset_info entries are:

"data_mixed_unimodal": {
        "file_name": "data_mixed_unimodal.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "system_tag": "system"
        }
    },
"data_mixed_unimodal_txt": {
        "file_name": "data_mixed_unimodal_txt.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "system_tag": "system"
        }
}
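To make the two configurations concrete, here is a hedged sketch of what one record in each JSON file could look like under the sharegpt format assumed above (the field names match the configured tags; the file paths and message contents are made-up examples, not taken from the actual datasets). For Qwen2-VL-style multimodal data, each `<image>` placeholder in a message is expected to correspond to one entry in the `images` list, while text-only records simply omit `images`:

```python
import json

# Hypothetical sample records illustrating the assumed sharegpt layout.
# The image-text record pairs one <image> placeholder with one path in
# the "images" list; the text-only record has no "images" key at all.
multimodal_record = {
    "messages": [
        {"role": "user", "content": "<image>Describe this picture."},
        {"role": "assistant", "content": "A cat sitting on a windowsill."},
    ],
    "images": ["images/example.jpg"],  # made-up path for illustration
}

text_only_record = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ],
}

def check_record(record: dict) -> bool:
    """Sanity check: the number of <image> placeholders across all
    messages must equal the number of entries in "images" (0 if absent)."""
    n_placeholders = sum(m["content"].count("<image>")
                         for m in record["messages"])
    return n_placeholders == len(record.get("images", []))

if __name__ == "__main__":
    for rec in (multimodal_record, text_only_record):
        assert check_record(rec)
    print(json.dumps(multimodal_record, ensure_ascii=False, indent=2))
```

A mismatch between `<image>` placeholder counts and image lists is one thing worth ruling out before blaming the mixing itself, so a quick pass of a validator like this over both files can narrow the search.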

The training command is:

dataset="data_mixed_unimodal,data_mixed_unimodal_txt"
DS_CONFIG_PATH=${BASE_PATH}/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
torchrun $DISTRIBUTED_ARGS src/train.py \
--deepspeed $DS_CONFIG_PATH \
--stage sft \
--do_train \
--model_name_or_path $MODEL_PATH \
--dataset_dir $DATASET \
--dataset $dataset \
--template qwen2_vl \
--finetuning_type full \
--output_dir $OUTPUT_PATH \
--overwrite_cache \
--overwrite_output_dir \
--warmup_ratio 0.1 \
--weight_decay 0.08 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--ddp_timeout 18000000 \
--learning_rate 4e-6 \
--lr_scheduler_type cosine \
--logging_steps 200 \
--cutoff_len ${CUT_OFF} \
--save_strategy epoch \
--plot_loss \
--num_train_epochs 3 \
--bf16 \
--image_resolution 448 \
--fix_embedding False \
--fix_vit False \
--attn_implementation $attn_implementation \
--report_to none

Expected behavior

I expected mixed training on text-only and multimodal inputs to work normally, but the loss dropped to 0 and the model only outputs question marks at inference time. The log file shows:

{'loss': 2.4421278141604166e+27, 'grad_norm': nan, 'learning_rate': 2.5641025641025644e-06, 'epoch': 0.19}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9902938328141285e-06, 'epoch': 0.38}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.896854514436596e-06, 'epoch': 0.58}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.7086363653163876e-06, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.4350439528372386e-06, 'epoch': 0.96}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.0897476817442102e-06, 'epoch': 1.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.690000734389941e-06, 'epoch': 1.35}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.2557769927853283e-06, 'epoch': 1.54}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.8087730173132427e-06, 'epoch': 1.73}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.371323949551559e-06, 'epoch': 1.92}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.652875075468517e-07, 'epoch': 2.12}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.109518361827841e-07, 'epoch': 2.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.2602178333604456e-07, 'epoch': 2.5}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.24734253865034e-07, 'epoch': 2.69}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.714684393284638e-08, 'epoch': 2.89}
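Note that `grad_norm` is already `nan` at the very first logged step (epoch 0.19), where the loss has exploded to ~2.4e+27, so the run was unhealthy from the start rather than degrading at the point the loss hit 0. As a small illustration (not LLaMA-Factory code), a helper like the following can scan trainer log entries of the shape above and report the first step at which training diverged:

```python
import math

def find_divergence(log_entries):
    """Return (index, field) of the first log entry showing divergence:
    a non-finite or exactly-zero loss, or a non-finite grad_norm."""
    for i, entry in enumerate(log_entries):
        loss = entry.get("loss")
        grad_norm = entry.get("grad_norm")
        if loss is not None and (not math.isfinite(loss) or loss == 0.0):
            return i, "loss"
        if grad_norm is not None and not math.isfinite(grad_norm):
            return i, "grad_norm"
    return None

# The first two entries from the log above, transcribed as dicts.
logs = [
    {"loss": 2.4421278141604166e+27, "grad_norm": float("nan"), "epoch": 0.19},
    {"loss": 0.0, "grad_norm": float("nan"), "epoch": 0.38},
]

# The very first entry is flagged via its NaN grad_norm, confirming the
# problem predates the zero-loss entries.
print(find_divergence(logs))
```

Since `logging_steps` is 200, the divergence could have started anywhere in the first 200 optimizer steps; logging more frequently (or checking the first few per-step losses) would help pinpoint whether a specific mixed batch triggers the NaN.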

What is causing this, and what do I need to do to train normally on mixed data?

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 27, 2024