
Qwen2-VL-7B: loss drops to 0 when mixing image-text and text-only fine-tuning data #6159

Open
1 task done
VincentVanNF opened this issue Nov 27, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@VincentVanNF

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.19
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-80GB
  • DeepSpeed version: 0.14.4
  • vLLM version: 0.6.1

Reproduction

Image-text dataset size: 66081
Text-only dataset size: 423
Both datasets use the sharegpt format; their dataset_info entries are:

"data_mixed_unimodal": {
        "file_name": "data_mixed_unimodal.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "system_tag": "system"
        }
    },
"data_mixed_unimodal_txt": {
        "file_name": "data_mixed_unimodal_txt.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "system_tag": "system"
        }
}
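To make the two configurations concrete, here is a hedged sketch of what one record in each JSON file could look like under the sharegpt format assumed above (the field names match the configured tags; the file paths and message contents are made-up examples, not taken from the actual datasets). For Qwen2-VL-style multimodal data, each `<image>` placeholder in a message is expected to correspond to one entry in the `images` list, while text-only records simply omit `images`:

```python
import json

# Hypothetical sample records illustrating the assumed sharegpt layout.
# The image-text record pairs one <image> placeholder with one path in
# the "images" list; the text-only record has no "images" key at all.
multimodal_record = {
    "messages": [
        {"role": "user", "content": "<image>Describe this picture."},
        {"role": "assistant", "content": "A cat sitting on a windowsill."},
    ],
    "images": ["images/example.jpg"],  # made-up path for illustration
}

text_only_record = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ],
}

def check_record(record: dict) -> bool:
    """Sanity check: the number of <image> placeholders across all
    messages must equal the number of entries in "images" (0 if absent)."""
    n_placeholders = sum(m["content"].count("<image>")
                         for m in record["messages"])
    return n_placeholders == len(record.get("images", []))

if __name__ == "__main__":
    for rec in (multimodal_record, text_only_record):
        assert check_record(rec)
    print(json.dumps(multimodal_record, ensure_ascii=False, indent=2))
```

A mismatch between `<image>` placeholder counts and image lists is one thing worth ruling out before blaming the mixing itself, so a quick pass of a validator like this over both files can narrow the search.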

The training command is:

dataset="data_mixed_unimodal,data_mixed_unimodal_txt"
DS_CONFIG_PATH=${BASE_PATH}/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
torchrun $DISTRIBUTED_ARGS src/train.py \
--deepspeed $DS_CONFIG_PATH \
--stage sft \
--do_train \
--model_name_or_path $MODEL_PATH \
--dataset_dir $DATASET \
--dataset $dataset \
--template qwen2_vl \
--finetuning_type full \
--output_dir $OUTPUT_PATH \
--overwrite_cache \
--overwrite_output_dir \
--warmup_ratio 0.1 \
--weight_decay 0.08 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--ddp_timeout 18000000 \
--learning_rate 4e-6 \
--lr_scheduler_type cosine \
--logging_steps 200 \
--cutoff_len ${CUT_OFF} \
--save_strategy epoch \
--plot_loss \
--num_train_epochs 3 \
--bf16 \
--image_resolution 448 \
--fix_embedding False \
--fix_vit False \
--attn_implementation $attn_implementation \
--report_to none

Expected behavior

I expected mixed training on text-only and multimodal inputs to work normally, but the loss dropped to 0 and the model only outputs question marks at inference time. The log file shows:

{'loss': 2.4421278141604166e+27, 'grad_norm': nan, 'learning_rate': 2.5641025641025644e-06, 'epoch': 0.19}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9902938328141285e-06, 'epoch': 0.38}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.896854514436596e-06, 'epoch': 0.58}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.7086363653163876e-06, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.4350439528372386e-06, 'epoch': 0.96}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.0897476817442102e-06, 'epoch': 1.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.690000734389941e-06, 'epoch': 1.35}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.2557769927853283e-06, 'epoch': 1.54}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.8087730173132427e-06, 'epoch': 1.73}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.371323949551559e-06, 'epoch': 1.92}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.652875075468517e-07, 'epoch': 2.12}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.109518361827841e-07, 'epoch': 2.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.2602178333604456e-07, 'epoch': 2.5}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.24734253865034e-07, 'epoch': 2.69}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.714684393284638e-08, 'epoch': 2.89}
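Note that `grad_norm` is already `nan` at the very first logged step (epoch 0.19), where the loss has exploded to ~2.4e+27, so the run was unhealthy from the start rather than degrading at the point the loss hit 0. As a small illustration (not LLaMA-Factory code), a helper like the following can scan trainer log entries of the shape above and report the first step at which training diverged:

```python
import math

def find_divergence(log_entries):
    """Return (index, field) of the first log entry showing divergence:
    a non-finite or exactly-zero loss, or a non-finite grad_norm."""
    for i, entry in enumerate(log_entries):
        loss = entry.get("loss")
        grad_norm = entry.get("grad_norm")
        if loss is not None and (not math.isfinite(loss) or loss == 0.0):
            return i, "loss"
        if grad_norm is not None and not math.isfinite(grad_norm):
            return i, "grad_norm"
    return None

# The first two entries from the log above, transcribed as dicts.
logs = [
    {"loss": 2.4421278141604166e+27, "grad_norm": float("nan"), "epoch": 0.19},
    {"loss": 0.0, "grad_norm": float("nan"), "epoch": 0.38},
]

# The very first entry is flagged via its NaN grad_norm, confirming the
# problem predates the zero-loss entries.
print(find_divergence(logs))
```

Since `logging_steps` is 200, the divergence could have started anywhere in the first 200 optimizer steps; logging more frequently (or checking the first few per-step losses) would help pinpoint whether a specific mixed batch triggers the NaN.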

What is causing this, and what do I need to do to train normally on mixed data?

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 27, 2024