Loss=0 and grad_norm=nan when fine-tuning llava-v1.5-7b using dpo.sh #359

Open
Na-nata opened this issue Dec 5, 2024 · 1 comment

Na-nata commented Dec 5, 2024

Why do I get 'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0', dtype=torch.float64) when fine-tuning llava-v1.5-7b with the DPO code from the LLaVA-NeXT repository? My training script is below, and I have verified that my training dataset is fine.

export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=${ARNOLD_RDMA_DEVICE}
export NCCL_SOCKET_IFNAME=lo
export NCCL_DEBUG=INFO

VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"
MID_RUN_NAME="llava-1.5-7b-dpo-v1"
############### Pretrain ################

# Stage 2
PROMPT_VERSION="v1"

#torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}"
ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node=4 --nnodes=1 --node_rank="${RANK}" --master_addr=30.246.96.60 --master_port=23456 \
    llava/train/train_dpo.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path "/model_weight/liuhaotian--llava-v1.5-7b.main.4481d270cc22fd5c4d1bb5df129622006ccd9234" \
    --version $PROMPT_VERSION \
    --dpo_alpha 1.0 --beta 0.1 --gamma 0 \
    --data_path=processed_data \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --mm_spatial_pool_stride 2 \
    --mm_resampler_type "spatial_pool" \
    --mm_spatial_pool_out_channels 1024 \
    --group_by_modality_length True \
    --image_aspect_ratio pad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "llava1_5_dpo/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 1 \
    --learning_rate 5e-7 \
    --weight_decay 0. \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to "none" \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True
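
One quick sanity check on the reported numbers, sketched in plain Python: the standard DPO sigmoid loss for a single preference pair is -log(sigmoid(beta * margin)), where the margin is the policy's chosen-minus-rejected log-probability gap minus the reference model's gap. This is only the textbook formula. It assumes nothing about how train_dpo.py actually computes or logs its loss, or about how --dpo_alpha and --gamma scale it. The function name dpo_sigmoid_loss and the sample log-probabilities are hypothetical; only beta=0.1 is taken from the script above.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Textbook DPO loss for one preference pair (hypothetical helper).

    Returns -log(sigmoid(beta * margin)), where margin compares how much
    more the policy prefers the chosen answer than the reference does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up log-probabilities, for illustration only.
# Policy identical to the reference: margin is 0, so the loss is ln 2.
at_init = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0, beta=0.1)
print(at_init, math.log(2))

# A single NaN log-probability makes the whole loss NaN.
with_nan = dpo_sigmoid_loss(float("nan"), -15.0, -12.0, -15.0, beta=0.1)
print(math.isnan(with_nan))
```

If the policy starts as a copy of the reference model, this formula gives ln 2 (about 0.693) on the first step, so a logged loss of exactly 0.0 is not what the formula alone would produce at that point. Checking which of the four log-probabilities is non-finite on the first batch may help narrow down where the NaN first appears.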

@weiaicunzai

Have you solved this?
