Anyres compatible fine-tuning of llava-1.6 mistral 7b and 34b #1347
base: main
Conversation
That is a good PR. I am fine-tuning with this PR, thanks!
Ofc, glad you found it useful! I'm sure the author's version is far superior (<3 llava), but I wanted to leave this here for others to use until we get the real magic :)
@arielnlee I encountered an issue during training. I am using the LoRA fine-tuning method, and my data consists of two parts, one of which is a large amount of pure-text question-answering dialogue. During training, I found that the pure-text part of the dataset trains very slowly, just as slowly as the multimodal part. After investigation, I traced the cause to the `__getitem__` method of the `LazySupervisedDataset` class in `train.py`. When I delete this code:

```python
elif self.data_args.is_multimodal:
    # image does not exist in the data, but the model is multimodal
    crop_size = self.data_args.image_processor.crop_size
    data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
    data_dict['image_size'] = crop_size
return data_dict
```

the training process errors:

```
Traceback (most recent call last):
  File "/export/App/training_platform/PinoModel/LLaVA/llava/train/train_mem.py", line 9, in <module>
    train(attn_implementation="flash_attention_2")
  File "/export/App/training_platform/PinoModel/LLaVA/llava/train/train.py", line 1092, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2744, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1964, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2152, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

How can I make the pure-text stage train fast? Is that possible???

Finally, I trained a good llava model.
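For anyone hitting the same RuntimeError, the sketch below is my reading of why LLaVA attaches a zero image to text-only samples; it is not the PR author's explanation. Concatenating a zero-length slice of the projected dummy-image features keeps the multimodal projector inside the autograd graph on every step; drop the dummy image and a pure-text batch can produce a loss with no `grad_fn`, which matches the traceback above.

```python
# Minimal sketch (assumption, simplified from LLaVA's multimodal input prep):
# appending a 0-length slice of projected dummy-image features ties the
# projector into the graph without changing the sequence contents.
import torch
import torch.nn as nn

projector = nn.Linear(4, 4)                 # stands in for mm_projector
text_embeds = torch.randn(5, 4)             # frozen text embeddings, no grad
dummy_feats = projector(torch.zeros(3, 4))  # encode the dummy "image"

# Text-only sample: the empty slice contributes nothing to the output,
# but the result now requires grad through the projector.
inputs = torch.cat([text_embeds, dummy_feats[0:0]], dim=0)
loss = inputs.sum()
loss.backward()                             # succeeds; grads are all zero
print(projector.weight.grad.abs().sum())    # tensor(0.)
```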
I got adapter_model.safetensors instead of adapter_model.bin after LoRA fine-tuning of 1.6-mistral, and I get an error when trying to merge the model.
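If the blocker is just the file format, the safetensors adapter can be re-saved in the legacy `.bin` format some older merge scripts expect. A minimal sketch, assuming only the serialization format differs; filenames are illustrative:

```python
import torch
from safetensors.torch import load_file

# Re-serialize a PEFT adapter saved as safetensors into the pickle-based
# format expected by scripts that look for adapter_model.bin.
state_dict = load_file("adapter_model.safetensors")
torch.save(state_dict, "adapter_model.bin")
```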
@rohithbojja Maybe the model_path is wrong. Please share your model_path and model_base args.
```bash
nohup python scripts/merge_lora_weights.py --model-path=../checkpoints/llava-v1.6-34b-xxx-lora-5000 --model-base=../checkpoints/llava-v1.6-34b --save-model-path=../checkpoints/llava-v1.6-34b-xxx-5000 &
```
I've fixed it by adding "lora" to model-path.
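For anyone wondering why the directory name matters: as I understand it, the loader in `llava/model/builder.py` picks its code path from the checkpoint name, so an adapter directory without "lora" in it is loaded as if it were a complete model. A hypothetical toy version of that dispatch (simplified; the real function loads and merges weights):

```python
def dispatch_load(model_name: str, model_base: str | None) -> str:
    # Toy paraphrase (assumption) of the name-based dispatch in builder.py.
    if "lora" in model_name.lower() and model_base is not None:
        return "load base, then apply and merge the LoRA adapter"
    if model_base is not None:
        return "load base plus extra (non-LoRA) weights"
    return "load a full standalone checkpoint"

print(dispatch_load("llava-v1.6-34b-xxx-5000", "llava-v1.6-34b"))       # no merge
print(dispatch_load("llava-v1.6-34b-xxx-lora-5000", "llava-v1.6-34b"))  # merge
```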
Can you please provide an example of your training data?
I was wondering why you chose to add a new conversation format. I was trying to tune based on your PR with my existing data made for LLaVA 1.5 fine-tuning, which uses the 'v1' version, but I'm currently running into tokenizer length mismatch issues.
Check out my Wandb logs: https://wandb.ai/21b81a66a5/huggingface/runs/4pslu1px/overview?nw=nwuser21b81a66a5 And my notebook used to train: https://colab.research.google.com/drive/10OG4JsmSZ6kd8pyDxxhjHWkhK2ZOgVH4
Can anyone share the filtered_dataset json for the 34b training?
Yours,
Alex
On Apr 24, 2024 at 4:42 AM -0700, Rohith Bojja wrote:
```bash
#!/bin/bash
deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 16 --lora_alpha 32 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /home/rohith/llava-v1.6-mistral-7b-bnb-4bit/ \
    --version mistral_instruct \
    --data_path /home/rohith/Desktop/vqa/vqa/images/filtered_dataset.json \
    --image_folder /home/rohith/Desktop/vqa/vqa/images/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --mm_patch_merge_type spatial_unpad \
    --image_aspect_ratio anyres \
    --group_by_modality_length False \
    --bf16 False \
    --fp16 True \
    --output_dir /home/rohith/LLaVA-1.6-ft/llava_lora_mistral_med/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```
Using this script gives me an error:

```
ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
```

Using the original model doesn't give any error. I used the panoyo9829/llava-v1.6-mistral-7b-bnb-4bit model.
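A note on the ValueError above, from my reading of transformers' bitsandbytes guard rather than anything established in this thread: the exception fires whenever `.to(...)` is called on a model that was loaded already quantized, which is what happens when a pre-quantized 4-bit dump is passed to a training script that moves or casts the model. The usual QLoRA-style pattern is to quantize the base model at load time and prepare it for k-bit training instead. A hedged sketch with a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Quantize at load time rather than loading a pre-quantized dump and
# letting the trainer try to .to(...) it, which raises the ValueError.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-base-model",   # placeholder, not a real checkpoint
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # no .to(...) afterwards
```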
@findalexli https://drive.google.com/file/d/1gYLOFaz7Mn-E2u9ksT0R2BOai7MnmNcm/view?usp=drivesdk It has the following structure. One image is truncated; I used a snippet like this to find it (the function body was cut off in this comment, so the loop below is a reconstruction):

```python
import os
from PIL import Image

trunk_ = 0

def check_for_truncated_images(directory, trunk_):
    # reconstructed body: fully load each image so PIL raises on truncation
    for name in os.listdir(directory):
        try:
            Image.open(os.path.join(directory, name)).load()
        except OSError:
            print(name)
            trunk_ += 1
    return trunk_

directory_path = '/workspace/vqa/images'
```

Also remove the entry in the json. Good luck.
Hi, thanks for working on a private version of anyres llava. I have fine-tuned vicuna-v1.5-7b with anyres / spatial_unpad in the same configuration as above, but the result doesn't seem to work out well on lmms-eval, with an MME score of 357 / 224 (LLaVA-v1.5-7B: 1519 / 332). Have you done any evaluation on public benchmarks and gotten similar scores?
Hi! Thanks for sharing!
@findalexli Hi there, did you find out why? I used the new template and ran into tokenization mismatch errors, so I'm going to try with v1 now. Let me know if you managed to fine-tune a LLaVA 1.6! :)
Low-rank fine-tuning with anyres for the LLaVA Next models :)