[Usage] Inference Speed Issue with LoRA Fine-tuned Model on ScienceQA #1763

Open · jinghanSunn opened this issue Nov 12, 2024 · 0 comments
jinghanSunn commented Nov 12, 2024

Hi Haotian,

Thank you for your incredible work on this project.

I am encountering an issue during inference. When I run inference on ScienceQA with the non-LoRA weights, it takes approximately 1 second per sample. However, when I switch to the LoRA fine-tuned model, inference slows down drastically to over 40 seconds per sample.

Here is the command I am using for fine-tuning (trained on 1 V100 with lora_r=4, bf16=False, tf32=False):

CUDA_VISIBLE_DEVICES=1 python3 llava/train/train.py \
    --lora_enable True --lora_r 4 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --model_name_or_path ./LLAVA-1.5/llava-v1.5-7b/ \
    --version v1 \
    --data_path ./playground/data/eval/scienceqa/llava_train_CQM-A.json \
    --image_folder ./data/ScienceQA/image/train/ \
    --vision_tower ./data/clip-vit-large-patch14-336/ \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --output_dir ./LLaVA-v1.5-7b-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Here is the command I am using for inference:

CUDA_VISIBLE_DEVICES=3 python3 -m llava.eval.model_vqa_science \
    --model-path ./LLaVA-v1.5-7b-lora/checkpoint-50000/ \
    --model-base ./LLAVA-1.5/llava-v1.5-7b/ \
    --question-file ./playground/data/eval/scienceqa/llava_test_CQM-A.json \
    --image-folder ./data/ScienceQA/image/test/ \
    --answers-file ./playground/data/eval/scienceqa/answers/llava-v1.5-7b-lora-50000.jsonl \
    --single-pred-prompt \
    --temperature 0 \
    --conv-mode vicuna_v1

Could you please help me understand why there is such a significant difference in inference speed between the two models?
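
In case it helps to narrow this down, here is a minimal sketch of how I would save a merged standalone checkpoint for an A/B comparison. It is based on my (possibly incorrect) understanding that load_pretrained_model applies the LoRA adapter on top of the base weights and merges it when both a LoRA checkpoint and a base model are given; the output directory name is just an example:

# Sketch only: save a merged checkpoint so the eval script can be pointed at it
# directly via --model-path, without --model-base. Based on my reading of
# llava.model.builder.load_pretrained_model; the save path below is hypothetical.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

lora_path = "./LLaVA-v1.5-7b-lora/checkpoint-50000/"   # LoRA checkpoint from training
base_path = "./LLAVA-1.5/llava-v1.5-7b/"               # base model used for --model-base
save_path = "./LLaVA-v1.5-7b-lora-merged/"             # example output directory

# load_pretrained_model loads the base weights, applies the adapter, and (as far
# as I can tell) merges it before returning the model.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    lora_path, base_path, get_model_name_from_path(lora_path), device_map="cpu"
)

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

If the merged model still takes about 40 seconds per sample, that would at least rule out the adapter-loading path as the cause.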

Thank you!

Screenshots:
[Screenshot 2024-11-12 144626: inference output attached]

adapter_config.json:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 256,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 4,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "down_proj",
    "o_proj",
    "q_proj",
    "gate_proj",
    "up_proj",
    "v_proj",
    "k_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false

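If I understand the standard LoRA scaling correctly, this config corresponds to a scaling factor of lora_alpha / r = 256 / 4 = 64. A quick way to double-check what is actually stored in the checkpoint (a sketch using the standard peft API; the path is the one from my setup):

# Sketch: inspect the saved adapter config with PEFT to confirm that the rank,
# alpha, and target modules match what was passed at training time.
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("./LLaVA-v1.5-7b-lora/checkpoint-50000/")
print(cfg.r, cfg.lora_alpha, sorted(cfg.target_modules))
# Expected from the adapter_config.json above: r=4, lora_alpha=256
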
config.json:

{
  "_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_patch_merge_type": "flat",
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "./data/clip-vit-large-patch14-336/",
  "model_type": "llava_llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2048,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vocab_size": 32000
}