Training with more tokens #5

Open
Jingfeng0705 opened this issue Nov 25, 2024 · 3 comments

@Jingfeng0705

Hello, authors! Although QueCC is designed for very low token counts, I am curious whether you have tested it in a higher token-count range (the highest in the paper is 36). When I trained with 144 tokens (using the same training hyperparameters as for 36), I observed spikes in the loss. Have you tried training with 144 tokens? Thank you!
[Screenshot: training loss curve with spikes for the 144-token run (2024-11-25)]

@kevinli573
Collaborator

We didn't test anything higher than what was reported in the paper, since the design was motivated by our scaling laws, which show that for visual reasoning and understanding tasks, fewer visual tokens with more LLM parameters is more compute-optimal.

We build upon the TokenPacker (https://github.com/CircleRadon/TokenPacker) compression algorithm, which does report results at 144 tokens.
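
As a rough illustration of that trade-off (the numbers below are illustrative, not from the paper), using the standard approximation that a transformer forward pass costs about 2 * n_params FLOPs per token:

def llm_forward_flops(n_params, n_visual_tokens, n_text_tokens=256):
    # Approximate forward-pass FLOPs for one multimodal sample:
    # roughly 2 * parameters * total sequence length.
    return 2 * n_params * (n_visual_tokens + n_text_tokens)

# Fewer visual tokens leave compute headroom for a larger LLM at a
# similar per-sample cost:
print(llm_forward_flops(7e9, 36))    # ~4.1e12 FLOPs: 36 visual tokens, 7B LLM
print(llm_forward_flops(2e9, 576))   # ~3.3e12 FLOPs: 576 visual tokens, 2B LLM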

@SachinG007
Collaborator

The increase in loss after 2k steps is quite unexpected; I wouldn't expect it at 144 tokens either. Can you try checking your setup with, say, 36 tokens?

@Jingfeng0705
Author

Hi, happy Thanksgiving, and many thanks for your responses! The training script I used is attached here:


python script_name.py \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-clanf-pretrain-1e-3-noprompt-144 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-clanf-lora-2e-4-1e-5-noprompt-144 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name CLanF_finetune_lora_noprompt_2e-4_1e-5_144 \
    --mm_vision_token_compression_type quecc \
    --mm_vision_output_text_embedding_size 4096 \
    --mm_vision_output_token_count 576 \
    --mm_vision_token_compression_kernel_size 2 \
    --mm_vision_token_compression_stride 2

It is trained on 8 RTX 3090s, so the total gpu_num * batch_per_gpu * gradient_accumulation_steps is the same as in your setup. There are two main differences from the QueCC implementation: (1) it is fine-tuned with LoRA, and (2) I removed the step that adds the prompt embedding to the downsampled x, since I want to run some ablations. The same script is used for the 36-token training (except that kernel size and stride are 4), and I got pretty good results:
[Screenshot: training loss curve for the 36-token run (2024-11-28)]
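
For reference, here is a minimal sketch (my own illustration, not the QueCC code) of how the kernel size and stride map the 24x24 = 576 CLIP ViT-L/14-336 patch tokens to the compressed token count:

def compressed_token_count(grid_side=24, kernel_size=2, stride=2):
    # Tokens remaining after sliding a kernel_size x kernel_size window
    # over the patch grid with the given stride (no padding).
    out_side = (grid_side - kernel_size) // stride + 1
    return out_side * out_side

print(compressed_token_count(kernel_size=2, stride=2))  # 144 tokens
print(compressed_token_count(kernel_size=4, stride=4))  # 36 tokens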
