Training with more tokens #5

Open
Jingfeng0705 opened this issue Nov 25, 2024 · 3 comments

@Jingfeng0705

Hello, authors! Although QueCC is designed for very low token counts, I am curious whether you have tested it in a higher token-count range (the highest in the paper is 36). When I trained with 144 tokens (using the same training hyperparameters as for 36), I observed spikes in the loss. Have you tried training with 144 tokens? Thank you!
[Screenshot: training loss curve with spikes for the 144-token run (2024-11-25)]

@kevinli573
Collaborator

We didn't test anything higher than what was reported in the paper, since the design was motivated by our scaling laws, which show that for visual reasoning and understanding tasks, fewer visual tokens with more LLM parameters is more compute-optimal.

We build upon the TokenPacker (https://github.com/CircleRadon/TokenPacker) compression algorithm, which does report results at 144 tokens.
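
As a rough illustration of that trade-off (the numbers below are illustrative, not from the paper), using the standard approximation that a transformer forward pass costs about 2 * n_params FLOPs per token:

def llm_forward_flops(n_params, n_visual_tokens, n_text_tokens=256):
    # Approximate forward-pass FLOPs for one multimodal sample:
    # roughly 2 * parameters * total sequence length.
    return 2 * n_params * (n_visual_tokens + n_text_tokens)

# Fewer visual tokens leave compute headroom for a larger LLM at a
# similar per-sample cost:
print(llm_forward_flops(7e9, 36))    # ~4.1e12 FLOPs: 36 visual tokens, 7B LLM
print(llm_forward_flops(2e9, 576))   # ~3.3e12 FLOPs: 576 visual tokens, 2B LLM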

@SachinG007
Collaborator

The increase in loss after 2k steps is quite unexpected; I wouldn't expect it at 144 tokens either. Can you try checking your setup with, say, 36 tokens?

@Jingfeng0705
Author

Hi, happy Thanksgiving, and many thanks for your responses! The training script I used is attached here:


python script_name.py \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-clanf-pretrain-1e-3-noprompt-144 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-clanf-lora-2e-4-1e-5-noprompt-144 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name CLanF_finetune_lora_noprompt_2e-4_1e-5_144 \
    --mm_vision_token_compression_type quecc \
    --mm_vision_output_text_embedding_size 4096 \
    --mm_vision_output_token_count 576 \
    --mm_vision_token_compression_kernel_size 2 \
    --mm_vision_token_compression_stride 2

It is trained on 8 RTX 3090s, so the total gpu_num * batch_per_gpu * gradient_accumulation_steps is the same as in your setup. There are two main differences from the QueCC implementation: (1) it is fine-tuned with LoRA, and (2) I removed the step that adds the prompt embedding to the downsampled x, since I want to run some ablations. The same script is used for the 36-token training (except that kernel size and stride are 4), and I got pretty good results:
[Screenshot: training loss curve for the 36-token run (2024-11-28)]
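
For reference, here is a minimal sketch (my own illustration, not the QueCC code) of how the kernel size and stride map the 24x24 = 576 CLIP ViT-L/14-336 patch tokens to the compressed token count:

def compressed_token_count(grid_side=24, kernel_size=2, stride=2):
    # Tokens remaining after sliding a kernel_size x kernel_size window
    # over the patch grid with the given stride (no padding).
    out_side = (grid_side - kernel_size) // stride + 1
    return out_side * out_side

print(compressed_token_count(kernel_size=2, stride=2))  # 144 tokens
print(compressed_token_count(kernel_size=4, stride=4))  # 36 tokens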
