
ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

arXiv: https://arxiv.org/abs/2410.14332 · Hugging Face


Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce an additional visual comprehension stage, realized as ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and uses the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. By further combining a visual token reconstruction loss with dense semantic supervision, ViCToR learns tokens that retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information.
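
The following is a minimal, self-contained sketch of the token-replacement idea described above. It is not the released ViCToR code: the pool size, embedding width, cosine-similarity cost, and the MSE stand-in for the reconstruction loss are assumptions made for illustration; only the learnable pool, the Hungarian matching, and the token replacement follow the description.

```python
# Sketch of Hungarian-matching token selection (illustrative assumptions,
# not the released ViCToR implementation).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

num_pool, num_patches, dim = 1024, 196, 768             # hypothetical sizes
pool = torch.nn.Parameter(torch.randn(num_pool, dim))   # learnable visual token pool
visual = torch.randn(num_patches, dim)                  # patch embeddings from the vision encoder

# Cost = 1 - cosine similarity, so minimizing cost selects semantically relevant tokens.
cost = 1.0 - F.cosine_similarity(visual[:, None, :], pool[None, :, :], dim=-1)

# Hungarian matching assigns one distinct pool token to each visual position.
rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
replacement = pool[torch.as_tensor(cols)]               # tokens used for visual token replacement

# A reconstruction loss would then pull predictions back toward the original
# visual features; plain MSE is used here purely as a stand-in for the paper's loss.
recon_loss = F.mse_loss(replacement, visual)
```

In the actual framework these pieces operate inside the LMM pretraining loop together with dense semantic supervision; the sketch only shows the matching mechanics.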

After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED-I, and RealWorldQA benchmarks, respectively. We release the code and model weights to facilitate reproducibility.

📜 News

[2025/8/15] The paper, code, and weights are released! 💥

👨‍💻 Todo

  • A better model based on ViCToR
  • Checkpoints of ViCToR-7B
  • Training code for ViCToR

🤖 Model Zoo

| Benchmark          | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
|--------------------|-----------|---------------|---------------|------|
| MMStar             | 54.3      | 34.3          | 43.9          | 53.9 |
| RealWorldQA        | 65.6      | 55.3          | 58.4          | 58.7 |
| MMBench (cn, val)  | 79.0      | 67.8          | –             | –    |
| OCRBench           | 556       | 337           | 531           | 553  |
| POPE               | 88.4      | 88.4          | 87.1          | 88.1 |
| MMMU               | 48.9      | 37.0          | 43.1          | 49.0 |
| AI2D               | 79.5      | 61.1          | 72.8          | 79.5 |
| MME                | 2071      | 1781          | 1908          | 1854 |
| SEED-I             | 75.7      | 68.2          | 72.5          | 73.6 |

📊 Visualization


Install

```bash
git clone https://github.com/deepglint/Victor.git
cd Victor
conda create -n victor python=3.10 -y
conda activate victor

pip install --upgrade pip
pip install -e .
pip install flash-attn --no-build-isolation
```
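
A quick post-install sanity check (a hypothetical one-liner, not part of the repo) confirms that PyTorch sees the GPU and that the flash-attn build imports cleanly:

```bash
# Prints the torch version and whether CUDA is available; fails if flash-attn did not build.
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```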

Training

Stage 1: Pretraining MLP

```bash
bash scripts/train/stage1_pretrain_siglip.sh
```

Stage 1.5: Pretraining ViCToR

```bash
bash scripts/train/stage1.5_caption_siglip_qwen_victor.sh
```

Stage 2: Instruction Finetuning

```bash
bash scripts/train/stage2_finetune_siglip_qwen.sh
```
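
The stages are sequential, so assuming each script's default checkpoint paths feed the next stage, the whole pipeline can be chained in one command:

```bash
# Run all three stages back to back; a failure in any stage stops the chain.
bash scripts/train/stage1_pretrain_siglip.sh && \
bash scripts/train/stage1.5_caption_siglip_qwen_victor.sh && \
bash scripts/train/stage2_finetune_siglip_qwen.sh
```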

Citation

```bibtex
@misc{xie2024crocpretraininglargemultimodal,
      title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
      author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
      year={2024},
      eprint={2410.14332},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.14332},
}
```

Acknowledgement

We extend our deepest gratitude to the creators and contributors of the following projects:

  1. LLaVA-NeXT: The comprehensive codebase for training Vision-Language Models (VLMs).

Their exceptional work has been instrumental to our research and development efforts.
