Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework that adds a dedicated visual comprehension stage for LMMs. ViCToR employs a learnable visual token pool and uses the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. By combining a visual token reconstruction loss with dense semantic supervision, ViCToR learns tokens that retain high visual detail, thereby strengthening the large language model's (LLM's) understanding of visual information.
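The replacement-and-reconstruction step can be sketched as follows. This is a minimal PyTorch/scipy sketch, not the released implementation: the cosine-similarity matching cost, the MSE objective, and names such as `token_pool`, `replace_ratio`, and `llm_outputs` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def replace_with_pool_tokens(visual_tokens, token_pool, replace_ratio=0.5):
    """visual_tokens: (N, D) vision-encoder outputs; token_pool: (P, D) learnable pool."""
    # Matching cost: negative cosine similarity (an assumed cost for illustration).
    cost = -(F.normalize(visual_tokens, dim=-1) @ F.normalize(token_pool, dim=-1).T)
    # Hungarian matching gives a one-to-one assignment between visual and pool tokens.
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)

    # Replace a random subset of the matched visual tokens with their pool counterparts.
    keep = torch.randperm(len(row))[: int(replace_ratio * len(row))]
    replaced = visual_tokens.clone()
    replaced[row[keep]] = token_pool[col[keep]]

    # Boolean mask marking which positions were replaced (used by the reconstruction loss).
    mask = torch.zeros(visual_tokens.size(0), dtype=torch.bool)
    mask[row[keep]] = True
    return replaced, mask


def reconstruction_loss(llm_outputs, original_tokens, mask):
    """Dense supervision: regress the original visual tokens at the replaced positions.
    MSE is an assumed choice here; the paper's exact objective may differ."""
    return F.mse_loss(llm_outputs[mask], original_tokens[mask].detach())
```

Because the Hungarian assignment is one-to-one, each selected visual token is paired with a distinct pool token, so the LLM must recover the original visual detail from context rather than copying a shared placeholder.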
After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED_I, and RealWorldQA benchmarks, respectively. We will release the code and model weights to facilitate reproducibility.
[2025/8/15] The paper, code and weights are released!💥
- Better model based on ViCToR
- Checkpoints of ViCToR-7B
- Training code for ViCToR
| Benchmark | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
|---|---|---|---|---|
| MMStar | 54.3 | 34.3 | 43.9 | 53.9 |
| RealWorldQA | 65.6 | 55.3 | 58.4 | 58.7 |
| MMBench^(CN,val) | 79.0 | 67.8 | – | – |
| OCRBench | 556 | 337 | 531 | 553 |
| POPE | 88.4 | 88.4 | 87.1 | 88.1 |
| MMMU | 48.9 | 37.0 | 43.1 | 49.0 |
| AI2D | 79.5 | 61.1 | 72.8 | 79.5 |
| MME | 2071 | 1781 | 1908 | 1854 |
| SEED^(I) | 75.7 | 68.2 | 72.5 | 73.6 |
```bash
git clone https://github.com/deepglint/Victor.git
cd Victor
conda create -n victor python=3.10 -y
conda activate victor
pip install --upgrade pip
pip install -e .
pip install flash-attn --no-build-isolation
```
Stage 1: Pretraining MLP
```bash
bash scripts/train/stage1_pretrain_siglip.sh
```
Stage 1.5: Pretraining ViCToR
```bash
bash scripts/train/stage1.5_caption_siglip_qwen_victor.sh
```
Stage 2: Instructional Finetuning
```bash
bash scripts/train/stage2_finetune_siglip_qwen.sh
```
```bibtex
@misc{xie2024crocpretraininglargemultimodal,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  eprint={2410.14332},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.14332},
}
```
We extend our deepest gratitude to the creators and contributors of the following projects:
- LLaVA-NeXT: The comprehensive codebase for training Vision-Language Models (VLMs). Their exceptional work has been instrumental to our research and development efforts.