🚀 Welcome to the repo of ViSA!
ViSA (Visual-Centric Data Selection with Collaborative Agents) is an open-source project designed to enhance visual data selection through collaborative agents.
- Model Release
- Data Release
- Code Release
To ensure smooth integration with external dependencies, we recommend setting up separate virtual environments for different components of the project.
conda create -n vllm python=3.11
conda activate vllm
pip install -r vllm_requirements.txt
Note: Due to existing bugs in the current vLLM main branch when using Qwen2-VL, we recommend using the vLLM dev branch instead.
conda create -n qwen_vllm python=3.11
conda activate qwen_vllm
pip install -r qwen_vllm_requirements.txt
conda create -n sam python=3.11
conda activate sam
pip install -r sam_requirements.txt
We provide a simple training environment for running experiments. However, we also encourage the use of more efficient training frameworks such as LLaMA-Factory.
pip install -r training_requirements.txt
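If you go the LLaMA-Factory route, the sketch below shows one possible setup; the install extras and the YAML path are assumptions based on the upstream documentation, so adapt the config (model path, ViSA data, output directory) to your own setup.
# Optional: fine-tune with LLaMA-Factory instead of the simple training environment above.
# The YAML path is a placeholder; write your own config pointing at the ViSA data.
git clone https://github.com/hiyouga/LLaMA-Factory.git && cd LLaMA-Factory
pip install -e ".[torch,metrics]"
llamafactory-cli train path/to/your_qwen2vl_sft_config.yaml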
We use the following large vision-language models as visual agents. Please download them manually before running the experiments (a download sketch follows the list):
- Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4
- OpenGVLab/InternVL2_5-78B-AWQ
- OpenGVLab/InternVL2_5-78B-MPO-AWQ
- llava-hf/llava-onevision-qwen2-72b-ov-chat-hf
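As one possible way to fetch these checkpoints, the sketch below uses the Hugging Face CLI; the local target directories are assumptions, so point --local-dir wherever your scripts expect the models.
# Example download of the agent models (target directories are placeholders).
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4 --local-dir models/Qwen2-VL-72B-Instruct-GPTQ-Int4
huggingface-cli download OpenGVLab/InternVL2_5-78B-AWQ --local-dir models/InternVL2_5-78B-AWQ
huggingface-cli download OpenGVLab/InternVL2_5-78B-MPO-AWQ --local-dir models/InternVL2_5-78B-MPO-AWQ
huggingface-cli download llava-hf/llava-onevision-qwen2-72b-ov-chat-hf --local-dir models/llava-onevision-qwen2-72b-ov-chat-hf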
We rely on the following open-source projects. Please install them according to their official guidelines:
conda activate sam
# install sam2
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
# install grounded-sam2
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git && cd Grounded-SAM-2
pip install -e .
pip install --no-build-isolation -e grounding_dino
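After both installs finish, an optional sanity check like the one below can confirm the editable installs are importable; the module names (sam2, groundingdino) are taken from the upstream repositories and may differ if their packaging changes.
# Optional sanity check for the SAM2 / Grounded-SAM-2 installs.
python -c "import sam2; print('sam2 OK')"
python -c "import groundingdino; print('grounding_dino OK')"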
We provide five reference scripts for data selection. Before running them, please ensure that all necessary parameters (e.g., model paths, save directories) are correctly specified.
conda activate sam
bash Scrpit/SC_score.sh
conda activate sam
bash Scrpit/OA_score.sh
conda activate vllm # activate qwen_vllm instead when scoring with Qwen2-VL
bash Scrpit/DP_score.sh
conda activate vllm # activate qwen_vllm instead when scoring with Qwen2-VL
bash Scrpit/PT_IM_score.sh
You can download our dataset here. We provide two versions of the data: ViSA-LlavaOV-80K and ViSA-LlavaOV-700K.
The 80K dataset can be used for small-scale multimodal model alignment or replicating the experiments in our paper, while the 700K dataset is suitable for large-scale multimodal model alignment.
Due to capacity limitations for new accounts on Hugging Face, we are temporarily unable to upload data containing images. To obtain the image data, please download the original LLaVA-OneVision dataset.
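As a rough sketch, both the selection annotations and the images could be fetched with the Hugging Face CLI as below; the repository ids are placeholders, so substitute the actual ViSA dataset and the official LLaVA-OneVision release before running.
# Placeholder repo ids: replace with the actual ViSA dataset and the official LLaVA-OneVision image release.
huggingface-cli download --repo-type dataset <ViSA-dataset-repo-id> --local-dir data/ViSA
huggingface-cli download --repo-type dataset <LLaVA-OneVision-repo-id> --local-dir data/LLaVA-OneVision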
Our visual-semantic alignment models, based on the Qwen2-VL-2B architecture, are available for academic research (a download sketch follows the list):
- Qwen2-VL-2B-ViSA-80K: Trained on ViSA-LlavaOV-80K dataset, specifically calibrated for reproducing experimental results in our publication.
- Qwen2-VL-2B-Instruction-ViSA-700K: Enhanced through ViSA-LlavaOV-700K training, demonstrating superior multi-modal reasoning compared to its base instruction model.
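For reference, the released checkpoints can be fetched the same way as the agent models above; the organization name below is a placeholder, so substitute the actual Hugging Face namespace from the release page.
# Placeholder namespace: replace <org> with the account hosting the ViSA checkpoints.
huggingface-cli download <org>/Qwen2-VL-2B-ViSA-80K --local-dir models/Qwen2-VL-2B-ViSA-80K
huggingface-cli download <org>/Qwen2-VL-2B-Instruction-ViSA-700K --local-dir models/Qwen2-VL-2B-Instruction-ViSA-700K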
(WIP) We will publish the detailed evaluation soon.
For any questions, issues, or contributions, feel free to open an issue or submit a pull request.
If you find our model/code/paper helpful, please consider citing our paper 📝 and giving us a star ⭐️!
@article{liu2025picking,
  title={Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents},
  author={Liu, Zhenyu and Li, Yunxin and Hu, Baotian and Luo, Wenhan and Wang, Yaowei and Zhang, Min},
  journal={arXiv preprint arXiv:2502.19917},
  year={2025}
}