COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
🚧 Under Construction 🚧
Authors: Xindi Wu*, Hee Seung Hwang*, Polina Kirichenko, Olga Russakovsky
(* Denotes equal contribution)
We propose COMPACT, a data recipe that scales the capabilities of Multimodal Large Language Models (MLLMs) from atomic (k = 1) to complex (k > 1) compositional levels. By creating a training dataset with a balanced distribution of compositional complexity, COMPACT enables models to learn complex capabilities more efficiently.
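To make the atomic-to-complex scaling concrete, here is a minimal illustrative sketch; the questions and capability tags below are hypothetical examples for exposition, not samples from the released data. A k = 1 question exercises a single atomic capability, while a k = 3 question can only be answered by integrating three capabilities at once.

```python
# Hypothetical questions illustrating compositional complexity k
# (for exposition only; these are not actual COMPACT training samples).
examples = [
    {"capabilities": ["color"],                                   # k = 1 (atomic)
     "question": "What color is the umbrella?"},
    {"capabilities": ["counting", "color"],                       # k = 2
     "question": "How many red cups are on the table?"},
    {"capabilities": ["object recognition",                       # k = 3 (complex)
                      "spatial relationship",
                      "action recognition"],
     "question": "What is the person to the left of the bicycle doing?"},
]

for ex in examples:
    print(f"k = {len(ex['capabilities'])}: {ex['question']}")
```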
- [04/30] We have released the COMPACT data recipe for visual compositional tuning.
First, clone the repository and navigate to the project directory:
git clone https://github.com/princetonvisualai/compact.git
cd compact
To set up the environment for COMPACT training (following the LLaVA setup: https://github.com/haotian-liu/LLaVA):
conda create -n compact python=3.10 -y
conda activate compact
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
COMPACT provides a balanced distribution of training examples across different compositional complexity levels (k = 1, 2, 3). The datasets include:
- Compositional Tuning Data - Questions generated to require exactly k specific atomic capabilities
- Instruction Tuning Data - A random subset of LLaVA-665K for maintaining instruction-following capabilities (see the mixing sketch below)
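As a rough sketch of how the two sources can be combined into one training mix, the snippet below concatenates generated compositional data with a random instruction-tuning subset. The file names (`compositional_tuning.json`, `llava_v1_5_mix665k.json`), the subset size, and the output path are illustrative assumptions, not the released recipe.

```python
# Sketch of assembling a COMPACT-style training mix: generated compositional
# data plus a random subset of LLaVA-665K. File names, subset size, and the
# 1:1 pairing below are illustrative assumptions.
import json
import random

random.seed(0)

with open("compositional_tuning.json") as f:      # assumed output of the data generation step
    compositional = json.load(f)
with open("llava_v1_5_mix665k.json") as f:        # LLaVA-665K instruction tuning data
    llava_665k = json.load(f)

# Keep a random slice of LLaVA-665K so instruction-following ability is preserved.
instruction_subset = random.sample(llava_665k, k=min(len(compositional), len(llava_665k)))

mix = compositional + instruction_subset
random.shuffle(mix)

with open("compact_training_mix.json", "w") as f:
    json.dump(mix, f)
print(f"Wrote {len(mix)} training examples.")
```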
COMPACT divides visual capabilities into 10 atomic categories:
- Attribution: color, shape
- Recognition: object recognition, action recognition, text recognition, spatial recognition, counting
- Relation: spatial relationship, object interaction, scene understanding
Each question in our dataset explicitly requires the integration of k specific atomic capabilities.
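One simple way to realize this is to sample k distinct capabilities from the atomic pool and condition question generation on that combination. The sketch below only illustrates that idea: the capability names mirror the list above, but the prompt template is a placeholder, not the exact prompt used by compact/main.py.

```python
# Illustrative sketch: sample k atomic capabilities to target for one question.
import random

ATOMIC_CAPABILITIES = [
    # Attribution
    "color", "shape",
    # Recognition
    "object recognition", "action recognition", "text recognition",
    "spatial recognition", "counting",
    # Relation
    "spatial relationship", "object interaction", "scene understanding",
]

def sample_capability_combo(k: int, seed: int = 0) -> list[str]:
    """Pick k distinct atomic capabilities for one compositional question."""
    return random.Random(seed).sample(ATOMIC_CAPABILITIES, k)

combo = sample_capability_combo(k=3)
# Placeholder prompt for a question generator (illustrative only).
prompt = ("Write a question about the image that can only be answered by jointly "
          f"using these capabilities: {', '.join(combo)}.")
print(prompt)
```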
You can generate your own compositional training data by running:
python compact/main.py \
--image_dir path/to/images \
--output_dir path/to/output \
--k 3 \
--num_samples 1000 \
--api_key YOUR_GEMINI_API_KEY \
--processes 32
(🚧 Under Construction 🚧)
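Because the recipe balances the dataset across complexity levels, one convenient pattern is to invoke the generator once per k and keep the outputs separate. The wrapper below is a sketch under that assumption; it reuses the CLI flags shown above, while the equal per-level sample count and the output subdirectories are illustrative choices.

```python
# Sketch: generate a balanced mix across k = 1, 2, 3 by calling compact/main.py
# once per level. Paths and the API key placeholder must be filled in.
import subprocess

for k in (1, 2, 3):
    subprocess.run(
        [
            "python", "compact/main.py",
            "--image_dir", "path/to/images",
            "--output_dir", f"path/to/output/k{k}",   # one output directory per level (illustrative)
            "--k", str(k),
            "--num_samples", "1000",                  # equal per-level count (illustrative)
            "--api_key", "YOUR_GEMINI_API_KEY",
            "--processes", "32",
        ],
        check=True,  # stop if any level fails to generate
    )
```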
Train a model with the COMPACT dataset:
sh train_scripts/train_compact.sh
(🚧 Under Construction 🚧)
Evaluate your COMPACT-trained model on various benchmarks:
python evaluation/evaluate_model.py \
--model_path path/to/model \
--benchmark mmvet,mmstar,seedbench2plus,infovqa,textvqa,mme,cvbench,llava-wild
With only 10% of the data used in LLaVA-665K, COMPACT achieves comparable performance across standard benchmarks:
| Model | Data Size | InfoVQA | SeedBench2+ | MME | TextVQA | MM-Vet | CV-Bench | MMStar | LLaVA-W | Rel. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-665K | 665K | 20.80 | 41.72 | 1478.48 | 46.99 | 29.22 | 60.92 | 35.11 | 68.50 | 100.00 |
| COMPACT (ours) | 65K | 23.68 | 43.13 | 1379.94 | 44.37 | 31.74 | 55.28 | 36.13 | 64.50 | 100.18 |
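The Rel. (%) column can be reproduced by averaging COMPACT's per-benchmark scores relative to the LLaVA-665K baseline, assuming that is how the column is defined; the sketch below uses the numbers from the table and recovers 100.18%.

```python
# Reproduce the Rel. (%) column as the mean of per-benchmark score ratios
# (COMPACT / LLaVA-665K), using the numbers from the table above.
baseline = {"InfoVQA": 20.80, "SeedBench2+": 41.72, "MME": 1478.48, "TextVQA": 46.99,
            "MM-Vet": 29.22, "CV-Bench": 60.92, "MMStar": 35.11, "LLaVA-W": 68.50}
compact  = {"InfoVQA": 23.68, "SeedBench2+": 43.13, "MME": 1379.94, "TextVQA": 44.37,
            "MM-Vet": 31.74, "CV-Bench": 55.28, "MMStar": 36.13, "LLaVA-W": 64.50}

rel = 100 * sum(compact[b] / baseline[b] for b in baseline) / len(baseline)
print(f"Rel. = {rel:.2f}%")  # ~100.18
```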
If you find this repository useful for your research, please cite our paper:
@article{wu2025compact,
title={COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning},
author={Wu, Xindi and Hwang, Hee Seung and Kirichenko, Polina and Russakovsky, Olga},
journal={arXiv preprint arXiv:2504.21850},
year={2025}
}
