ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

om-ai-lab/ZoomEye

🔍 ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Zoom Eye enables MLLMs to (a) answer the question directly when the visual information is adequate, (b) zoom in gradually for a closer examination, and (c) zoom out to the previous view and explore other regions if the desired information is not initially found.
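
The three behaviours above can be illustrated with a small depth-first search over a tree of image regions. This is only a toy sketch, not the authors' implementation: the quadrant split, the `can_answer` and `relevance` callbacks (stand-ins for whatever the MLLM would judge), and all names below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Node:
    box: Box
    depth: int

    def children(self) -> List["Node"]:
        """Split this view into four quadrants, one zoom level deeper (assumed split rule)."""
        l, t, r, b = self.box
        cx, cy = (l + r) // 2, (t + b) // 2
        quads = [(l, t, cx, cy), (cx, t, r, cy), (l, cy, cx, b), (cx, cy, r, b)]
        return [Node(q, self.depth + 1) for q in quads]


def explore(
    node: Node,
    can_answer: Callable[[Box], bool],   # hypothetical: "is this view adequate?"
    relevance: Callable[[Box], float],   # hypothetical: "how promising is this view?"
    max_depth: int = 3,
    min_size: int = 2,
) -> Optional[Node]:
    # (a) Answer directly when the current view is adequate.
    if can_answer(node.box):
        return node
    l, t, r, b = node.box
    if node.depth >= max_depth or min(r - l, b - t) < min_size:
        return None
    # (b) Zoom in gradually, trying the most promising sub-view first.
    for child in sorted(node.children(), key=lambda c: relevance(c.box), reverse=True):
        found = explore(child, can_answer, relevance, max_depth, min_size)
        if found is not None:
            return found
    # (c) Nothing found here: returning None zooms out to the parent view,
    # which then moves on to its next candidate region.
    return None


# Toy stand-ins: a small object at pixel (700, 100) of a 1024x1024 image,
# "visible" only once the view is at most 256 pixels wide.
TARGET = (700, 100)


def contains_target(box: Box) -> bool:
    l, t, r, b = box
    return l <= TARGET[0] < r and t <= TARGET[1] < b


def can_answer(box: Box) -> bool:
    return contains_target(box) and (box[2] - box[0]) <= 256


def relevance(box: Box) -> float:
    return 1.0 if contains_target(box) else 0.0


found = explore(Node((0, 0, 1024, 1024), 0), can_answer, relevance)
print(found)
```

With an uninformative `relevance` (for example, one that always returns `0.0`), the same search still reaches the target, but only after exhausting the wrong quadrant and backing out of it, which is the zoom-out case.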

📜 Updates

  • 2025.01.01 🌟 We released the Project Page of ZoomEye. You are welcome to visit!
  • 2025.01.01 🌟 We released the evaluation code for MME-RealWorld.
  • 2024.11.30 🌟 We released the evaluation code for V* Bench and HR-Bench.
  • 2024.11.25 🌟 We released the arXiv paper.

🛠️ Installation

This project is built on LLaVA-Next. If you run into errors during installation, the issues and solutions in that repository may help.

1. Clone this repository

git clone https://github.com/om-ai-lab/ZoomEye.git
cd ZoomEye

2. Install dependencies

conda create -n zoom_eye python=3.10 -y
conda activate zoom_eye
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

📚 Preparation

1. MLLM checkpoints

We implement Zoom Eye with the LLaVA-v1.5 and LLaVA-OneVision (ov) series. You can download these checkpoints before running, or let the from_pretrained method in transformers download them automatically on first use.

2. Evaluation data

The core evaluation data (including V* Bench and HR-Bench) has been packaged together, and the link is provided here. After downloading, unzip it; the path of the unzipped folder is referred to below as <anno path>.

[Optional] To evaluate ZoomEye on the MME-RealWorld benchmark, follow the instructions in this repository to download the images and extract them to the <anno path>/mme-realworld directory. Then place the annotation_mme-realworld.json file from this link into <anno path>/mme-realworld.

The resulting folder tree should look like this:

zoom_eye_data
  ├── hr-bench_4k
  │   ├── annotation_hr-bench_4k.json
  │   └── images/
  │       ├── some.jpg
  │       └── ...
  ├── hr-bench_8k
  │   ├── annotation_hr-bench_8k.json
  │   └── images/
  │       ├── some.jpg
  │       └── ...
  ├── vstar
  │   ├── annotation_vstar.json
  │   ├── direct_attributes/
  │   │   ├── some.jpg
  │   │   └── ...
  │   └── relative_positions/
  │       ├── some.jpg
  │       └── ...
  ├── mme-realworld
  │   ├── annotation_mme-realworld.json
  │   ├── AutonomousDriving/
  │   ├── MME-HD-CN/
  │   ├── monitoring_images/
  │   ├── ocr_cc/
  │   └── remote_sensing/
  ...
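
Before launching an evaluation, it can help to confirm that the unzipped data matches this layout. The helper below is a hypothetical convenience script, not part of the repository; it checks only the file and directory names shown in the tree and knows nothing about the image files inside them.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union

# Expected entries per benchmark, copied from the folder tree in this README.
EXPECTED: Dict[str, List[str]] = {
    "hr-bench_4k": ["annotation_hr-bench_4k.json", "images"],
    "hr-bench_8k": ["annotation_hr-bench_8k.json", "images"],
    "vstar": ["annotation_vstar.json", "direct_attributes", "relative_positions"],
    "mme-realworld": [
        "annotation_mme-realworld.json",
        "AutonomousDriving",
        "MME-HD-CN",
        "monitoring_images",
        "ocr_cc",
        "remote_sensing",
    ],
}


def missing_entries(
    anno_path: Union[str, Path], benchmarks: Optional[List[str]] = None
) -> List[str]:
    """Return the expected entries (relative to anno_path) that do not exist."""
    root = Path(anno_path)
    names = benchmarks if benchmarks is not None else list(EXPECTED)
    missing: List[str] = []
    for bench in names:
        for entry in EXPECTED[bench]:
            if not (root / bench / entry).exists():
                missing.append((Path(bench) / entry).as_posix())
    return missing
```

For example, `missing_entries("<anno path>", ["vstar", "hr-bench_4k"])` returns an empty list when both benchmarks are in place; mme-realworld can be left out of the list if the optional data was not downloaded.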

🚀 Evaluation

1. Run the demo

We provide a demo script that runs Zoom Eye on any input image-question pair.

python ZoomEye/demo.py \
    --model-path lmms-lab/llava-onevision-qwen2-7b-ov \
    --input_image demo/demo.jpg \
    --question "What is the color of the soda can?"

The zoomed views produced by Zoom Eye are saved to the demo folder.

2. Run the Gradio Demo

We also provide a Gradio demo. Run the script and open http://127.0.0.1:7860/ in your browser.

python gdemo_gradio.py 

3. Results of V* Bench

# After this script finishes, the results are saved in the answers dir: ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
vstar

# Get the result
python ZoomEye/eval/eval_results_vstar.py --answers-file ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl

Here <mllm model> is one of the MLLM checkpoints listed above, and <anno path> is the path to the unzipped evaluation data.

If you do not have a multi-GPU environment, set CUDA_VISIBLE_DEVICES=0.
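
The answers path used by the commands in this section follows the pattern ZoomEye/eval/answers/<bench name>/<mllm model base name>/merge.jsonl. The sketch below builds that path in Python. It assumes that the "model base name" is the last component of the model path or Hugging Face repo id; that is a guess about the scripts' behaviour, so check the actual output directory after a run. The function names are hypothetical.

```python
from pathlib import PurePosixPath

ANSWERS_ROOT = PurePosixPath("ZoomEye/eval/answers")


def model_base_name(model_path: str) -> str:
    # ASSUMPTION: the base name is the final path component, e.g.
    # "lmms-lab/llava-onevision-qwen2-7b-ov" -> "llava-onevision-qwen2-7b-ov".
    return PurePosixPath(model_path.rstrip("/")).name


def answers_file(model_path: str, benchmark: str) -> str:
    """Path of merge.jsonl for a given model and benchmark, per the pattern above."""
    return str(ANSWERS_ROOT / benchmark / model_base_name(model_path) / "merge.jsonl")


print(answers_file("lmms-lab/llava-onevision-qwen2-7b-ov", "vstar"))
```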

4. Results of HR-Bench 4k

# After this script finishes, the results are saved in the answers dir: ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_4k

# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl

5. Results of HR-Bench 8k

# After this script finishes, the results are saved in the answers dir: ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_8k

# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl

6. Results for MLLMs with direct answering

# After this script finishes, the results are saved in the answers dir: ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
python ZoomEye/eval/perform_zoom_eye.py \
    --model-path <mllm model> \
    --annotation_path <anno path> \
    --benchmark <bench name> \
    --direct-answer

# Get the result
python ZoomEye/eval/eval_results_{vstar/hr-bench}.py --answers-file ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl

7. Results of MME-RealWorld

# After this script finishes, the results are saved in the answers dir: ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
mme-realworld

# Get the result
python ZoomEye/eval/eval_results_mme-realworld.py --answers-file ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
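
Each benchmark name in this section pairs with one scoring script. As a convenience, that pairing can be written down once and used to assemble the scoring command. The mapping below is taken from the commands in this README; the helper itself (`eval_command`) is a hypothetical wrapper, not something the repository provides.

```python
from typing import Dict, List

# Benchmark name -> scoring script, as used in the commands in this README.
EVAL_SCRIPTS: Dict[str, str] = {
    "vstar": "ZoomEye/eval/eval_results_vstar.py",
    "hr-bench_4k": "ZoomEye/eval/eval_results_hr-bench.py",
    "hr-bench_8k": "ZoomEye/eval/eval_results_hr-bench.py",
    "mme-realworld": "ZoomEye/eval/eval_results_mme-realworld.py",
}


def eval_command(benchmark: str, answers_file: str) -> List[str]:
    """Build the argv list for scoring one answers file."""
    try:
        script = EVAL_SCRIPTS[benchmark]
    except KeyError:
        known = ", ".join(sorted(EVAL_SCRIPTS))
        raise ValueError(f"unknown benchmark {benchmark!r}; expected one of: {known}") from None
    return ["python", script, "--answers-file", answers_file]
```

The returned list can be passed directly to `subprocess.run(..., check=True)` from the repository root, which avoids shell quoting problems when the answers path contains spaces.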

🔗 Related works

If you are interested in multimodal large language models and agent technologies, we invite you to explore our related research:
🔆 OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer (EMNLP24)
🏠 GitHub Repository

🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository

🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository

⭐️ Citation

If you find this repository helpful for your research, please consider citing our paper:

@article{shen2024zoomeye,
  title={ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration},
  author={Shen, Haozhan and Zhao, Kangjia and Zhao, Tiancheng and Xu, Ruochen and Zhang, Zilun and Zhu, Mingwei and Yin, Jianwei},
  journal={arXiv preprint arXiv:2411.16044},
  year={2024}
}
