🔍 ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Zoom Eye enables MLLMs to (a) answer the question directly when the visual information is adequate, (b) zoom in gradually for a closer examination, and (c) zoom out to the previous view and explore other regions if the desired information is not initially found.
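Conceptually, this behavior can be viewed as a search over an image tree whose root is the full image and whose children are zoomed-in sub-views. The sketch below is only a simplified illustration of that search pattern under our own assumptions, not the repository's implementation; answer_fn and confidence_fn are hypothetical placeholders for the MLLM calls that judge and answer each view.

def zoom_eye_search(view, question, answer_fn, confidence_fn,
                    threshold=0.6, max_depth=3, depth=0):
    """Conceptual sketch of tree-based zooming (view is a PIL.Image)."""
    # (a) Answer directly when the current view already holds enough information.
    if confidence_fn(view, question) >= threshold or depth == max_depth:
        return answer_fn(view, question)

    # (b) Zoom in: split the view into four sub-views and visit the most
    #     promising one first.
    w, h = view.size
    children = [view.crop((x, y, x + w // 2, y + h // 2))
                for x in (0, w // 2) for y in (0, h // 2)]
    children.sort(key=lambda c: confidence_fn(c, question), reverse=True)
    for child in children:
        answer = zoom_eye_search(child, question, answer_fn, confidence_fn,
                                 threshold, max_depth, depth + 1)
        if answer is not None:
            return answer

    # (c) Nothing useful in this subtree: zoom out (return to the parent view)
    #     so that other regions can be explored.
    return None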
2025.01.01 🌟 We released the Project Page of ZoomEye, welcome to visit!
2024.11.30 🌟 We released the evaluation code for MME-RealWorld.
2024.11.25 🌟 We released the evaluation code for V* Bench and HR-Bench.
🌟 We released the ArXiv paper.
This project is built on LLaVA-NeXT. If you encounter unknown errors during installation, you can refer to the issues and solutions in that repository.
git clone https://github.com/om-ai-lab/ZoomEye.git
cd ZoomEye
conda create -n zoom_eye python=3.10 -y
conda activate zoom_eye
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
In our work, we implement Zoom Eye with the LLaVA-v1.5 and LLaVA-OneVision (OV) series. You can download these checkpoints before running, or let them be downloaded automatically when the from_pretrained method of transformers is executed.
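As a convenience, the checkpoints can also be pre-downloaded with huggingface_hub; the OneVision repo id below matches the one used in the demo command later in this README, while the LLaVA-v1.5 id is only an assumed example.

from huggingface_hub import snapshot_download

# Pre-fetch checkpoints so that later from_pretrained calls load from the local cache.
# Adjust the repo ids to the exact checkpoints you intend to use.
for repo_id in [
    "lmms-lab/llava-onevision-qwen2-7b-ov",  # LLaVA-OneVision (used in the demo below)
    "liuhaotian/llava-v1.5-7b",              # LLaVA-v1.5 (assumed example id)
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_dir}")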
The core evaluation data to be used (including V* Bench and HR-Bench) has been packaged together, and the download link is provided here. After downloading, please unzip it; its path is referred to as <anno path> below.
[Optional] If you want to evaluate ZoomEye on the MME-RealWorld benchmark, you can follow the instructions in this repository to download the images and extract them to the <anno path>/mme-realworld directory. Also place the annotation_mme-realworld.json file from this link into <anno path>/mme-realworld.
The folder tree is as follows:
zoom_eye_data
├── hr-bench_4k
│   ├── annotation_hr-bench_4k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── hr-bench_8k
│   ├── annotation_hr-bench_8k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── vstar
│   ├── annotation_vstar.json
│   ├── direct_attributes/
│   │   ├── some.jpg
│   │   └── ...
│   └── relative_positions/
│       ├── some.jpg
│       └── ...
└── mme-realworld
    ├── annotation_mme-realworld.json
    ├── AutonomousDriving/
    ├── MME-HD-CN/
    ├── monitoring_images/
    ├── ocr_cc/
    └── remote_sensing/
        └── ...
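A quick, optional sanity check of this layout (a small helper we suggest here, not a script shipped with the repository) is to confirm that each benchmark's annotation file exists under <anno path>:

import os
import sys

# Usage: python check_anno_layout.py <anno path>
anno_path = sys.argv[1]

expected = {
    "vstar": "annotation_vstar.json",
    "hr-bench_4k": "annotation_hr-bench_4k.json",
    "hr-bench_8k": "annotation_hr-bench_8k.json",
    "mme-realworld": "annotation_mme-realworld.json",  # only needed for the optional benchmark
}

for bench, anno_file in expected.items():
    path = os.path.join(anno_path, bench, anno_file)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{bench:<15} {status:<8} {path}")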
We provide a demo script for Zoom Eye that accepts any input image-question pair.
python ZoomEye/demo.py \
--model-path lmms-lab/llava-onevision-qwen2-7b-ov \
--input_image demo/demo.jpg \
--question "What is the color of the soda can?"
The zoomed views produced by Zoom Eye will be saved into the demo folder.
We also provide a Gradio demo; run the script below and open http://127.0.0.1:7860/ in your browser.
python gdemo_gradio.py
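For reference, the general shape of such a Gradio wrapper is sketched below; answer_question is a hypothetical placeholder for the Zoom Eye inference call, and gdemo_gradio.py in this repository remains the authoritative implementation.

import gradio as gr

def answer_question(image_path, question):
    # Placeholder: this is where Zoom Eye would run its tree-based zooming
    # over the input image and query the MLLM for the final answer.
    return "answer text"

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="filepath", label="Input image"),
            gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Zoom Eye demo",
)

# Serves on http://127.0.0.1:7860/ as mentioned above.
demo.launch(server_name="127.0.0.1", server_port=7860)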
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
vstar
# Get the result
python ZoomEye/eval/eval_results_vstar.py --answers-file ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
<mllm model> refers to one of the MLLM checkpoints mentioned above, and <anno path> is the path to the evaluation data.
If you don't have a multi-GPU environment, you can set CUDA_VISIBLE_DEVICES=0.
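If you want to inspect merge.jsonl yourself, the snippet below shows the general idea; the prediction and ground_truth field names are assumptions on our part, and eval_results_vstar.py remains the authoritative scorer.

import json

# Illustrative only: the "prediction" and "ground_truth" keys are assumed,
# not guaranteed to match the actual merge.jsonl schema.
answers_file = "ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl"

with open(answers_file) as f:
    records = [json.loads(line) for line in f if line.strip()]

correct = sum(r.get("prediction") == r.get("ground_truth") for r in records)
print(f"accuracy: {correct}/{len(records)} = {correct / len(records):.2%}")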
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_4k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_8k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
python ZoomEye/eval/perform_zoom_eye.py \
--model-path <mllm model> \
--annotation_path <anno path> \
--benchmark <bench name> \
--direct-answer
# Get the result
python ZoomEye/eval/eval_results_{vstar/hr-bench}.py --answers-file ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
mme-realworld
# Get the result
python ZoomEye/eval/eval_results_mme-realworld.py --answers-file ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
If you are intrigued by multimodal large language models and agent technologies, we invite you to explore our other research:
🔆 OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer (EMNLP24)
🏠 GitHub Repository
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository
🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository
If you find this repository helpful to your research, please cite our paper:
@article{shen2024zoomeye,
title={ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration},
author={Shen, Haozhan and Zhao, Kangjia and Zhao, Tiancheng and Xu, Ruochen and Zhang, Zilun and Zhu, Mingwei and Yin, Jianwei},
journal={arXiv preprint arXiv:2411.16044},
year={2024}
}