🔍 ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Zoom Eye enables MLLMs to (a) answer the question directly when the visual information is adequate, (b) zoom in gradually for a closer examination, and (c) zoom out to the previous view and explore other regions if the desired information is not initially found.
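Conceptually, this behavior can be viewed as a search over an image tree whose root is the full image and whose children are zoomed-in sub-views. The sketch below is only a simplified illustration of that search pattern under our own assumptions, not the repository's implementation; answer_fn and confidence_fn are hypothetical placeholders for the MLLM calls that judge and answer each view.

def zoom_eye_search(view, question, answer_fn, confidence_fn,
                    threshold=0.6, max_depth=3, depth=0):
    """Conceptual sketch of tree-based zooming (view is a PIL.Image)."""
    # (a) Answer directly when the current view already holds enough information.
    if confidence_fn(view, question) >= threshold or depth == max_depth:
        return answer_fn(view, question)

    # (b) Zoom in: split the view into four sub-views and visit the most
    #     promising one first.
    w, h = view.size
    children = [view.crop((x, y, x + w // 2, y + h // 2))
                for x in (0, w // 2) for y in (0, h // 2)]
    children.sort(key=lambda c: confidence_fn(c, question), reverse=True)
    for child in children:
        answer = zoom_eye_search(child, question, answer_fn, confidence_fn,
                                 threshold, max_depth, depth + 1)
        if answer is not None:
            return answer

    # (c) Nothing useful in this subtree: zoom out (return to the parent view)
    #     so that other regions can be explored.
    return None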
2025.01.01 🌟 We released the Project Page of ZoomEye, welcome to visit!
2024.11.30 🌟 We released the evaluation code for MME-RealWorld.
2024.11.25 🌟 We released the evaluation code for V* Bench and HR-Bench.
🌟 We released the ArXiv paper.
This project is built on LLaVA-NeXT. If you encounter unknown errors during installation, you can refer to the issues and solutions in that repository.
git clone https://github.com/om-ai-lab/ZoomEye.git
cd ZoomEye
conda create -n zoom_eye python=3.10 -y
conda activate zoom_eye
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
In our work, we implement Zoom Eye with the LLaVA-v1.5 and LLaVA-OneVision (OV) series. You can download these checkpoints before running, or let them be downloaded automatically when the from_pretrained method of transformers is executed.
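As a convenience, the checkpoints can also be pre-downloaded with huggingface_hub; the OneVision repo id below matches the one used in the demo command later in this README, while the LLaVA-v1.5 id is only an assumed example.

from huggingface_hub import snapshot_download

# Pre-fetch checkpoints so that later from_pretrained calls load from the local cache.
# Adjust the repo ids to the exact checkpoints you intend to use.
for repo_id in [
    "lmms-lab/llava-onevision-qwen2-7b-ov",  # LLaVA-OneVision (used in the demo below)
    "liuhaotian/llava-v1.5-7b",              # LLaVA-v1.5 (assumed example id)
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_dir}")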
The core evaluation data to be used (including V* Bench and HR-Bench) has been packaged together, and the download link is provided here. After downloading, please unzip it; its path is referred to as <anno path> below.
[Optional] If you want to evaluate ZoomEye on the MME-RealWorld benchmark, you can follow the instructions in this repository to download the images and extract them to the <anno path>/mme-realworld directory. Also place the annotation_mme-realworld.json file from this link into <anno path>/mme-realworld.
The folder tree is as follows:
zoom_eye_data
├── hr-bench_4k
│   ├── annotation_hr-bench_4k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── hr-bench_8k
│   ├── annotation_hr-bench_8k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── vstar
│   ├── annotation_vstar.json
│   ├── direct_attributes/
│   │   ├── some.jpg
│   │   └── ...
│   └── relative_positions/
│       ├── some.jpg
│       └── ...
└── mme-realworld
    ├── annotation_mme-realworld.json
    ├── AutonomousDriving/
    ├── MME-HD-CN/
    ├── monitoring_images/
    ├── ocr_cc/
    └── remote_sensing/
        └── ...
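A quick, optional sanity check of this layout (a small helper we suggest here, not a script shipped with the repository) is to confirm that each benchmark's annotation file exists under <anno path>:

import os
import sys

# Usage: python check_anno_layout.py <anno path>
anno_path = sys.argv[1]

expected = {
    "vstar": "annotation_vstar.json",
    "hr-bench_4k": "annotation_hr-bench_4k.json",
    "hr-bench_8k": "annotation_hr-bench_8k.json",
    "mme-realworld": "annotation_mme-realworld.json",  # only needed for the optional benchmark
}

for bench, anno_file in expected.items():
    path = os.path.join(anno_path, bench, anno_file)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{bench:<15} {status:<8} {path}")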
We provide a demo script for Zoom Eye that accepts any input image-question pair.
python ZoomEye/demo.py \
--model-path lmms-lab/llava-onevision-qwen2-7b-ov \
--input_image demo/demo.jpg \
--question "What is the color of the soda can?"
The zoomed views produced by Zoom Eye will be saved into the demo folder.
We also provide a Gradio demo; run the script below and open http://127.0.0.1:7860/ in your browser.
python gdemo_gradio.py
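For reference, the general shape of such a Gradio wrapper is sketched below; answer_question is a hypothetical placeholder for the Zoom Eye inference call, and gdemo_gradio.py in this repository remains the authoritative implementation.

import gradio as gr

def answer_question(image_path, question):
    # Placeholder: this is where Zoom Eye would run its tree-based zooming
    # over the input image and query the MLLM for the final answer.
    return "answer text"

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="filepath", label="Input image"),
            gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Zoom Eye demo",
)

# Serves on http://127.0.0.1:7860/ as mentioned above.
demo.launch(server_name="127.0.0.1", server_port=7860)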
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
vstar
# Get the result
python ZoomEye/eval/eval_results_vstar.py --answers-file ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
<mllm model> refers to one of the MLLM checkpoints mentioned above, and <anno path> is the path to the evaluation data.
If you don't have a multi-GPU environment, you can set CUDA_VISIBLE_DEVICES=0.
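If you want to inspect merge.jsonl yourself, the snippet below shows the general idea; the prediction and ground_truth field names are assumptions on our part, and eval_results_vstar.py remains the authoritative scorer.

import json

# Illustrative only: the "prediction" and "ground_truth" keys are assumed,
# not guaranteed to match the actual merge.jsonl schema.
answers_file = "ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl"

with open(answers_file) as f:
    records = [json.loads(line) for line in f if line.strip()]

correct = sum(r.get("prediction") == r.get("ground_truth") for r in records)
print(f"accuracy: {correct}/{len(records)} = {correct / len(records):.2%}")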
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_4k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_8k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
python ZoomEye/eval/perform_zoom_eye.py \
--model-path <mllm model> \
--annotation_path <anno path> \
--benchmark <bench name> \
--direct-answer
# Get the result
python ZoomEye/eval/eval_results_{vstar/hr-bench}.py --answers-file ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
mme-realworld
# Get the result
python ZoomEye/eval/eval_results_mme-realworld.py --answers-file ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
If you are intrigued by multimodal large language models and agent technologies, we invite you to explore our other research:
🔆 OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer (EMNLP24)
🏠 GitHub Repository
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository
🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository
If you find this repository helpful to your research, please cite our paper:
@article{shen2024zoomeye,
title={ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration},
author={Shen, Haozhan and Zhao, Kangjia and Zhao, Tiancheng and Xu, Ruochen and Zhang, Zilun and Zhu, Mingwei and Yin, Jianwei},
journal={arXiv preprint arXiv:2411.16044},
year={2024}
}