- [18/10/2024] We release the benchmark code.
- [16/06/2024] 📄 Our paper is released on arXiv! We will release our benchmark code soon.
- Updates & News
- Contents
- Dataset: GUI-World
- GUI-Vid: A GUI-Oriented VideoLLM
- Contribution
- Acknowledgments
- Citation
GUI-World introduces a comprehensive benchmark for evaluating MLLMs in dynamic and complex GUI environments. It features extensive annotations covering six GUI scenarios and eight types of GUI-oriented questions. Using GUI-World, we assess state-of-the-art ImageLLMs and VideoLLMs, highlighting their limitations in handling dynamic and multi-step tasks. The benchmark provides valuable insights and a foundation for future research on enhancing the understanding and interaction capabilities of MLLMs with dynamic GUI content, and aims to advance the development of robust GUI agents capable of perceiving and interacting with both static and dynamic GUI elements.
GUI-World is split into a train and a test set, both of which can be accessed from Hugging Face.
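For convenience, here is a minimal sketch for fetching the dataset snapshot with `huggingface_hub`; the repo id below is a placeholder, so substitute the GUI-World dataset repo linked above.

```python
# Minimal sketch: download the GUI-World dataset snapshot from the Hugging Face Hub.
# The repo id is a placeholder -- replace it with the GUI-World dataset repo linked above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<hf-namespace>/GUI-World",  # placeholder repo id
    repo_type="dataset",
    local_dir="data/GUI-World",
)
print(f"GUI-World downloaded to {local_dir}")
```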
GUI-Vid is a VideoLLM fine-tuned from VideoChat2. You can reproduce our experimental results by following the instructions below.
Prepare the Environment
```bash
git clone https://github.com/Dongping-Chen/GUI-World.git
cd GUI-World/GUI_Vid
conda create -n gui python=3.9
conda activate gui
pip install -r requirements.txt
```
GUI-Oriented Finetuning
- Download GUI-World and set the root path in `GUI_Vid/configs/instruction_data.py` to the directory where you downloaded GUI-World.
- Set `vit_blip_model_path`, `llama_model_path`, and `videochat2_model_path` in `GUI_Vid/scripts/config_7b_stage3.py`; these checkpoints can be downloaded from GUI-Vid.
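For reference, here is a sketch of how those three paths might look in `GUI_Vid/scripts/config_7b_stage3.py`; the surrounding config layout and the example paths are assumptions, only the three variable names come from this repo.

```python
# Sketch of the checkpoint paths to edit in GUI_Vid/scripts/config_7b_stage3.py.
# The example paths below are placeholders -- point them at the checkpoints you downloaded from GUI-Vid.
vit_blip_model_path = "/path/to/vit_blip_checkpoint.pth"        # vision/Q-Former weights
llama_model_path = "/path/to/vicuna-7b"                         # base LLM weights
videochat2_model_path = "/path/to/videochat2_checkpoint.pth"    # VideoChat2 checkpoint
```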
```bash
# Vicuna
bash GUI_Vid/scripts/run_7b_stage3.sh
```
Inference with GUI-Vid
You can first download the checkpoint from Hugging Face. You also need to set the config according to the guidance in VideoChat2.
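If you prefer to script the download, the same `snapshot_download` pattern shown for the dataset also works here; the repo id below is a placeholder for the GUI-Vid checkpoint repo linked above.

```python
# Minimal sketch: fetch the GUI-Vid checkpoint from the Hugging Face Hub.
# The repo id is a placeholder -- replace it with the GUI-Vid repo linked above.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="<hf-namespace>/GUI-Vid",  # placeholder repo id
    local_dir="checkpoints/GUI-Vid",
)
print(f"GUI-Vid checkpoint downloaded to {ckpt_dir}")
```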
Then, set `model_path` in `scripts/demo_local.py`. Use the following script to run inference on a GUI video:
```bash
python demo_local.py \
    --ckpt_path <path to GUI-Vid> \
    --keyframe 8 \
    --video_path <path to your video> \
    --qs <your query>
```
In our paper, we use five settings to extract keyframes from each video. For `Human` and `Linspace` (uniform sampling of 10 frames per video with equal intervals between frames; this was previously called the `Random` setting, and we renamed it `Linspace` to avoid confusion), you can refer to the original annotation files and reproduce it with `np.linspace`. For `Program`, we use Katna to extract keyframes; our code is in `GUI_Vid/scripts/katna.py`. For VIP and R3M based on UVD, which are additional experiments from the NeurIPS rebuttal, we extract keyframes locally, and you can download them from this link.
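As a concrete illustration of the `Linspace` setting described above, here is a minimal sketch that uniformly samples 10 frames with `np.linspace` and OpenCV; it is not the exact annotation pipeline, and the video path is only an example.

```python
# Minimal sketch of the Linspace setting: uniformly sample keyframes from a video.
import cv2
import numpy as np

def linspace_keyframes(video_path: str, num_frames: int = 10):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Equal-interval frame indices, matching the uniform-sampling description above.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = linspace_keyframes("example_gui_video.mp4")  # example path
print(f"Sampled {len(frames)} keyframes")
```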
Contributions to this project are welcome. Please consider the following ways to contribute:
- Proposing new features or improvements
- Benchmarking other mainstream MLLMs
Many thanks to Yinuo Liu, Zhengyan Fu, Shilin Zhang, Yu, Tianhe Gu, Haokuan Yuan, and Junqi Wang for their invaluable effort in this project. This project is based on methodologies and code presented in VideoChat2.
@article{chen2024gui,
title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents},
author={Chen, Dongping and Huang, Yue and Wu, Siyuan and Tang, Jingyu and Chen, Liuyi and Bai, Yilin and He, Zhigang and Wang, Chenlong and Zhou, Huichi and Li, Yiqiang and others},
journal={arXiv preprint arXiv:2406.10819},
year={2024}
}