Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction
ACL (Findings) 2024
Meishan Zhang, Hao Fei*, Bin Wang, Shengqiong Wu, Yixin Cao, Fei Li, Min Zhang (*Correspondence)
This repository contains the code of REAMO, our model for grounded MUIE.
In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work for the first time introduces the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework to analyze any IE tasks over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), REAMO, capable of extracting and grounding information from all modalities, i.e., 'recognizing everything from all modalities at once'. REAMO is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. An extensive comparison of REAMO with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research.
The REAMO MLLM consists of three main parts: a multimodal encoder, an LLM reasoner, and decoders for UIE prediction & multimodal grounding:
- Multimodal Encoding: we leverage ImageBind as a unified multimodal encoder. Via a projection layer, the representations of different inputs are aligned into language-like embeddings that are understandable to the LLM (a minimal projection sketch follows this list).
- LLM Reasoner: we use Vicuna-v1.5 as the backbone LLM.
- MUIE Decoding with Grounding: we utilize SEEM for image segmentation and video tracking, and SHAS for audio segmentation.
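To make the projection step concrete, here is a minimal, runnable sketch of the idea: mapping an ImageBind-style embedding into a few language-like soft tokens that the LLM can consume. The class name, dimensions, and number of soft tokens are illustrative assumptions, not REAMO's actual implementation.

import torch
import torch.nn as nn

# Illustrative sketch only: dimensions and token count are assumptions, not REAMO's config.
class MultimodalProjector(nn.Module):
    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        # Map one modality embedding to several "language-like" soft tokens.
        self.proj = nn.Linear(encoder_dim, llm_dim * num_tokens)

    def forward(self, modality_embed: torch.Tensor) -> torch.Tensor:
        # modality_embed: (batch, encoder_dim) -> (batch, num_tokens, llm_dim)
        batch = modality_embed.size(0)
        return self.proj(modality_embed).view(batch, self.num_tokens, -1)

# Usage: project a dummy ImageBind-style embedding into soft tokens for the LLM.
fake_embed = torch.randn(1, 1024)
soft_tokens = MultimodalProjector()(fake_embed)
print(soft_tokens.shape)  # torch.Size([1, 4, 4096])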
We leverage three groups of data for training and evaluation:
- Datasets for Alignment Learning: CC3M, WebVid, and AudioCaps.
- Datasets for Cross-modal Grounding-aware Tuning: MSCOCO, TAO, and SpeechNER.
- Datasets for Specific Task Learning: PASCAL-C, VRD, imSitu, ACE2005, ReTACRED, VidSitu, Twt17, MNRE, and M2E2.
Please first set up the required environment and clone the repository by running the following commands:
conda env create -n reamo python=3.8
conda activate reamo
# CUDA 12.1
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
git clone https://github.com/scofield7419/MUIE-REAMO.git
cd MUIE-REAMO
pip install -r requirements.txt
First, modify the parameter DATASET_NAME_LIST to specify the datasets used for training and fine-tuning.
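For instance, a hypothetical setting that selects the three alignment-learning corpora might look as follows (the exact dataset identifiers expected by the code may differ):

# Hypothetical illustration only; check the code for the exact dataset identifiers.
DATASET_NAME_LIST = ["cc3m", "webvid", "audiocaps"]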
Then, run the corresponding command for training and fine-tuning:
# for alignment learning
bash pretrain.sh
# for Fine-grained Cross-modal Grounding-aware Tuning and Invocation-based Meta-response Tuning.
bash fine-tune.sh
First, prepare the checkpoints and adjust the following two parameters:
- model_base: the checkpoint path of the base model.
- model_path: the checkpoint path of the fine-tuned parameters.
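For reference, here is a minimal sketch of how a base model and separately saved fine-tuned parameters are commonly combined, assuming LoRA-style adapters loaded with Hugging Face transformers and PEFT; this is an assumption for illustration, and the paths are hypothetical, since predict.py handles the actual loading internally.

# Sketch under the assumption of LoRA-style adapters; predict.py does the real loading.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_base = "./checkpoints/vicuna-7b-v1.5"   # hypothetical path to the base LLM
model_path = "./checkpoints/reamo-finetuned"  # hypothetical path to fine-tuned parameters

tokenizer = AutoTokenizer.from_pretrained(model_base)
model = AutoModelForCausalLM.from_pretrained(model_base)
model = PeftModel.from_pretrained(model, model_path)  # attach the fine-tuned weights
model.eval()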
Then, run the command to get the prediction:
python predict.py
@inproceedings{Zhang00W0LZ24,
author = {Meishan Zhang and
Hao Fei and
Bin Wang and
Shengqiong Wu and
Yixin Cao and
Fei Li and
Min Zhang},
title = {Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
pages = {14498--14511},
year = {2024}
}
Our code is based on the respective official repositories of NExT-GPT, SEEM, and SHAS. We sincerely thank the authors for releasing their code.