Ziyu Liu* · Zeyi Sun* · Yuhang Zang · Wei Li · Pan Zhang · Xiaoyi Dong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang
📖Paper |🏠Homepage
In this paper, we highlight the potential of combining retrieving and ranking with multi-modal large language models to revolutionize perception tasks such as fine-grained recognition, zero-shot image recognition, and few-shot object recognition. Motivated by the limited zero-shot/few-shot performance of CLIP and MLLMs on fine-grained datasets, our RAR designs a pipeline that uses an MLLM to rank the retrieved results. Our proposed approach can be seamlessly integrated into various MLLMs for real-world applications where the variety and volume of categories continuously expand. Our method opens up new avenues for augmenting MLLMs with retrieving-augmented solutions and could also benefit other tasks, such as reasoning and generation, in future work.
- 🚀 [03/25/2024] We are excited to announce the release of our fine-tuning data, along with the code used to generate it. Our sample JSON data is based on the FGVC-Aircraft dataset. You are encouraged to extend your research and experiments to additional datasets to uncover even more possibilities!
- 🚀 [03/20/2024] We have uploaded part of our code to GitHub, including Fine-Grained Visual Recognition and Few-Shot Image Recognition. More updates are coming soon!
- 🚀 [03/20/2024] Our work is submitted to arXiv.
- 🔥 We conduct an in-depth analysis of the strengths and weaknesses of VLMs and MLLMs in processing fine-grained datasets.
- 🔥 Our RAR can be seamlessly integrated into various MLLMs in a plug-and-play manner.
- 🔥 Through rigorous testing across 11 classification datasets and 2 object detection datasets, we demonstrate that our method outperforms baselines on a variety of visual recognition tasks.
- Install
- Prepare Data
- Generate finetune data
- Few-Shot Image Classification
- Fine-Grained Visual Recognition
If you are not using Linux, do NOT proceed; see the instructions for macOS and Windows.
- Clone this repository and navigate to RAR folder
git clone https://github.com/Liuziyu77/RAR.git
cd RAR
- Prepare the environment step-by-step:
conda create -n rar python=3.10.13 -y # create RAR conda environment
conda activate rar # activate the environment and install dependencies
Navigate to the CLIP-Cls folder, and prepare the data following the instructions.
In our experiments, we have finetuned several MLLMs (Multimodal Large Language Models). The purpose of finetuning these models is to tap into their classification potential, enabling the MLLMs to provide answers in a standardized format. This facilitates the processing of our final results.
Within the finetune folder, we have included an .ipynb file for generating finetune data. The JSON file in this folder, based on the FGVC-Aircraft dataset, contains pre-generated finetune data. With minor format adjustments (a sketch of one such conversion is given after the example below), this JSON file can be used for finetuning models such as LLaVA, InternLM-XComposer, Qwen, and others.
A finetune data example is shown below:
{
"id": 0,
"image": [
"your picture path"
],
"conversations": [
{
"from": "user",
"value": "Here is a image:<Img index=1><image></Img>. Please play the role of a aircraft classification expert,
and sort the provided categories from high to low according to the top 5 similarity with the input image.
Here are the optional categories:['707-320', 'DC-8', 'DC-6', 'L-1011', '707-320']."
},
{
"from": "assistant",
"value": "['707-320', '707-320', 'DC-8', 'DC-6', 'L-1011']"
}
]
}
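With the example above in mind, here is a minimal, hypothetical sketch of the kind of format adjustment mentioned earlier, converting a record into a LLaVA-style conversation entry. The input fields follow the sample JSON; the output field names follow LLaVA's commonly used convention, and the file names are illustrative.

```python
import json

# Hypothetical conversion of the sample finetune data into a LLaVA-style record.
# Input fields ("id", "image", "conversations" with "user"/"assistant") follow the
# example above; the output uses LLaVA's usual "human"/"gpt" roles, where the image
# position is marked by a bare "<image>" token.
ROLE_MAP = {"user": "human", "assistant": "gpt"}

def to_llava_format(sample):
    return {
        "id": sample["id"],
        "image": sample["image"][0],            # LLaVA expects one image path per sample
        "conversations": [
            {
                "from": ROLE_MAP[turn["from"]],
                "value": turn["value"].replace("<Img index=1><image></Img>", "<image>"),
            }
            for turn in sample["conversations"]
        ],
    }

# Assuming the provided JSON file stores a list of such records (file names are illustrative):
with open("finetune_data_aircraft.json") as f:
    samples = json.load(f)
with open("finetune_data_llava.json", "w") as f:
    json.dump([to_llava_format(s) for s in samples], f, indent=2)
```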
Navigate to the Few_shot folder, and run build_memory.ipynb step by step to construct the external memory.
When you finish the step above, three files will be generated:
{dataset_name}_{shot_number}_shot_database.txt
{dataset_name}_{shot_number}_shot_img_index.index
predictions_{shot_number}_shot_knn.pth
# For different datasets, we have different files.
# eg. caltech101_4_shot_database.txt
# eg. eurosat_8_shot_img_index.index
# eg. predictions_16_shot_knn.pth
The .index file stores the index of image embeddings that make up the memory, the .txt file lists filenames and labels in the corresponding order, and the .pth file contains test results obtained with the CLIP+KNN method; you can use the code in CLIP_Cls to test its accuracy.
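For reference, below is a minimal sketch of how such a memory can be queried once these three files exist. It assumes CLIP image embeddings stored in a FAISS index and one filename/label pair per line in the .txt file; the CLIP backbone, file parsing, and query image path are assumptions rather than the repository's exact settings.

```python
import faiss
import numpy as np
import torch
import clip                      # OpenAI CLIP
from PIL import Image

dataset_name, shot_number, top_k = "caltech101", 4, 5        # example settings

# Load the external memory: FAISS index of image embeddings + labels in matching order.
index = faiss.read_index(f"{dataset_name}_{shot_number}_shot_img_index.index")
with open(f"{dataset_name}_{shot_number}_shot_database.txt") as f:
    labels = [line.strip().split()[-1] for line in f]        # assumes "filename label" per line

# Encode a query image with CLIP (the exact backbone used in the repo may differ).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feat = model.encode_image(image)
feat = feat / feat.norm(dim=-1, keepdim=True)                # normalize to match the memory

# Retrieve the top-k nearest neighbors; their labels become candidates for MLLM ranking.
_, idx = index.search(feat.cpu().numpy().astype(np.float32), top_k)
candidates = [labels[i] for i in idx[0]]
print(candidates)
```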
After that, you can test retrieve-and-rank by running retrieve_and_rerank.py. A new .pth file will be saved; it records the answers of the MLLM after ranking the retrieved results.
Before you run retrieve_and_rerank.py, three parameters need to be changed:
shot_number = 4
top_k = 5
dataset_name = 'caltech101'
shot_number = 4 corresponds to the 4-shot setting, top_k controls the number of retrieved items, and dataset_name decides which dataset is tested.
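The ranking step then hands the retrieved candidates to the MLLM using the same prompt format as the finetune data shown earlier. Below is a rough sketch: query_mllm is a placeholder for whatever MLLM chat interface you use (LLaVA, InternLM-XComposer, Qwen-VL, ...), and the expert role string is illustrative.

```python
import ast

def build_rank_prompt(candidates, expert_role="image classification"):
    # Mirrors the prompt format of the finetune data example above (kept verbatim so the
    # prompt matches what the finetuned MLLM has seen).
    return (
        "Here is a image:<Img index=1><image></Img>. "
        f"Please play the role of a {expert_role} expert, "
        "and sort the provided categories from high to low according to the "
        f"top {len(candidates)} similarity with the input image. "
        f"Here are the optional categories:{candidates}."
    )

def rank_with_mllm(image_path, candidates, query_mllm):
    # `query_mllm(image_path, prompt)` is a placeholder for your MLLM's generation API;
    # the finetuned model answers with a Python-style list, so the reply can be parsed directly.
    answer = query_mllm(image_path, build_rank_prompt(candidates))
    ranked = ast.literal_eval(answer)
    return ranked[0]      # top-1 prediction after re-ranking
```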
In this experiment, our testing is based on FineR. Therefore, the first step is to clone that project using the git clone command and install its required environment.
After that, navigate to the Fine-Grained Visual Recognition folder and run build_memory.ipynb step by step to build the memory for the five datasets (Pets37, Dogs120, Cars196, Flowers102, and Bird200). We have prepared the built memory indices and category names in the Fine-Grained Visual Recognition/database folder, which is organized as shown below:
├── database/
│   ├── Pets37/
│   │   ├── classnames.txt
│   │   ├── paired_data_pets.txt
│   │   ├── pets37_database.index
│   │   └── pets37_database.txt
│   ├── Dog120/
│   ├── Flowers102/
│   ├── Cars196/
│   └── Bird200/
Next, you can run our provided Fine-Grained Visual Recognition/retrieve_test.ipynb notebook to use our retrieval method for reselecting names. Once you have the new names, replace the names in FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json with them, and run sh FineR/scripts_eval/p_pipe.sh to evaluate the sACC and cACC. A hypothetical helper for this replacement is sketched below.
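The name replacement can be scripted as below; note that the structure of FineR's guessed-names JSON is an assumption here (a flat list of class names), so inspect the actual file before overwriting it.

```python
import json

def replace_guessed_names(guessed_path, retrieved_names):
    """Swap FineR's LLM-guessed names with the names selected by retrieve_test.ipynb.

    The guessed-names file is assumed to be a flat JSON list of class names;
    adapt this helper if the real file has a different structure.
    """
    with open(guessed_path) as f:
        data = json.load(f)
    if not (isinstance(data, list) and len(data) == len(retrieved_names)):
        raise ValueError("Unexpected JSON structure; adapt this helper to the real file.")
    with open(guessed_path, "w") as f:
        json.dump(retrieved_names, f, indent=2)

# Example call (the variable holding the retrieved names is illustrative):
# replace_guessed_names(
#     "FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json",
#     names_from_retrieve_test,
# )
```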
@misc{liu2024rar,
title={RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition},
author={Ziyu Liu and Zeyi Sun and Yuhang Zang and Wei Li and Pan Zhang and Xiaoyi Dong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
year={2024},
eprint={2403.13805},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Usage and License Notices: The data and code are intended and licensed for research use only.