[Feature] Support InstructBLIP #1685
base: dev
Changes from 4 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# InstructBLIP | ||
|
||
> [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) | ||
|
||
<!-- [ALGORITHM] --> | ||
|
||
## Abstract | ||
|
||
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although | ||
vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced. | ||
|
||
<div align=center> | ||
<img src="https://github.com/open-mmlab/mmpretrain/assets/48375204/4211e0d8-951f-48d0-b81d-34be2e777390" width="80%"/> | ||
</div> | ||
|
||
## How to use it? | ||
|
||
<!-- [TABS-BEGIN] --> | ||
|
||
**Use the model** | ||
|
||
```python | ||
from mmpretrain import inference_model | ||
|
||
result = inference_model('instructblip-vicuna7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png') | ||
print(result) | ||
# {'pred_caption': 'a blanket next to each other in the grass\na cute puppy and kitten wallpapers'} | ||
``` | ||
|
||
<!-- [TABS-END] --> | ||
|
||
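The `pred_caption` string in the example output above contains a newline, so it appears to hold two generated fragments. Below is a minimal sketch of splitting such a result into separate lines. It assumes, which the PR does not state, that each line is an independent caption candidate. The `result` dict is copied from the printed output rather than produced by running the model.

```python
# Result copied verbatim from the README example; no model is loaded here.
result = {
    'pred_caption': 'a blanket next to each other in the grass\n'
                    'a cute puppy and kitten wallpapers'
}

# Assumption: one caption candidate per line; empty fragments are dropped.
captions = [
    line.strip()
    for line in result['pred_caption'].splitlines()
    if line.strip()
]

for idx, caption in enumerate(captions):
    print(idx, caption)
```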
## Models and results | ||
|
||
For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines. | ||
Review comment: Please link to the InstructBLIP page instead: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip |
||
|
||
### Pretrained models | ||
|
||
| Model | Params (M) | Flops (G) | Config | Download | | ||
| :-------------------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :--------------------------------------------------------------------------------: | | ||
| `instructblip-vicuna7b_3rdparty-zeroshot_caption`\* | 8121.32 | N/A | [config](instructblip-vicuna7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth) | | ||
|
||
*Models with \* are converted from the [official repo](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip). The config files of these models are only for inference. We haven't reproduced the training results.* | ||
|
||
## Citation | ||
|
||
```bibtex | ||
@article{dai2023instructblip, | ||
title={InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning}, | ||
author={Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven}, | ||
journal={arXiv preprint arXiv:2305.06500}, | ||
year={2023} | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
_base_ = [ | ||
    '../_base_/datasets/coco_caption.py', | ||
    '../_base_/default_runtime.py', | ||
] | ||
|
||
# model settings | ||
model = dict( | ||
    type='InstructBlipCaption', | ||
    llm_tokenizer=dict( | ||
        type='LlamaTokenizer', | ||
        name_or_path= | ||
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'), | ||
Review comment: Don't use our path. |
||
    vision_encoder=dict( | ||
        type='BEiTViT', | ||
        # eva-g without the final layer | ||
        arch=dict( | ||
            embed_dims=1408, | ||
            num_layers=39, | ||
            num_heads=16, | ||
            feedforward_channels=6144, | ||
        ), | ||
        img_size=224, | ||
        patch_size=14, | ||
        out_indices=-2, | ||
        layer_scale_init_value=0.0, | ||
        use_abs_pos_emb=True, | ||
        use_rel_pos_bias=False, | ||
        frozen_stages=39, | ||
        final_norm=False, | ||
        use_shared_rel_pos_bias=False, | ||
        out_type='raw', | ||
        pretrained=  # noqa | ||
        'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth'  # noqa | ||
    ), | ||
    text_backbone=dict( | ||
        type='AutoModelForCausalLM', | ||
        name_or_path= | ||
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'), | ||
Review comment: Same comment as above. |
||
    Qformer=dict( | ||
        type='Qformer', | ||
        model_style='bert-base-uncased', | ||
        vision_model_width=1408, | ||
        add_cross_attention=True, | ||
        cross_attention_freq=2, | ||
        num_query_token=32), | ||
    prompt='Write a short description for the image.', | ||
    max_txt_len=30) | ||
|
||
# schedule settings | ||
optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05)) | ||
|
||
param_scheduler = [ | ||
    dict( | ||
        type='CosineAnnealingLR', | ||
        by_epoch=True, | ||
        begin=0, | ||
        end=10, | ||
    ) | ||
] | ||
|
||
train_cfg = dict(max_epochs=10) | ||
val_cfg = dict() | ||
test_cfg = dict() | ||
|
||
# dataset settings | ||
test_pipeline = [ | ||
    dict(type='LoadImageFromFile'), | ||
    dict( | ||
        type='Resize', | ||
        scale=(224, 224), | ||
        interpolation='bicubic', | ||
        backend='pillow'), | ||
    dict(type='PackInputs', meta_keys=['image_id']), | ||
] | ||
|
||
val_dataloader = dict(dataset=dict(pipeline=test_pipeline)) | ||
test_dataloader = val_dataloader |
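The schedule settings in this config pair `AdamW` at `lr=1e-5` with a `CosineAnnealingLR` scheduler running from epoch 0 to epoch 10. As a rough illustration of what that implies, the sketch below evaluates the standard closed-form cosine annealing curve with those two values. It is not MMEngine's implementation: the floor `eta_min = 0.0` is an assumption (the config sets no minimum), and `cosine_lr` is a hypothetical helper name. Since the README states that these configs are for inference only, the curve is nominal.

```python
import math

base_lr = 1e-5   # optimizer lr taken from the config
t_max = 10       # end - begin from param_scheduler
eta_min = 0.0    # assumption: the config sets no minimum, so 0 is used here


def cosine_lr(epoch: int) -> float:
    """Closed-form cosine annealing value at the start of `epoch`."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# Print the nominal learning rate at the start of each epoch.
for epoch in range(t_max + 1):
    print(epoch, f'{cosine_lr(epoch):.2e}')
```

Under these assumptions the rate starts at `1e-5`, reaches half of that at epoch 5, and decays to the floor at epoch 10.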
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
Collections: | ||
  - Name: InstructBLIP | ||
    Metadata: | ||
      Training Data: | ||
        - COCO | ||
        - VG | ||
        - CC3M | ||
        - CC12M | ||
        - SBU | ||
        - LAION-400M | ||
      Architecture: | ||
        - Transformer | ||
        - Q-Former | ||
    Paper: | ||
      Title: 'InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning' | ||
      URL: https://arxiv.org/abs/2305.06500 | ||
    README: configs/instructblip/README.md | ||
|
||
Models: | ||
  - Name: instructblip-vicuna7b_3rdparty-zeroshot_caption | ||
    Metadata: | ||
      FLOPs: null | ||
      Parameters: xxx | ||
    In Collection: InstructBLIP | ||
    Results: | ||
      - Task: Image Caption | ||
        Dataset: COCO | ||
        Metrics: null | ||
    Weights: https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth | ||
    Config: configs/instructblip/instructblip-vicuna7b_8xb32_caption.py | ||
    Converted From: | ||
      Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth | ||
      Code: https://github.com/salesforce/LAVIS |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,8 +52,13 @@ | |
'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa | ||
), | ||
lang_encoder=dict( | ||
- type='AutoModelForCausalLM', name_or_path='YOUR_PATH_TO_VICUNA'), | ||
- tokenizer=dict(type='LlamaTokenizer', name_or_path='YOUR_PATH_TO_VICUNA'), | ||
+ type='AutoModelForCausalLM', | ||
Review comment: Please revert this modification. |
||
+ name_or_path= | ||
+ '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'), | ||
+ tokenizer=dict( | ||
+ type='LlamaTokenizer', | ||
+ name_or_path= | ||
+ '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'), | ||
task='caption', | ||
prompt_template='###Human: {} ###Assistant: ', | ||
raw_prompts=[ | ||
|
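The reviewer objects to the hard-coded `/mnt/petrelfs` path in this hunk and in the new InstructBLIP config. One possible way to address both comments is sketched below: keep the `'YOUR_PATH_TO_VICUNA'` placeholder that this file used before the change, defined once and reused. The `vicuna_path` variable is a hypothetical name introduced for illustration, and the fragment shows only the two affected keys, not a complete config.

```python
# Hypothetical shared placeholder; users replace it with the local directory
# that holds their prepared Vicuna weights.
vicuna_path = 'YOUR_PATH_TO_VICUNA'

# Fragment only: the two keys touched by this hunk, both reading the same path.
model = dict(
    lang_encoder=dict(type='AutoModelForCausalLM', name_or_path=vicuna_path),
    tokenizer=dict(type='LlamaTokenizer', name_or_path=vicuna_path),
)
```

The same pattern would apply to the `llm_tokenizer` and `text_backbone` entries of the InstructBLIP config, which carry the identical path.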
Review comment: Why was this image uploaded?
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from .instructblip_caption import InstructBlipCaption | ||
|
||
__all__ = ['InstructBlipCaption'] |