Set-of-Mark Prompting for GPT-4V

🍇 [Read our arXiv Paper]   🍎 [Project Page]

Jianwei Yang*⚡, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao

* Core Contributors      ⚡ Project Lead

Introduction

We present Set-of-Mark (SoM) prompting, which simply overlays a number of spatial and speakable marks on the images to unleash the visual grounding abilities of the strongest LMM -- GPT-4V. Let's use visual prompting for vision!
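
As a rough sketch of what "overlaying marks" means in practice (illustrative only, not the toolbox's actual implementation; the OpenCV-based drawing and variable names are our assumptions), each numeric ID can simply be drawn at the centroid of its segmentation mask:

# Minimal sketch: draw a numeric ID at the centroid of each segmentation mask
import numpy as np
import cv2

def overlay_marks(image, masks):
    # image: HxWx3 uint8 RGB array; masks: list of HxW boolean arrays
    marked = image.copy()
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue  # skip empty masks
        cx, cy = int(xs.mean()), int(ys.mean())  # mask centroid
        # dark outline + white fill keeps the ID readable on any background
        cv2.putText(marked, str(idx), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 0), 4, cv2.LINE_AA)
        cv2.putText(marked, str(idx), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 255, 255), 2, cv2.LINE_AA)
    return marked

Because each mark is both visible in the image and easy to say in text ("the object labeled 3"), answers from GPT-4V can be mapped back to concrete regions.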


🔥 News

  • [11/07] We released the vision benchmark we used to evaluate GPT-4V with SoM prompting! Check out the benchmark page!

  • [11/07] Now that the GPT-4V API has been released, we are releasing a demo integrating SoM into GPT-4V! A hedged sketch of such an API call follows this news list.

export OPENAI_API_KEY=YOUR_API_KEY
python demo_gpt4v_som.py


  • [10/23] We released the SoM toolbox code for generating set-of-mark prompts for GPT-4V. Try it out!
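
Below is a minimal, hedged sketch of what such an integration can look like with the OpenAI Python SDK; the model name, prompt wording, and image path are illustrative assumptions, not the contents of demo_gpt4v_som.py:

# Hedged sketch: send an SoM-marked image to GPT-4V via the chat completions API
import base64
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("marked_image.png", "rb") as f:  # hypothetical SoM-annotated image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The image is annotated with numeric marks. Which marked region contains the dog?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)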

🔗 Related links

Fascinating applications of SoM and LMMs:

Our method combines the following models to generate the set of marks:

  • Mask DINO: State-of-the-art closed-set image segmentation model
  • OpenSeeD: State-of-the-art open-vocabulary image segmentation model
  • GroundingDINO: State-of-the-art open-vocabulary object detection model
  • SEEM: Versatile, promptable, interactive and semantic-aware segmentation model
  • Semantic-SAM: Segment and recognize anything at any granularity
  • Segment Anything: Segment anything

We are standing on the shoulders of the giant GPT-4V (playground)!

🚀 Quick Start

  • Install segmentation packages
# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..

# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install mpi4py

  • Download the pretrained models

sh download_ckpt.sh

  • Run the demo

python demo_som.py

And you will see this interface:

(screenshot of the SoM toolbox demo interface)
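
If you prefer to generate candidate regions programmatically rather than through the demo interface, here is a hedged sketch that calls the installed Segment Anything package directly; the image and checkpoint paths are assumptions (use whichever checkpoint download_ckpt.sh fetched), and the actual toolbox pipeline may differ:

# Hedged sketch: produce masks with SAM's automatic mask generator
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("your_image.jpg"), cv2.COLOR_BGR2RGB)  # any local RGB image

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)    # automatic mode: masks over the whole image
masks = mask_generator.generate(image)             # list of dicts with a "segmentation" key
binary_masks = [m["segmentation"] for m in masks]  # HxW boolean arrays
print(f"generated {len(binary_masks)} candidate regions")

The resulting boolean masks are exactly the kind of regions that set-of-mark IDs are drawn onto.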

Potential solutions for some common issues:

👉 Comparing standard GPT-4V and its combination with SoM Prompting


📍 SoM Toolbox for image partition

Users can select the granularity of masks to generate and choose between automatic (top) and interactive (bottom) modes. A higher alpha blending value (0.4) is used for better visualization.
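
A minimal sketch of that alpha blending step (illustrative only; it assumes boolean masks and an RGB image, and is not the toolbox code):

# Hedged sketch: tint each region with a color and blend it at alpha = 0.4
import numpy as np

def blend_masks(image, masks, alpha=0.4):
    # image: HxWx3 uint8; masks: list of HxW boolean arrays
    out = image.astype(np.float32)
    rng = np.random.default_rng(0)  # fixed seed so region colors are reproducible
    for mask in masks:
        color = rng.integers(0, 256, size=3).astype(np.float32)
        out[mask] = (1 - alpha) * out[mask] + alpha * color  # per-pixel blend inside the region
    return out.astype(np.uint8)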

🦄 Interleaved Prompt

SoM enables interleaved prompts that include both textual and visual content, where the visual content can be referenced by its region indices.
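
For illustration, here is a hedged example of composing such an interleaved prompt; the wording is ours, not the exact phrasing used in the paper or the demo:

# Hedged sketch: reference marked regions by index inside a textual prompt
def interleaved_prompt(region_ids, question):
    refs = ", ".join(f"[{i}]" for i in region_ids)
    return ("The image is annotated with numbered marks. "
            f"Focusing on regions {refs}, {question}")

print(interleaved_prompt([3, 7], "which object is closer to the camera?"))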

🎖️ Mark types used in SoM


🌋 Evaluation tasks examples


Use cases

🌷 Grounded Reasoning and Cross-Image Reference


Compared with GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning in the detailed contents of the image (left). Clear cross-image object references are observed on the right.

🏕️ Problem Solving


Case study on solving a CAPTCHA. GPT-4V gives a wrong answer with an incorrect number of squares, while after SoM prompting it finds the correct squares along with their corresponding marks.

🏔️ Knowledge Sharing


Case study on an image of a dish. GPT-4V does not produce a grounded answer with the original image. With SoM prompting, GPT-4V not only names the ingredients but also links them to the corresponding regions.

🕌 Personalized Suggestion


SoM-prompted GPT-4V gives very precise suggestions, while the original fails and even hallucinates foods, e.g., soft drinks.

🌼 Tool Usage Instruction

Likewise, GPT-4V with SoM can provide thorough tool usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can still provide information about the unlabeled buttons.

🌻 2D Game Planning


GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.

🕌 Simulated Navigation


🌳 Results

We conduct experiments on various vision tasks to verify the effectiveness of our SoM. Results show that GPT-4V with SoM prompting outperforms specialists on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{yang2023setofmark,
      title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V}, 
      author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
      journal={arXiv preprint arXiv:2310.11441},
      year={2023},
}
