This repo contains the code for the paper "Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models"
🌐 Homepage | 📖 arXiv | 📑 Paper
🔥[2024-10-28]: Thanks @velocityCavalry for reporting a potential bug! Updated the codebase to be more robust
🔥[2024-09-26]: Accepted to NeurIPS 2024!
🔥[2024-08-03]: Released the code for Visual Sketchpad
Install the agent environment as follows:
conda create -n sketchpad python=3.9
conda activate sketchpad
pip install pyautogen==0.2.26
pip install 'pyautogen[jupyter-executor]'
pip install Pillow joblib matplotlib opencv-python numpy gradio gradio_client networkx scipy datasets
Set up your OpenAI API key in agent/config.py. Edit the following:
# set up the LLM for the agent
os.environ['OPENAI_API_KEY'] = '[YOUR OPENAI API KEY]'
os.environ["AUTOGEN_USE_DOCKER"] = "False"
llm_config={"cache_seed": None, "config_list": [{"model": "gpt-4o", "temperature": 0.0, "api_key": os.environ.get("OPENAI_API_KEY")}]}
This is all you need for the math and geometry tasks.
For computer vision tasks, you also need to install the vision experts.
In this codebase, each vision expert is a Gradio server. You can set them up on other machines and access them through web links. This lets you run the Sketchpad agent on your computer while all the vision models run on a separate GPU server.
Follow vision_experts/installation.md to install and launch all the vision experts.
After the servers are launched, edit the Gradio server links in agent/config.py. Change the server addresses to your own:
SOM_ADDRESS = "[YOUR SOM SERVER ADDRESS]"
GROUNDING_DINO_ADDRESS = "[YOUR GroundingDINO SERVER ADDRESS]"
DEPTH_ANYTHING_ADDRESS = "[YOUR Depth-Anything SERVER ADDRESS]"
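If you want to check that the experts are reachable before running the agent, the sketch below (not part of the repo) uses gradio_client, which was installed above, to connect to each server and list its exposed endpoints. It assumes you run it from the agent/ directory so that config.py is importable.

# Hedged sketch: verify that each vision-expert Gradio server is reachable.
# The addresses come from agent/config.py; everything else here is illustrative.
from gradio_client import Client
from config import SOM_ADDRESS, GROUNDING_DINO_ADDRESS, DEPTH_ANYTHING_ADDRESS

for name, address in [
    ("SoM", SOM_ADDRESS),
    ("GroundingDINO", GROUNDING_DINO_ADDRESS),
    ("Depth-Anything", DEPTH_ANYTHING_ADDRESS),
]:
    client = Client(address)   # raises an error if the server is unreachable
    print(f"== {name} server at {address} ==")
    client.view_api()          # print the endpoints this expert exposes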
We preprocessed each task and put them into the tasks folder. Each instance of each task has its own subfolder. Some tasks are too large for the repo, so we put them in this Google Drive Link. Please download, unzip, and put the content into the tasks folder.
See agent/quick_start_math.py for a simple example of running the math tasks. The code is modular: the key function is run_agent in agent/main.py, which uses the agent to finish a task.
from main import run_agent
# run an example for graph max flow. save the execution trace, answer, and usage summary under outputs/graph_maxflow
run_agent("../tasks/graph_maxflow/5", "../outputs/graph_maxflow", task_type="math", task_name="graph_maxflow")
# run an example for geometry. save the execution trace, answer, and usage summary under outputs/geometry
run_agent("../tasks/geometry/2079", "../outputs/geometry", task_type="geo")
After installing and setting up all the Gradio servers, you can also try running the vision task agent in agent/quick_start_vision.py. The structure is similar:
from main import run_agent
# run an example for a vision task. save the execution trace to outputs/blink_spatial
run_agent("../tasks/blink_spatial/processed/val_Spatial_Relation_1", "../outputs/blink_spatial", task_type="vision")
We put the expected outputs in the outputs folder for reference.
See record_viewer.ipynb. It is a good example of how Visual Sketchpad works, and it shows how to visualize an agent's execution trace saved in output.json.
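If you just want a quick look at a trace outside the notebook, the sketch below (an assumption, not repo code) loads an output.json with the standard json module and pretty-prints it; the example path and the trace schema depend on your run, so record_viewer.ipynb remains the authoritative viewer.

# Hedged sketch: peek at a saved execution trace without the notebook.
import json

trace_path = "../outputs/geometry/2079/output.json"   # hypothetical example path
with open(trace_path) as f:
    trace = json.load(f)

# The keys inside the trace depend on the agent's output format.
print(json.dumps(trace, indent=2))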
To run all the examples in a task, run the following:
cd agent
# for example, run blink spatial relation task
python run_task.py --task blink_spatial
This will run the whole task and save all execution traces to the outputs folder. Note that the task should be one of "vstar", "blink_viscorr", "blink_semcorr", "blink_depth", "blink_jigsaw", "blink_spatial", "mmvp", "geometry", "graph_connectivity", "graph_isomorphism", "graph_maxflow", "math_convexity", "math_parity", "winner_id".
To facilitate future research, we also share the agent trajectories we obtained on all tasks in the paper in this Google Drive Link. They have the same format as the examples in the outputs folder of this repo.