Grounded SAM 2

Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Grounding DINO 1.5, Florence-2 and SAM 2.

🔥 Project Highlight

This repo provides simple implementations of the following demos:

  • Ground and Segment Anything with Grounding DINO, Grounding DINO 1.5 & 1.6 and SAM 2
  • Ground and Track Anything with Grounding DINO, Grounding DINO 1.5 & 1.6 and SAM 2
  • Detect, Segment and Track Visualization based on the powerful supervision library.

Grounded SAM 2 does not introduce significant methodological changes compared to Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. Both approaches leverage the capabilities of open-world models to address complex visual tasks. Consequently, we try to simplify the code implementation in this repository, aiming to enhance user convenience.



  • 2024/08/31: Support dumping JSON results in Grounded SAM 2 Image Demos (with Grounding DINO).
  • 2024/08/20: Support the Florence-2 SAM 2 Image Demo, which includes dense region caption, object detection, phrase grounding, and a cascaded auto-label pipeline (caption + phrase grounding).
  • 2024/08/09: Support Ground and Track New Object throughout the whole video. This feature is still under development. Credits to Shuo Shen.
  • 2024/08/07: Support Custom Video Inputs: users only need to submit their video file (e.g. an .mp4 file) with specific text prompts to get a demo video.



Download the pretrained SAM 2 checkpoints:

cd checkpoints

Download the pretrained Grounding DINO checkpoints:

cd gdino_checkpoints

Installation without docker

Install the PyTorch environment first. We use python=3.10, torch>=2.3.1, torchvision>=0.18.1 and cuda-12.1 in our environment to run this demo. Please follow the instructions here to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended. You can install the latest version of PyTorch as follows:

pip3 install torch torchvision torchaudio

Because the Deformable Attention operator used in Grounding DINO must be compiled with CUDA, check that the CUDA environment variables are set correctly (see Grounding DINO Installation for details). If you want to build a local GPU environment for Grounding DINO to run Grounded SAM 2, you can set the environment variable manually as follows:

export CUDA_HOME=/path/to/cuda-12.1/
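Before building the Deformable Attention operator, it can help to confirm that CUDA_HOME points at a real toolkit. The following is a minimal sketch, not part of this repo: the helper name check_cuda_home is hypothetical, and it assumes only that a CUDA toolkit keeps its compiler at bin/nvcc under the toolkit root.

```python
import os
from pathlib import Path


def check_cuda_home(env=None):
    """Return (ok, message) describing whether CUDA_HOME looks usable.

    Hypothetical helper: it only checks that CUDA_HOME is set and that
    a file named bin/nvcc exists underneath it.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return False, "CUDA_HOME is not set"
    nvcc = Path(cuda_home) / "bin" / "nvcc"
    if not nvcc.exists():
        return False, f"nvcc not found at {nvcc}"
    return True, f"found nvcc at {nvcc}"


if __name__ == "__main__":
    ok, message = check_cuda_home()
    print("OK" if ok else "WARNING", "-", message)
```

Running it before the Grounding DINO install step gives an early warning if the variable is unset or points at the wrong directory.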

Install Segment Anything 2:

pip install -e .

Install Grounding DINO:

pip install --no-build-isolation -e grounding_dino

Installation with docker

Build the Docker image and run the Docker container:

cd Grounded-SAM-2
make build-image
make run

After executing these commands, you will be inside the Docker environment. The working directory within the container is set to: /home/appuser/Grounded-SAM-2

Once inside the Docker environment, you can start the demo by running:


Grounded SAM 2 Demos

Grounded SAM 2 Image Demo (with Grounding DINO)

Note that Grounding DINO is already supported in Hugging Face, so we provide two ways to run the Grounded SAM 2 model:

  • Use the Hugging Face API to run Grounding DINO inference (simple and clear)


🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=

  • Load a local pretrained Grounding DINO checkpoint and run inference with the original Grounding DINO API (make sure you've already downloaded the pretrained checkpoint)

Grounded SAM 2 Image Demo (with Grounding DINO 1.5 & 1.6)

We've already released our most capable open-set detection models, Grounding DINO 1.5 & 1.6, which can be combined with SAM 2 for stronger open-set detection and segmentation capability. Apply for an API token first, then run Grounded SAM 2 with Grounding DINO 1.5 as follows:

Install the latest DDS cloudapi:

pip install dds-cloudapi-sdk

Apply for your API token from our official website here: request API token.


Automatically Saving Grounding Results (Image Demo)

After setting DUMP_JSON_RESULTS=True in the following Grounded SAM 2 Image Demos:

The grounding and segmentation results will be automatically saved in the outputs directory in the following format:

{
    "image_path": "path/to/image.jpg",
    "annotations": [
        {
            "class_name": "class_name",
            "bbox": [x1, y1, x2, y2],
            "segmentation": {
                "size": [h, w],
                "counts": "rle_encoded_mask"
            },
            "score": confidence score
        }
    ],
    "box_format": "xyxy",
    "img_width": w,
    "img_height": h
}

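As a rough illustration of how the saved results might be consumed, the sketch below parses one result in the layout described above, drops low-score annotations, and converts each xyxy box to (x, y, width, height). The function name load_detections and the sample values are made up for this example. If counts is a COCO-style RLE string, pycocotools.mask.decode should be able to turn the segmentation entry back into a binary mask, but that is an assumption about the encoding.

```python
import json


def load_detections(json_text, min_score=0.0):
    """Parse a dumped result and return (class_name, xywh_box, score) tuples."""
    result = json.loads(json_text)
    if result.get("box_format") != "xyxy":
        raise ValueError("this sketch only handles the 'xyxy' box format")
    detections = []
    for ann in result["annotations"]:
        if ann["score"] < min_score:
            continue
        x1, y1, x2, y2 = ann["bbox"]
        # Convert corner format (xyxy) to (x, y, width, height).
        detections.append((ann["class_name"], [x1, y1, x2 - x1, y2 - y1], ann["score"]))
    return detections


# Hypothetical sample in the documented layout (all values are made up).
SAMPLE = json.dumps({
    "image_path": "path/to/image.jpg",
    "annotations": [
        {"class_name": "car", "bbox": [10, 20, 110, 80],
         "segmentation": {"size": [100, 200], "counts": "rle_encoded_mask"},
         "score": 0.9},
        {"class_name": "tire", "bbox": [15, 60, 35, 80],
         "segmentation": {"size": [100, 200], "counts": "rle_encoded_mask"},
         "score": 0.3},
    ],
    "box_format": "xyxy",
    "img_width": 200,
    "img_height": 100,
})

if __name__ == "__main__":
    for name, box, score in load_detections(SAMPLE, min_score=0.5):
        print(name, box, score)
```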
Grounded SAM 2 Video Object Tracking Demo

Based on the strong tracking capability of SAM 2, we can combine it with Grounding DINO for open-set object segmentation and tracking. You can run the following scripts to get the tracking results with Grounded SAM 2:

  • The tracking results of each frame will be saved in ./tracking_results
  • The video will be saved as children_tracking_demo_video.mp4
  • You can modify this file with different text prompts and video clips to get more tracking results.
  • For simplicity, we only prompt the first video frame with Grounding DINO here.

Support Various Prompt Types for Tracking

We support different types of prompts for the Grounded SAM 2 tracking demo:

  • Point Prompt: To get stable segmentation results, we re-use the SAM 2 image predictor to get the prediction mask for each object from the Grounding DINO box outputs, then uniformly sample points from the prediction mask as point prompts for the SAM 2 video predictor.
  • Box Prompt: We directly use the box outputs from Grounding DINO as box prompts for the SAM 2 video predictor.
  • Mask Prompt: We use the SAM 2 mask prediction results based on Grounding DINO box outputs as mask prompts for the SAM 2 video predictor.
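The point-prompt step (uniformly sampling points from a predicted mask) can be sketched as follows. This is an illustration rather than the repo's code: the function name, the nested-list mask layout, and the fixed seed are assumptions, and plain Python is used only to keep the example dependency-free. With NumPy, np.argwhere on the mask would give the same list of foreground coordinates.

```python
import random


def sample_points_from_mask(mask, num_points, seed=0):
    """Uniformly sample up to num_points (x, y) coordinates from foreground pixels.

    mask is a 2D nested list (rows of 0/1 values); this layout is an
    assumption made for the sketch.
    """
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not foreground:
        return []
    rng = random.Random(seed)
    # Sampling without replacement gives distinct points, each pixel equally likely.
    return rng.sample(foreground, min(num_points, len(foreground)))


if __name__ == "__main__":
    demo_mask = [
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(sample_points_from_mask(demo_mask, num_points=3))
```

Each sampled point would then be passed to the video predictor as a positive (foreground) click for the corresponding object.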

Grounded SAM 2 Tracking Pipeline

Grounded SAM 2 Video Object Tracking Demo (with Grounding DINO 1.5 & 1.6)

We also support a video object tracking demo based on our stronger Grounding DINO 1.5 model and SAM 2. After applying for the API keys needed to run Grounding DINO 1.5, you can try the following demo:


Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO)

Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify their custom text prompts for grounding and tracking with Grounding DINO and SAM 2 by using the following scripts:


If the Hugging Face demo is not convenient for you, you can also run the tracking demo with a local Grounding DINO model using the following scripts:


Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO 1.5 & 1.6)

Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify their custom text prompts for grounding and tracking with Grounding DINO 1.5 and SAM 2 by using the following scripts:


You can specify the params in this file:

VIDEO_PATH = "./assets/hippopotamus.mp4"
TEXT_PROMPT = "hippopotamus."
OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
API_TOKEN_FOR_GD1_5 = "Your API token" # api token for G-DINO 1.5
PROMPT_TYPE_FOR_VIDEO = "mask" # using SAM 2 mask prediction as prompt for video predictor
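The trailing period in TEXT_PROMPT matches the usual convention for Grounding DINO text queries: lowercase category names, each terminated by a period. A small helper like the one below could build such a prompt from a list of class names. The name build_text_prompt is hypothetical, and the exact formatting the demo script expects should be checked against the script itself.

```python
def build_text_prompt(class_names):
    """Join class names into one lowercase prompt, each class ending with a period."""
    cleaned = [
        name.strip().lower().rstrip(".")
        for name in class_names
        if name.strip()
    ]
    return " ".join(f"{name}." for name in cleaned)


if __name__ == "__main__":
    print(build_text_prompt(["Hippopotamus"]))
    print(build_text_prompt(["car", "Tire "]))
```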

After running our demo code, you can get the tracking results as follows:


The tracking visualization results are automatically saved to OUTPUT_VIDEO_PATH.


We initialize the box prompts on the first frame of the input video. If you want to start from a different frame, you can change ann_frame_idx in our code.

Grounded-SAM-2 Video Object Tracking with Continuous ID (with Grounding DINO)

In the demos above, we only prompt Grounded SAM 2 on a specific frame, which makes it hard to pick up new objects that appear later in the video. In this demo, we try to find new objects and assign them new IDs across the whole video. This feature is still under development and is not yet stable.
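One plausible way to decide whether a fresh detection is a new object or an already tracked one is to compare it against the tracked objects by overlap. The sketch below uses box IoU with a greedy match and is only an illustration of the idea: the function names, the 0.5 threshold, and the choice of boxes rather than masks are assumptions, not a description of what the demo script does.

```python
def box_iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def assign_ids(tracked, detections, next_id, iou_threshold=0.5):
    """Give each detection an existing ID if it overlaps a tracked box, else a new ID.

    tracked maps object ID -> box; detections is a list of boxes.
    Matching is greedy, so two detections may map to the same tracked ID.
    Returns (ids, next_id).
    """
    ids = []
    for box in detections:
        best_id, best_iou = None, iou_threshold
        for obj_id, tracked_box in tracked.items():
            iou = box_iou(box, tracked_box)
            if iou >= best_iou:
                best_id, best_iou = obj_id, iou
        if best_id is None:
            # No tracked object overlaps enough: treat it as a new object.
            best_id = next_id
            next_id += 1
        ids.append(best_id)
    return ids, next_id


if __name__ == "__main__":
    tracked = {1: [0, 0, 10, 10]}
    detections = [[1, 1, 10, 10], [50, 50, 60, 60]]
    print(assign_ids(tracked, detections, next_id=2))
```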

Users can upload their own video files and specify custom text prompts for grounding and tracking using the Grounding DINO and SAM 2 frameworks. To do this, execute the script:


You can customize various parameters including:

  • text: The grounding text prompt.
  • video_dir: Directory containing the video files.
  • output_dir: Directory to save the processed output.
  • output_video_path: Path for the output video.
  • step: Frame stepping for processing.
  • box_threshold: Box threshold for the Grounding DINO model.
  • text_threshold: Text threshold for the Grounding DINO model.

Note: This method supports only the mask type of text prompt.
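To make the step and box_threshold parameters more concrete, the sketch below assumes that step is the interval, in frames, between detector runs and that detections with a score at or below box_threshold are discarded. Both readings are assumptions about the script, and the function names and the (box, score) layout are hypothetical.

```python
def frames_to_prompt(num_frames, step):
    """Return the frame indices at which the detector would be re-run."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(0, num_frames, step))


def keep_confident_boxes(detections, box_threshold):
    """Keep detections whose score exceeds box_threshold.

    detections is a list of (box, score) pairs; this layout is hypothetical.
    """
    return [(box, score) for box, score in detections if score > box_threshold]


if __name__ == "__main__":
    print(frames_to_prompt(num_frames=50, step=20))
    print(keep_confident_boxes([([0, 0, 1, 1], 0.2), ([0, 0, 2, 2], 0.6)], 0.35))
```

A smaller step re-runs the detector more often, which finds new objects sooner at the cost of more detector calls.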

After running our demo code, you can get the tracking results as follows:


If you want to try Grounding DINO 1.5 model, you can run the following scripts after setting your API token:


Grounded-SAM-2 Video Object Tracking with Continuous ID plus Reverse Tracking (with Grounding DINO)

With reverse tracking, this method can cover the whole lifetime of each object in the video.


Grounded SAM 2 Florence-2 Demos

Grounded SAM 2 Florence-2 Image Demo

In this section, we will explore how to integrate the feature-rich and robust open-source models Florence-2 and SAM 2 to develop practical applications.

Florence-2 is a powerful vision foundation model by Microsoft that supports a range of vision tasks through special task prompts (task_prompt), including but not limited to:

| Task | Task Prompt | Text Input | Task Introduction |
|------|-------------|------------|-------------------|
| Object Detection | <OD> | No | Detect main objects with single category name |
| Dense Region Caption | <DENSE_REGION_CAPTION> | No | Detect main objects with short description |
| Region Proposal | <REGION_PROPOSAL> | No | Generate proposals without category name |
| Phrase Grounding | <CAPTION_TO_PHRASE_GROUNDING> | Yes | Ground main objects in image mentioned in caption |
| Referring Expression Segmentation | <REFERRING_EXPRESSION_SEGMENTATION> | Yes | Ground the object which is most related to the text input |
| Open Vocabulary Detection and Segmentation | <OPEN_VOCABULARY_DETECTION> | Yes | Ground any object with text input |

By integrating Florence-2 with SAM 2, we can build a strong vision pipeline to solve complex vision tasks. You can try the following scripts to run the demo:


🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=

Object Detection and Segmentation

python \
    --pipeline object_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg

Dense Region Caption and Segmentation

python \
    --pipeline dense_region_caption_segmentation \
    --image_path ./notebooks/images/cars.jpg

Region Proposal and Segmentation

python \
    --pipeline region_proposal_segmentation \
    --image_path ./notebooks/images/cars.jpg

Phrase Grounding and Segmentation

python \
    --pipeline phrase_grounding_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The image shows two vintage Chevrolet cars parked side by side, with one being a red convertible and the other a pink sedan, \
            set against the backdrop of an urban area with a multi-story building and trees. \
            The cars have Cuban license plates, indicating a location likely in Cuba."

Referring Expression Segmentation

python \
    --pipeline referring_expression_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The left red car."

Open-Vocabulary Detection and Segmentation

python \
    --pipeline open_vocabulary_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "car <and> building"
  • Note that if you want to detect multiple objects, you should separate them with <and> in your input text.
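When the object names come from a list, the <and>-separated text input can be assembled programmatically. This is a minimal sketch with a hypothetical helper name; the separator format follows the "car <and> building" example above.

```python
def build_open_vocab_text(object_names):
    """Join object names with ' <and> ' for open-vocabulary detection input."""
    names = [name.strip() for name in object_names if name.strip()]
    return " <and> ".join(names)


if __name__ == "__main__":
    print(build_open_vocab_text(["car", "building"]))
```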

Grounded SAM 2 Florence-2 Image Auto-Labeling Demo

Florence-2 can be used as an automatic image annotator by cascading its caption capability with its grounding capability.

| Task | Task Prompt | Text Input |
|------|-------------|------------|
| Caption + Phrase Grounding | <CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | No |
| Detailed Caption + Phrase Grounding | <DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | No |
| More Detailed Caption + Phrase Grounding | <MORE_DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | No |

You can try the following scripts to run these demos:

Caption to Phrase Grounding

python \
    --image_path ./notebooks/images/groceries.jpg \
    --pipeline caption_to_phrase_grounding \
    --caption_type caption
  • You can specify caption_type to control the granularity of the caption. For a more detailed caption, try --caption_type detailed_caption or --caption_type more_detailed_caption.
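The cascade behind auto-labeling (caption first, then phrase grounding on that caption) can be sketched as below. The task prompt strings come from the table above. Everything else is assumed for illustration: run_task stands in for whatever function actually calls Florence-2, and its (task_prompt, image, text_input) signature is hypothetical.

```python
CAPTION_TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
}
GROUNDING_TASK_PROMPT = "<CAPTION_TO_PHRASE_GROUNDING>"


def auto_label(run_task, image, caption_type="caption"):
    """Caption the image, then ground the phrases of that caption.

    run_task(task_prompt, image, text_input) is a hypothetical callable
    wrapping the model; it returns the model output for one task.
    """
    if caption_type not in CAPTION_TASK_PROMPTS:
        raise ValueError(f"unknown caption_type: {caption_type!r}")
    caption = run_task(CAPTION_TASK_PROMPTS[caption_type], image, None)
    grounding = run_task(GROUNDING_TASK_PROMPT, image, caption)
    return caption, grounding


if __name__ == "__main__":
    # Stand-in for the real model, used only to show the call order.
    def fake_run_task(task_prompt, image, text_input):
        if task_prompt == GROUNDING_TASK_PROMPT:
            return {"grounded_caption": text_input}
        return "a placeholder caption"

    print(auto_label(fake_run_task, image=None, caption_type="detailed_caption"))
```

The boxes returned by the grounding step would then be passed to SAM 2 as box prompts to obtain masks.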


