DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

The World's Top-Performing Vision Model for Open-World Object Detection

The project provides examples for using DINO-X, which is hosted on DeepDataSpace.

Highlights

Building on Grounding DINO 1.5, DINO-X introduces several improvements that move it toward a more general object-centric vision model. Its highlights are as follows:

✨ The Strongest Open-Set Detection Performance: DINO-X Pro set new SOTA results on zero-shot transfer detection benchmarks: 56.0 AP on COCO, 59.8 AP on LVIS-minival and 52.4 AP on LVIS-val. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 box AP and 5.0 box AP. Such a result underscores its significantly enhanced capacity for recognizing long-tailed objects.

🔥 Diverse Input Prompt and Multi-level Output Semantic Representations: DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it outputs representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions, with multiple perception heads.

🍉 Rich and Practical Capabilities: DINO-X supports many practical tasks at once, including Open-Set Object Detection and Segmentation, Phrase Grounding, Visual-Prompt Counting, Pose Estimation, and Region Captioning. We also developed a universal object prompt that enables Prompt-Free Anything Detection and Recognition.

🔌 Seamless AI Tool Integration: With DINO-X MCP Server, developers can integrate DINO-X's capabilities directly into Cursor, Claude, and other MCP-compatible AI tools, enabling object detection in conversational AI workflows.

Latest News

2025.07.23: We've updated dds-cloudapi-sdk to version 0.5.3, which replaces the previous non-standard mask encoding with the pycocotools-aligned RLE mask format, so masks can be decoded directly with pycocotools. The API also gains a new mask_format = coco_rle parameter. Detailed usage examples are available in dds visualization utils.
2025.06.18: 🚀 DINO-X MCP Server is now available! Integrate DINO-X into Cursor and other MCP-compatible tools. Check dinox-mcp for details.
2025.05.21: For more demo usage, including DINO-X, T-Rex, and DINO-X-SeeK, please see the dds-cloud-api examples.
2025.04.21: Updated to the dds-cloudapi-sdk API V2. The V1 API for DINO-X has been deprecated; please upgrade to the latest dds-cloudapi-sdk with pip install dds-cloudapi-sdk -U to use the DINO-X model. See dds-cloudapi-sdk and our API docs for details about the update.
2025.03.11: We have released the DINO-XSeeK model for detecting objects based on more complex user descriptions. Please refer to RexSeeK for more details; the demo is available here.
2025.01.18: DINO-X achieves a SOTA average mask AP of 51.7 on the Segmentation in the Wild (SGinW) zero-shot track.
2024.12.05: Released the Prompt-Free Anything Detection and Segmentation feature. For API usage and demo visualization, please refer to here. To use the latest features, please install dds-cloudapi-sdk==0.3.3.
2024.12.04: Launched the Open-World Detection and Segmentation feature. For API usage and demo visualization, visit here.
2024.12.03: Support DINO-X with SAM 2 for Open-World Anything Segmentation and Tracking. For more details, check out the Grounded SAM 2 project.
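
The 2025.07.23 entry refers to a COCO-style run-length encoding (RLE) for masks. As a rough illustration of the layout only, the sketch below decodes the uncompressed variant of COCO RLE (a plain list of run lengths) in pure Python. The function name and sample values are made up for this example. Masks returned with mask_format = coco_rle presumably use the compressed string form that pycocotools decodes, so this is an explanation of the idea, not a replacement for pycocotools.

```python
from typing import List


def decode_uncompressed_rle(counts: List[int], height: int, width: int) -> List[List[int]]:
    """Decode a COCO-style uncompressed RLE into a height x width mask of 0/1 values."""
    flat: List[int] = []
    value = 0  # runs alternate between background (0) and foreground (1), starting with 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("run lengths do not add up to height * width")
    # COCO RLE walks the image column by column, so pixel (row, col)
    # is stored at flat index col * height + row.
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


if __name__ == "__main__":
    # Hypothetical 2 x 3 mask: 1 background pixel, 2 foreground pixels, 3 background pixels.
    for row in decode_uncompressed_rle([1, 2, 3], height=2, width=3):
        print(row)
```

The column-major ordering is the detail that most often causes transposed masks when RLE is decoded by hand, which is one reason to prefer pycocotools for real API responses.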

Model Framework

DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it can generate representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions.
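
To make the multi-level outputs concrete, here is a minimal sketch of how a client might hold one detected object with its optional box, mask, keypoints, and caption. The class name, field names, and box convention are assumptions for illustration; this is not the response schema of dds-cloudapi-sdk.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DetectedObject:
    """Illustrative container for one object; NOT the SDK's response schema."""

    category: str
    score: float
    # Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
    bbox: Tuple[float, float, float, float]
    # Segmentation mask, for example a COCO RLE dictionary.
    mask_rle: Optional[dict] = None
    # Pose keypoints as (x, y) pairs.
    keypoints: List[Tuple[float, float]] = field(default_factory=list)
    # Region caption.
    caption: Optional[str] = None

    def available_outputs(self) -> List[str]:
        """List which semantic levels are present for this object."""
        outputs = ["bbox"]
        if self.mask_rle is not None:
            outputs.append("mask")
        if self.keypoints:
            outputs.append("keypoints")
        if self.caption is not None:
            outputs.append("caption")
        return outputs


if __name__ == "__main__":
    obj = DetectedObject("helmet", 0.9, (10.0, 20.0, 110.0, 220.0), caption="a red helmet")
    print(obj.available_outputs())
```

Keeping every level except the box optional mirrors the idea that each perception head can be requested independently.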

Performance

Side-by-Side Performance Comparison with Previous Best Methods

Zero-Shot Performance on Object Detection Benchmarks

Model	COCO (AP box)	LVIS-minival (AP all)	LVIS-minival (AP rare)	LVIS-val (AP all)	LVIS-val (AP rare)
Other Best Open-Set Model	53.4 (OmDet-Turbo)	47.6 (T-Rex2 visual)	45.4 (T-Rex2 visual)	45.3 (T-Rex2 visual)	43.8 (T-Rex2 visual)
DetCLIPv3	-	48.8	49.9	41.4	41.4
Grounding DINO	52.5	27.4	18.1	-	-
T-Rex2 (text)	52.2	54.9	49.2	45.8	42.7
Grounding DINO 1.5 Pro	54.3	55.7	56.1	47.6	44.6
Grounding DINO 1.6 Pro	55.4	57.7	57.5	51.1	51.5
DINO-X Pro	56.0	59.7	63.3	52.4	56.5

Performance: DINO-X Pro achieves SOTA performance on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks.
Effective Long-tail Object Detection: On the LVIS rare classes, DINO-X Pro surpasses the previous SOTA model, Grounding DINO 1.6 Pro, by 5.8 AP on LVIS-minival and 5.0 AP on LVIS-val, demonstrating its strength in long-tailed object detection scenarios.
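
The rare-class gains quoted above can be checked against the table with a few lines of arithmetic. The scores are copied from the LVIS rare columns for Grounding DINO 1.6 Pro and DINO-X Pro; the dictionary layout and function name are just for this sketch.

```python
# AP on LVIS rare classes, copied from the detection table above.
rare_ap = {
    "LVIS-minival": {"Grounding DINO 1.6 Pro": 57.5, "DINO-X Pro": 63.3},
    "LVIS-val": {"Grounding DINO 1.6 Pro": 51.5, "DINO-X Pro": 56.5},
}


def ap_gain(scores, new="DINO-X Pro", old="Grounding DINO 1.6 Pro"):
    """Return the AP improvement of `new` over `old`, rounded to one decimal."""
    return round(scores[new] - scores[old], 1)


gains = {benchmark: ap_gain(scores) for benchmark, scores in rare_ap.items()}

if __name__ == "__main__":
    for benchmark, gain in gains.items():
        print(f"{benchmark}: +{gain} AP on rare classes")
```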

Zero-Shot Performance on Generic Segmentation Benchmarks

Model	COCO (AP mask)	LVIS-minival (AP mask)	LVIS-minival (AP mask rare)	LVIS-val (AP mask)	LVIS-val (AP mask rare)	SGinW (AP mask avg)
Assembled General Perception Model
Grounded HQ-SAM (Base + Huge)	-	-	-	-	-	49.6
Grounded SAM (1.5 Pro + Huge)	44.3	47.7	50.2	41.8	46.0	-
Grounded SAM 2 (1.5 Pro + Large)	44.7	46.2	50.1	40.5	44.6	-
DINO-X Pro + SAM-Huge	44.2	51.2	52.2	-	-	-
Unified Vision Model
DINO-X Pro (Mask Head)	37.9	43.8	46.7	38.5	44.4	51.7

Performance: DINO-X achieves a SOTA average mask AP of 51.7 on the SGinW zero-shot benchmark. It also achieves mask AP scores of 37.9, 43.8, and 38.5 on the COCO, LVIS-minival, and LVIS-val zero-shot instance segmentation benchmarks, respectively. Compared with Grounded SAM and Grounded SAM 2, DINO-X still has a notable performance gap to close on these benchmarks. We will further optimize segmentation performance in a future release.
Efficiency: Unlike the Grounded SAM series, DINO-X improves segmentation efficiency by generating the mask for each region directly, without requiring multiple complex inference steps.
Practical Usage: Users can use the mask function of DINO-X according to their needs. If you need object segmentation and tracking at the same time, we recommend the latest Grounded SAM 2 (DINO-X + SAM 2), which we have already implemented here.

API Usage

Installation

Install the required packages

pip install -r requirements.txt

Note: If you encounter errors with the API, please install the latest version of dds-cloudapi-sdk:

pip install dds-cloudapi-sdk --upgrade

Register on Official Website to Get API Token

First-Time Application: If you are interested in our project and wish to try our algorithm, you first need to apply for an API token through our request API token website.

Request Additional Token Quotas: We now support WeChat Pay as a payment channel, and users can purchase additional API calls through our official platform. If you encounter any issues during the purchase process or have other collaboration needs, feel free to contact us by email: [email protected].

Run local API demos

Open-World Object Detection and Segmentation

Open-world detection means users can detect anything with text prompts. To try this feature, set your API token in demo.py and run the local demo:

python demo.py

After running the local demo, the annotated image will be saved at: ./outputs/open_world_detection
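
If you prefer not to hard-code the token in demo.py, one option is to read it from an environment variable. The sketch below is an assumption about how you might adapt your own copy of the script: the variable name DDS_API_TOKEN and the helper load_api_token are hypothetical and are not part of demo.py or dds-cloudapi-sdk.

```python
import os


def load_api_token(env_var: str = "DDS_API_TOKEN") -> str:
    """Read the API token from an environment variable (variable name is hypothetical)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your API token before running the demo."
        )
    return token


if __name__ == "__main__":
    try:
        token = load_api_token()
        # Avoid printing the token itself; only confirm that it was found.
        print(f"Loaded an API token of length {len(token)}")
    except RuntimeError as err:
        print(err)
```

This keeps the token out of version control; the value returned by the helper can then be passed wherever demo.py currently expects the token string.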

Demo Image Visualization

With the text prompt "wheel . eye . helmet . mouse . mouth . vehicle . steering wheel . ear . nose", we get the following prediction results:

Demo Image	Box Prediction	Mask Prediction
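
The demo prompt above joins category names with " . ". A small helper can build such a prompt from a Python list and split it back into categories. Both function names are hypothetical, and the sketch only mirrors the separator seen in the demo prompt; it makes no claim about which other separators the API accepts.

```python
from typing import Iterable, List


def build_text_prompt(categories: Iterable[str]) -> str:
    """Join category names with ' . ', dropping blanks and repeated names."""
    cleaned = [name.strip() for name in categories if name.strip()]
    unique = list(dict.fromkeys(cleaned))  # keeps the first occurrence, preserves order
    if not unique:
        raise ValueError("at least one category name is required")
    return " . ".join(unique)


def split_text_prompt(prompt: str) -> List[str]:
    """Split a ' . '-separated prompt back into category names."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


if __name__ == "__main__":
    prompt = build_text_prompt(["wheel", "eye", "helmet", "steering wheel", "eye"])
    print(prompt)
    print(split_text_prompt(prompt))
```

Multi-word categories such as "steering wheel" stay intact because only the period acts as a delimiter.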

Prompt-Free Anything Detection and Segmentation

We've implemented a novel prompt-free object detection feature: users do not need to provide any prompt, and DINO-X will automatically recognize, detect, and segment the objects in the provided images. You can try this feature with the following script after setting your API token:

python prompt_free_demo.py

After running the local demo, the annotated image will be saved at: ./outputs/prompt_free_detection_segmentation

Demo Image Visualization

With the specific text prompt "<prompt_free>", we get the following prediction results:

Demo Image	Box Prediction	Mask Prediction

Related Work

Grounding DINO: Strong open-set object detection model.
Grounding DINO 1.5: Previous SOTA open-set detection model.
Grounded-Segment-Anything: Open-set detection and segmentation model by combining Grounding DINO with SAM.
T-Rex/T-Rex2: Generic open-set detection model supporting both text and visual prompts.

LICENSE

DINO-X API License

DINO-X is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

BibTeX

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{ren2024dinoxunifiedvisionmodel,
      title={DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding}, 
      author={Tianhe Ren and Yihao Chen and Qing Jiang and Zhaoyang Zeng and Yuda Xiong and Wenlong Liu and Zhengyu Ma and Junyi Shen and Yuan Gao and Xiaoke Jiang and Xingyu Chen and Zhuheng Song and Yuhong Zhang and Hongjie Huang and Han Gao and Shilong Liu and Hao Zhang and Feng Li and Kent Yu and Lei Zhang},
      year={2024},
      eprint={2411.14347},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.14347}, 
}

@misc{ren2024grounding,
      title={Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection}, 
      author={Tianhe Ren and Qing Jiang and Shilong Liu and Zhaoyang Zeng and Wenlong Liu and Han Gao and Hongjie Huang and Zhengyu Ma and Xiaoke Jiang and Yihao Chen and Yuda Xiong and Hao Zhang and Feng Li and Peijun Tang and Kent Yu and Lei Zhang},
      year={2024},
      eprint={2405.10300},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{jiang2024trex2genericobjectdetection,
      title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy}, 
      author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
      year={2024},
      eprint={2403.14610},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.14610}, 
}

@misc{liu2024groundingdinomarryingdino,
      title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection}, 
      author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Qing Jiang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
      year={2024},
      eprint={2303.05499},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2303.05499}, 
}
