Add Prompt Depth Anything Model #35401

Open
wants to merge 14 commits into
base: main
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -689,6 +689,8 @@
title: NAT
- local: model_doc/poolformer
title: PoolFormer
- local: model_doc/prompt_depth_anything
title: Prompt Depth Anything
- local: model_doc/pvt
title: Pyramid Vision Transformer (PVT)
- local: model_doc/pvt_v2
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -275,6 +275,7 @@ Flax), PyTorch, and/or TensorFlow.
| [PLBart](model_doc/plbart) | ✅ | ❌ | ❌ |
| [PoolFormer](model_doc/poolformer) | ✅ | ❌ | ❌ |
| [Pop2Piano](model_doc/pop2piano) | ✅ | ❌ | ❌ |
| [PromptDepthAnything](model_doc/prompt_depth_anything) | ✅ | ❌ | ❌ |
| [ProphetNet](model_doc/prophetnet) | ✅ | ❌ | ❌ |
| [PVT](model_doc/pvt) | ✅ | ❌ | ❌ |
| [PVTv2](model_doc/pvt_v2) | ✅ | ❌ | ❌ |
93 changes: 93 additions & 0 deletions docs/source/en/model_doc/prompt_depth_anything.md
@@ -0,0 +1,93 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Prompt Depth Anything

## Overview

The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://promptda.github.io/) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang.


The abstract from the paper is as follows:

*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*

<img src="https://promptda.github.io/assets/teaser.jpg"
alt="drawing" width="600"/>

Contributor: Feel free to open a PR on this repo, specifically this folder: https://huggingface.co/datasets/huggingface/documentation-images/tree/main/transformers/model_doc to add a prompt_depth_anything_architecture.jpg picture

Author: Hi, thanks for your kind help! I have uploaded the image and opened a PR.
https://huggingface.co/datasets/huggingface/documentation-images/discussions/408

Author: @NielsRogge Could you please help merge this PR into the documentation-images repo? Thank you!

Contributor: Thanks, merged

<small> Prompt Depth Anything overview. Taken from the <a href="https://promptda.github.io/">original paper</a>.</small>

## Usage example

The transformers library allows you to use the model with just a few lines of code:

```python
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/promptda_vits_hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/promptda_vits_hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # convert the prediction to a 16-bit depth image in millimeters for visualization
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16")) # mm
```
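
The last step of the example scales the prediction by 1000 and casts it to `uint16`, which stores depth in millimeters (assuming the model output is in meters, as the `# mm` comment suggests). Below is a minimal NumPy-only sketch of that conversion. The helper name `depth_to_uint16_mm` is hypothetical, and the rounding and clipping steps are additions that are not in the snippet above: they guard against wraparound for depths beyond 65.535 m, which a plain `astype("uint16")` would not.

```python
import numpy as np


def depth_to_uint16_mm(depth_m: np.ndarray) -> np.ndarray:
    """Convert a float depth map in meters to uint16 millimeters (hypothetical helper)."""
    depth_mm = np.asarray(depth_m, dtype=np.float64) * 1000.0
    # uint16 holds 0..65535, so round and clip before casting to avoid wraparound
    max_mm = np.iinfo(np.uint16).max
    depth_mm = np.clip(np.rint(depth_mm), 0, max_mm)
    return depth_mm.astype(np.uint16)


# Toy depth map in meters; the last value exceeds the uint16 range once in millimeters
depth_m = np.array([[0.5, 1.234], [3.0, 70.0]])
depth_mm = depth_to_uint16_mm(depth_m)
```

The resulting array can be passed to `Image.fromarray` in the same way as in the example above.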

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Prompt Depth Anything.

- [Prompt Depth Anything Demo](https://huggingface.co/spaces/depth-anything/PromptDA)
- [Prompt Depth Anything Interactive Results](https://promptda.github.io/interactive.html)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## PromptDepthAnythingConfig

[[autodoc]] PromptDepthAnythingConfig

## PromptDepthAnythingForDepthEstimation

[[autodoc]] PromptDepthAnythingForDepthEstimation
- forward

## PromptDepthAnythingImageProcessor

[[autodoc]] PromptDepthAnythingImageProcessor
- preprocess
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -689,6 +689,7 @@
"models.plbart": ["PLBartConfig"],
"models.poolformer": ["PoolFormerConfig"],
"models.pop2piano": ["Pop2PianoConfig"],
"models.prompt_depth_anything": ["PromptDepthAnythingConfig"],
"models.prophetnet": [
"ProphetNetConfig",
"ProphetNetTokenizer",
@@ -1246,6 +1247,7 @@
_import_structure["models.pix2struct"].extend(["Pix2StructImageProcessor"])
_import_structure["models.pixtral"].append("PixtralImageProcessor")
_import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
_import_structure["models.prompt_depth_anything"].extend(["PromptDepthAnythingImageProcessor"])
_import_structure["models.pvt"].extend(["PvtImageProcessor"])
_import_structure["models.qwen2_vl"].extend(["Qwen2VLImageProcessor"])
_import_structure["models.rt_detr"].extend(["RTDetrImageProcessor"])
@@ -3181,6 +3183,12 @@
"Pop2PianoPreTrainedModel",
]
)
_import_structure["models.prompt_depth_anything"].extend(
[
"PromptDepthAnythingForDepthEstimation",
"PromptDepthAnythingPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend(
[
"ProphetNetDecoder",
@@ -5682,6 +5690,7 @@
from .models.pop2piano import (
Pop2PianoConfig,
)
from .models.prompt_depth_anything import PromptDepthAnythingConfig
from .models.prophetnet import (
ProphetNetConfig,
ProphetNetTokenizer,
@@ -6260,6 +6269,7 @@
PoolFormerFeatureExtractor,
PoolFormerImageProcessor,
)
from .models.prompt_depth_anything import PromptDepthAnythingImageProcessor
from .models.pvt import PvtImageProcessor
from .models.qwen2_vl import Qwen2VLImageProcessor
from .models.rt_detr import RTDetrImageProcessor
@@ -7819,6 +7829,10 @@
Pop2PianoForConditionalGeneration,
Pop2PianoPreTrainedModel,
)
from .models.prompt_depth_anything import (
PromptDepthAnythingForDepthEstimation,
PromptDepthAnythingPreTrainedModel,
)
from .models.prophetnet import (
ProphetNetDecoder,
ProphetNetEncoder,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -207,6 +207,7 @@
plbart,
poolformer,
pop2piano,
prompt_depth_anything,
prophetnet,
pvt,
pvt_v2,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -227,6 +227,7 @@
("plbart", "PLBartConfig"),
("poolformer", "PoolFormerConfig"),
("pop2piano", "Pop2PianoConfig"),
("prompt_depth_anything", "PromptDepthAnythingConfig"),
("prophetnet", "ProphetNetConfig"),
("pvt", "PvtConfig"),
("pvt_v2", "PvtV2Config"),
@@ -554,6 +555,7 @@
("plbart", "PLBart"),
("poolformer", "PoolFormer"),
("pop2piano", "Pop2Piano"),
("prompt_depth_anything", "PromptDepthAnything"),
("prophetnet", "ProphetNet"),
("pvt", "PVT"),
("pvt_v2", "PVTv2"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -123,6 +123,7 @@
("pix2struct", ("Pix2StructImageProcessor",)),
("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
("poolformer", ("PoolFormerImageProcessor",)),
("prompt_depth_anything", ("PromptDepthAnythingImageProcessor",)),
("pvt", ("PvtImageProcessor",)),
("pvt_v2", ("PvtImageProcessor",)),
("qwen2_vl", ("Qwen2VLImageProcessor",)),
1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -893,6 +893,7 @@
("depth_anything", "DepthAnythingForDepthEstimation"),
("dpt", "DPTForDepthEstimation"),
("glpn", "GLPNForDepthEstimation"),
("prompt_depth_anything", "PromptDepthAnythingForDepthEstimation"),
("zoedepth", "ZoeDepthForDepthEstimation"),
]
)
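
The entry added above registers the `prompt_depth_anything` model type with the depth estimation auto class, which is what lets `AutoModelForDepthEstimation.from_pretrained` in the usage example resolve to `PromptDepthAnythingForDepthEstimation`. The following is a simplified sketch of that lookup idea under stated assumptions, not the actual transformers implementation: the dictionary copies only the entries visible in this hunk, and `resolve_model_class_name` is a hypothetical helper.

```python
from collections import OrderedDict

# Toy copy of the mapping entries visible in the diff hunk above (the real mapping may hold more)
DEPTH_ESTIMATION_MAPPING_NAMES = OrderedDict(
    [
        ("depth_anything", "DepthAnythingForDepthEstimation"),
        ("dpt", "DPTForDepthEstimation"),
        ("glpn", "GLPNForDepthEstimation"),
        ("prompt_depth_anything", "PromptDepthAnythingForDepthEstimation"),
        ("zoedepth", "ZoeDepthForDepthEstimation"),
    ]
)


def resolve_model_class_name(model_type: str) -> str:
    """Return the class name registered for a config's model_type (hypothetical helper)."""
    try:
        return DEPTH_ESTIMATION_MAPPING_NAMES[model_type]
    except KeyError:
        known = ", ".join(DEPTH_ESTIMATION_MAPPING_NAMES)
        raise ValueError(f"Unrecognized model type {model_type!r}; expected one of: {known}") from None


class_name = resolve_model_class_name("prompt_depth_anything")
```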
31 changes: 31 additions & 0 deletions src/transformers/models/prompt_depth_anything/__init__.py
@@ -0,0 +1,31 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_prompt_depth_anything import PromptDepthAnythingConfig
    from .image_processing_prompt_depth_anything import PromptDepthAnythingImageProcessor
    from .modeling_prompt_depth_anything import (
        PromptDepthAnythingForDepthEstimation,
        PromptDepthAnythingPreTrainedModel,
    )
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
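
The `else` branch above swaps the package entry in `sys.modules` for a `_LazyModule`, so the submodules are imported only when one of their names is first accessed. The sketch below illustrates that general pattern with the standard library only. `ToyLazyModule` and its attribute map are hypothetical and much simpler than the real `_LazyModule`, which derives its import structure from the package files.

```python
import importlib
import types


class ToyLazyModule(types.ModuleType):
    """Module-like object that imports the providing module on first attribute access."""

    def __init__(self, name: str, attr_to_module: dict):
        super().__init__(name)
        # Maps an exported attribute name to the module that defines it
        self._attr_to_module = attr_to_module

    def __getattr__(self, attr: str):
        # Only called when normal lookup fails, i.e. before the attribute has been cached
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Nothing is imported until an attribute is requested
lazy = ToyLazyModule("toy_lazy", {"dumps": "json", "sqrt": "math"})
encoded = lazy.dumps({"a": 1})
```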
@@ -0,0 +1,159 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/prompt_depth_anything/modular_prompt_depth_anything.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_prompt_depth_anything.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

import copy

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto.configuration_auto import CONFIG_MAPPING


logger = logging.get_logger(__name__)


class PromptDepthAnythingConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`PromptDepthAnythingForDepthEstimation`]. It is used to instantiate a PromptDepthAnything
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the PromptDepthAnything
    [LiheYoung/depth-anything-small-hf](https://huggingface.co/LiheYoung/depth-anything-small-hf) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        backbone_config (`Union[Dict[str, Any], PretrainedConfig]`, *optional*):
            The configuration of the backbone model. Only used in case you want to leverage the [`AutoBackbone`]
            API.
        backbone (`str`, *optional*):
            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
            Whether to use pretrained weights for the backbone.
        use_timm_backbone (`bool`, *optional*, defaults to `False`):
            Whether or not to use the `timm` library for the backbone. If set to `False`, will use the [`AutoBackbone`]
            API.
        backbone_kwargs (`dict`, *optional*):
            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
        patch_size (`int`, *optional*, defaults to 14):
            The size of the patches to extract from the backbone features.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        reassemble_hidden_size (`int`, *optional*, defaults to 384):
            The number of input channels of the reassemble layers.
        reassemble_factors (`List[float]`, *optional*, defaults to `[4, 2, 1, 0.5]`):
            The up/downsampling factors of the reassemble layers.
        neck_hidden_sizes (`List[int]`, *optional*, defaults to `[48, 96, 192, 384]`):
            The hidden sizes to project to for the feature maps of the backbone.
        fusion_hidden_size (`int`, *optional*, defaults to 64):
            The number of channels before fusion.
        head_in_index (`int`, *optional*, defaults to -1):
            The index of the features to use in the depth estimation head.
        head_hidden_size (`int`, *optional*, defaults to 32):
            The number of output channels in the second convolution of the depth estimation head.
        depth_estimation_type (`str`, *optional*, defaults to `"relative"`):
            The type of depth estimation to use. Can be one of `["relative", "metric"]`.
        max_depth (`float`, *optional*):
            The maximum depth to use for the "metric" depth estimation head. 20 should be used for indoor models
            and 80 for outdoor models. For "relative" depth estimation, this value is ignored. When not set, it
            falls back to 1.

    Example:

    ```python
    >>> from transformers import PromptDepthAnythingConfig, PromptDepthAnythingForDepthEstimation

    >>> # Initializing a PromptDepthAnything small style configuration
    >>> configuration = PromptDepthAnythingConfig()

    >>> # Initializing a model from the PromptDepthAnything small style configuration
    >>> model = PromptDepthAnythingForDepthEstimation(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "prompt_depth_anything"

    def __init__(
        self,
        backbone_config=None,
        backbone=None,
        use_pretrained_backbone=False,
        use_timm_backbone=False,
        backbone_kwargs=None,
        patch_size=14,
        initializer_range=0.02,
        reassemble_hidden_size=384,
        reassemble_factors=[4, 2, 1, 0.5],
        neck_hidden_sizes=[48, 96, 192, 384],
        fusion_hidden_size=64,
        head_in_index=-1,
        head_hidden_size=32,
        depth_estimation_type="relative",
        max_depth=None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        if backbone_config is None and backbone is None:
            logger.info("`backbone_config` is `None`. Initializing the config with the default `Dinov2` backbone.")
            backbone_config = CONFIG_MAPPING["dinov2"](
                image_size=518,
                hidden_size=384,
                num_attention_heads=6,
                out_indices=[9, 10, 11, 12],
                apply_layernorm=True,
                reshape_hidden_states=False,
            )
        elif isinstance(backbone_config, dict):
            backbone_model_type = backbone_config.get("model_type")
            config_class = CONFIG_MAPPING[backbone_model_type]
            backbone_config = config_class.from_dict(backbone_config)

        verify_backbone_config_arguments(
            use_timm_backbone=use_timm_backbone,
            use_pretrained_backbone=use_pretrained_backbone,
            backbone=backbone,
            backbone_config=backbone_config,
            backbone_kwargs=backbone_kwargs,
        )

        self.backbone_config = backbone_config
        self.backbone = backbone
        self.use_pretrained_backbone = use_pretrained_backbone
        self.use_timm_backbone = use_timm_backbone
        self.backbone_kwargs = backbone_kwargs
        self.reassemble_hidden_size = reassemble_hidden_size
        self.patch_size = patch_size
        self.initializer_range = initializer_range
        self.reassemble_factors = reassemble_factors
        self.neck_hidden_sizes = neck_hidden_sizes
        self.fusion_hidden_size = fusion_hidden_size
        self.head_in_index = head_in_index
        self.head_hidden_size = head_hidden_size
        if depth_estimation_type not in ["relative", "metric"]:
            raise ValueError("depth_estimation_type must be one of ['relative', 'metric']")
        self.depth_estimation_type = depth_estimation_type
        self.max_depth = max_depth if max_depth else 1

    def to_dict(self):
        """
        Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].

        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)

        if output["backbone_config"] is not None:
            output["backbone_config"] = self.backbone_config.to_dict()

        output["model_type"] = self.__class__.model_type
        return output


__all__ = ["PromptDepthAnythingConfig"]
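
Two details of the configuration class are easy to miss when reading the diff. First, `max_depth if max_depth else 1` is a truthiness check, so both `None` and `0` fall back to 1. Second, `to_dict` replaces the nested backbone config object with its dictionary form, which keeps the result JSON-serializable. The sketch below reproduces just that logic in plain Python so it runs without transformers. `ToyBackboneConfig` and `ToyDepthConfig` are hypothetical stand-ins, not classes from the library.

```python
import copy
import json


class ToyBackboneConfig:
    """Stand-in for a nested backbone config (hypothetical, not a transformers class)."""

    def __init__(self, hidden_size=384):
        self.hidden_size = hidden_size

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class ToyDepthConfig:
    """Mirrors only the validation, fallback, and serialization logic of the config above."""

    model_type = "toy_depth"

    def __init__(self, backbone_config=None, depth_estimation_type="relative", max_depth=None):
        if depth_estimation_type not in ["relative", "metric"]:
            raise ValueError("depth_estimation_type must be one of ['relative', 'metric']")
        self.backbone_config = backbone_config
        self.depth_estimation_type = depth_estimation_type
        # Truthiness check: both None and 0 fall back to 1
        self.max_depth = max_depth if max_depth else 1

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Swap the nested config object for a plain dict so the result is JSON-serializable
        if output["backbone_config"] is not None:
            output["backbone_config"] = self.backbone_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


config = ToyDepthConfig(backbone_config=ToyBackboneConfig(), depth_estimation_type="metric", max_depth=20)
serialized = json.dumps(config.to_dict(), sort_keys=True)
```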