Model: Intel/dpt-hybrid-midas
The Dense Prediction Transformer (DPT) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in this repository. DPT uses the Vision Transformer (ViT) as its backbone and adds a neck and a head on top for monocular depth estimation.
This repository hosts the "hybrid" version of the model, as described in the paper. DPT-Hybrid differs from DPT in that it uses ViT-hybrid as the backbone and takes some activations from the backbone.
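To make the backbone choice concrete, here is a minimal sketch (our addition, not part of the original card) that inspects the released checkpoint's configuration; the `is_hybrid` and `backbone_out_indices` attributes are assumed from recent versions of the transformers `DPTConfig` class.

```python
# Minimal sketch (assumes a recent transformers release): inspect the
# checkpoint's configuration to confirm the ViT-hybrid backbone and see
# which backbone activations the neck consumes.
from transformers import DPTConfig

config = DPTConfig.from_pretrained("Intel/dpt-hybrid-midas")
print(config.is_hybrid)             # True for the DPT-Hybrid variant
print(config.backbone_out_indices)  # backbone layers whose activations feed the neck
```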
The model card was written jointly by the Hugging Face team and Intel.
Model Detail | Description |
---|---|
Model Authors - Company | Intel |
Date | December 22, 2022 |
Version | 1 |
Type | Computer Vision - Monocular Depth Estimation |
Paper or Other Resources | [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) and the [DPT GitHub repository](https://github.com/isl-org/DPT) |
License | Apache 2.0 |
Questions or Comments | [Community Tab](https://huggingface.co/Intel/dpt-hybrid-midas/discussions) |
Intended Use | Description |
---|---|
Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you. |
Primary intended users | Anyone doing monocular depth estimation |
Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
Here is how to use this model for zero-shot depth estimation on an image:
```python
from PIL import Image
import numpy as np
import requests
import torch

from transformers import DPTForDepthEstimation, DPTFeatureExtractor

model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth.show()
```
For more code examples, please refer to the documentation.
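Alternatively, the same inference can be run through the high-level pipeline API. This is a short sketch assuming a transformers version that ships the `"depth-estimation"` pipeline:

```python
# Sketch using the high-level pipeline API (assumes a transformers version
# that includes the "depth-estimation" pipeline). The pipeline wraps the
# preprocessing, forward pass, and resizing shown step by step above.
from transformers import pipeline

pipe = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"].show()  # PIL image of the predicted depth map
```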
Factors | Description |
---|---|
Groups | Multiple datasets compiled together |
Instrumentation | - |
Environment | Inference completed on Intel Xeon Platinum 8280 CPU @ 2.70GHz with 8 physical cores and an NVIDIA RTX 2080 GPU. |
Card Prompts | Model deployment on alternate hardware and software will change model performance |
Metrics | Description |
---|---|
Model performance measures | Zero-shot Transfer |
Decision thresholds | - |
Approaches to uncertainty and variability | - |
Training and Evaluation Data | Description |
---|---|
Datasets | The dataset is called MIX 6 and contains around 1.4M images. The model was initialized with ImageNet-pretrained weights. |
Motivation | To build a robust monocular depth prediction network |
Preprocessing | "We resize the image such that the longer side is 384 pixels and train on random square crops of size 384. ... We perform random horizontal flips for data augmentation." See Ranftl et al. (2021) for more details. |
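As a rough illustration of the preprocessing quoted in the row above, here is a sketch (our addition, not the authors' training code; padding crops when the resized shorter side is under 384 pixels is our assumption):

```python
# Sketch of the quoted preprocessing: resize so the longer side is 384 px,
# take a random 384x384 square crop (padding if the shorter side is < 384,
# which is our assumption), and randomly flip horizontally.
from PIL import Image
from torchvision import transforms

def resize_longer_side(image: Image.Image, target: int = 384) -> Image.Image:
    w, h = image.size
    scale = target / max(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

train_transform = transforms.Compose([
    transforms.Lambda(lambda img: resize_longer_side(img, 384)),
    transforms.RandomCrop(384, pad_if_needed=True),
    transforms.RandomHorizontalFlip(p=0.5),
])
```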
Model | Training set | DIW WHDR | ETH3D AbsRel | Sintel AbsRel | KITTI δ>1.25 | NYU δ>1.25 | TUM δ>1.25 |
---|---|---|---|---|---|---|---|
DPT - Large | MIX 6 | 10.82 (-13.2%) | 0.089 (-31.2%) | 0.270 (-17.5%) | 8.46 (-64.6%) | 8.32 (-12.9%) | 9.97 (-30.3%) |
DPT - Hybrid | MIX 6 | 11.06 (-11.2%) | 0.093 (-27.6%) | 0.274 (-16.2%) | 11.56 (-51.6%) | 8.69 (-9.0%) | 10.89 (-23.2%) |
MiDaS | MIX 6 | 12.95 (+3.9%) | 0.116 (-10.5%) | 0.329 (+0.5%) | 16.08 (-32.7%) | 8.71 (-8.8%) | 12.51 (-12.5%) |
MiDaS [30] | MIX 5 | 12.46 | 0.129 | 0.327 | 23.90 | 9.55 | 14.29 |
Li [22] | MD [22] | 23.15 | 0.181 | 0.385 | 36.29 | 27.52 | 29.54 |
Li [21] | MC [21] | 26.52 | 0.183 | 0.405 | 47.94 | 18.57 | 17.71 |
Wang [40] | WS [40] | 19.09 | 0.205 | 0.390 | 31.92 | 29.57 | 20.18 |
Xian [45] | RW [45] | 14.59 | 0.186 | 0.422 | 34.08 | 27.00 | 25.02 |
Casser [5] | CS [8] | 32.80 | 0.235 | 0.422 | 21.15 | 39.58 | 37.18 |
Table 1. Comparison to the state of the art on monocular depth estimation. We evaluate zero-shot cross-dataset transfer according to the protocol defined in [30]. Relative performance is computed with respect to the original MiDaS model [30]. Lower is better for all metrics. (Ranftl et al., 2021)
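For readers unfamiliar with the column metrics, the sketch below (our addition, not from the card) computes absolute relative error (AbsRel) and the δ>1.25 error rate; note that the paper's protocol also aligns prediction scale and shift to the ground truth before measuring, which is omitted here.

```python
# Sketch of two Table 1 metrics (our addition; the paper's protocol also
# aligns prediction scale/shift to ground truth first, omitted here).
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels. Lower is better."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta_error(pred: np.ndarray, gt: np.ndarray, threshold: float = 1.25) -> float:
    """Percentage of valid pixels with max(pred/gt, gt/pred) > threshold. Lower is better."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(100.0 * np.mean(ratio > threshold))
```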
Ethical Considerations | Description |
---|---|
Data | The training data come from multiple image datasets compiled together. |
Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of monocular depth image datasets. |
Mitigations | No additional risk mitigation strategies were considered during model development. |
Risks and harms | The extent of the risks involved in using the model remains unknown. |
Use cases | - |
Caveats and Recommendations |
---|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. There are no additional caveats or recommendations for this model. |
```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author     = {Ren{\'{e}} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
  title      = {Vision Transformers for Dense Prediction},
  journal    = {CoRR},
  volume     = {abs/2103.13413},
  year       = {2021},
  url        = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint     = {2103.13413},
  timestamp  = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```