TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Monocular Depth Estimation	NYU-Depth V2	VPD	RMSE	0.254	# 12
Monocular Depth Estimation	NYU-Depth V2	VPD	absolute relative error	0.069	# 12
Monocular Depth Estimation	NYU-Depth V2	VPD	Delta < 1.25	0.964	# 10
Monocular Depth Estimation	NYU-Depth V2	VPD	Delta < 1.25^2	0.995	# 11
Monocular Depth Estimation	NYU-Depth V2	VPD	Delta < 1.25^3	0.999	# 4
Monocular Depth Estimation	NYU-Depth V2	VPD	log 10	0.030	# 12
Referring Expression Segmentation	RefCoCo val	VPD	Overall IoU	73.25	# 7
Referring Expression Segmentation	RefCoCo val	VPD	Overall IoU	73.25	# 10
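
For readers unfamiliar with the metric names in the table above, the following is a minimal sketch of how these quantities are commonly computed. It uses the standard definitions from the depth estimation and referring segmentation literature, not code taken from the VPD repository. The function names `depth_metrics` and `overall_iou` are hypothetical, and the sketch assumes that predicted depths are strictly positive and that ground-truth pixels with value 0 are invalid.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels.

    Assumption: pred is strictly positive wherever gt is valid.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumed convention: 0 marks missing ground truth
    pred, gt = pred[valid], gt[valid]

    # Symmetric ratio used by the threshold accuracies (Delta < 1.25^k).
    ratio = np.maximum(pred / gt, gt / pred)

    return {
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }


def overall_iou(pred_masks, gt_masks):
    """Overall IoU in percent: total intersection over total union,
    accumulated across all samples rather than averaged per sample."""
    inter, union = 0, 0
    for p, g in zip(pred_masks, gt_masks):
        p = np.asarray(p, dtype=bool)
        g = np.asarray(g, dtype=bool)
        inter += int(np.logical_and(p, g).sum())
        union += int(np.logical_or(p, g).sum())
    return 100.0 * inter / union if union > 0 else 0.0
```

Lower is better for RMSE, absolute relative error and log 10; higher is better for the Delta thresholds and Overall IoU.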

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unleashing-text-to-image-diffusion-models-for-1/referring-expression-segmentation-on-refcoco-7)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco-7?p=unleashing-text-to-image-diffusion-models-for-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unleashing-text-to-image-diffusion-models-for-1/referring-expression-segmentation-on-refcoco)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco?p=unleashing-text-to-image-diffusion-models-for-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unleashing-text-to-image-diffusion-models-for-1/monocular-depth-estimation-on-nyu-depth-v2)](https://paperswithcode.com/sota/monocular-depth-estimation-on-nyu-depth-v2?p=unleashing-text-to-image-diffusion-models-for-1)`

Unleashing Text-to-Image Diffusion Models for Visual Perception

ICCV 2023 · Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

Diffusion models (DMs) have become the new trend in generative models and have demonstrated a powerful capacity for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
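
The abstract mentions refining the text features with an adapter but does not spell out its form. As an illustration only, the sketch below shows one common way such an adapter can be built: a small two-layer MLP whose output is added back to the frozen text features through a small scale factor, so that the refined features start close to the pre-trained ones. The function name `text_adapter`, the layer sizes, the ReLU activation and the default `gamma` are all assumptions made for this sketch, not details confirmed by this page; the official repository should be consulted for the actual design.

```python
import numpy as np


def text_adapter(text_feats, w1, b1, w2, b2, gamma=1e-4):
    """Hypothetical residual adapter: feats + gamma * MLP(feats).

    text_feats: (num_tokens, dim) array of frozen text-encoder features.
    w1, b1, w2, b2: weights of an assumed two-layer MLP (dim -> hidden -> dim).
    gamma: assumed small scale so the output starts near the input.
    """
    hidden = np.maximum(text_feats @ w1 + b1, 0.0)  # ReLU (assumed activation)
    return text_feats + gamma * (hidden @ w2 + b2)


# Example with random placeholder weights (illustrative shapes only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8))
w1, b1 = rng.standard_normal((8, 16)), np.zeros(16)
w2, b2 = rng.standard_normal((16, 8)), np.zeros(8)
refined = text_adapter(feats, w1, b1, w2, b2)
```

With a small `gamma`, `refined` differs only slightly from `feats`, which is the property a residual adapter of this kind is meant to provide at the start of fine-tuning.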


Code


wl-zhao/VPD official

462

open-mmlab/mmsegmentation

7,406

Tasks


Denoising

Depth Estimation

Image Segmentation

Monocular Depth Estimation

Referring Expression Segmentation

Segmentation

Semantic Segmentation

Datasets

ADE20K

NYUv2

RefCOCO

LAION-5B

Results from the Paper


Ranked #7 on Referring Expression Segmentation on RefCoCo val



Methods


AutoEncoder • Denoising Autoencoder • Diffusion
