All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

Unlike language tasks, where the output space is usually a fixed set of tokens, the output spaces of visual tasks are more varied, making it difficult to build a unified model for diverse visual tasks. In this paper, we seek to unify the output space of visual tasks so that a single unified model can serve them as well. To this end, we demonstrate one model that simultaneously handles two representative visual tasks, instance segmentation and depth estimation, whose outputs are discrete/fixed-length and continuous/varied-length, respectively. We propose several new techniques that account for the particularities of visual tasks: 1) Soft token. We employ soft tokens to represent the task output. Unlike the hard tokens of a standard VQ-VAE, which are assigned one-hot to discrete codebook entries, a soft token is assigned softly over the codebook embeddings. Soft tokens improve the accuracy of both next-token inference and the decoding of the task output; 2) Mask augmentation. Many visual tasks have corrupted, undefined, or invalid values in their label annotations, e.g., the occluded areas of depth maps. We show that a mask augmentation technique greatly benefits these tasks. With these new techniques and other designs, we show that the proposed general-purpose task solver performs both instance segmentation and depth estimation well. In particular, we achieve 0.279 RMSE on NYUv2 depth estimation, setting a new record on this benchmark. The general-purpose task solver, dubbed AiT, is available at \url{https://github.com/SwinTransformer/AiT}.
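For concreteness, the two techniques can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the function names, the softmax temperature, and the masking block size and ratio are illustrative choices.

```python
import torch
import torch.nn.functional as F

def soft_tokenize(logits: torch.Tensor, codebook: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """Soft token: weight all codebook embeddings by the predicted
    distribution instead of committing to the one-hot argmax code.

    logits:   (..., V) unnormalized scores over the V codebook entries
    codebook: (V, D)   VQ-VAE embedding table
    returns:  (..., D) soft token embedding
    """
    probs = F.softmax(logits / temperature, dim=-1)  # soft assignment
    return probs @ codebook                          # convex combination of codes

def hard_tokenize(logits: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Hard-token baseline: one-hot assignment to the most likely code."""
    return codebook[logits.argmax(dim=-1)]

def mask_augment(label: torch.Tensor, invalid_value: float = 0.0,
                 mask_ratio: float = 0.5, block: int = 16) -> torch.Tensor:
    """Mask augmentation: randomly overwrite blocks of a label map with an
    invalid value, mimicking the undefined regions (e.g., occlusions) found
    in real annotations. Block size and ratio here are arbitrary.

    label: (..., H, W) with H and W divisible by `block`
    """
    h, w = label.shape[-2:]
    mask = torch.rand(h // block, w // block) < mask_ratio
    mask = mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    out = label.clone()
    out[..., mask] = invalid_value  # 2D boolean mask indexes the last two dims
    return out
```

Because the soft token embedding is differentiable and retains the full predicted distribution, it can plausibly be fed both to the next autoregressive step and to the VQ-VAE decoder, which matches the abstract's claim that it helps both next-token inference and output decoding.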


Datasets


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Monocular Depth Estimation | NYU-Depth V2 | AiT-P (SwinV2-L) | RMSE | 0.275 | #14 |
| | | | Absolute relative error | 0.076 | #16 |
| | | | Delta < 1.25 | 0.954 | #14 |
| | | | Delta < 1.25^2 | 0.994 | #14 |
| | | | Delta < 1.25^3 | 0.999 | #4 |
| | | | log10 | 0.033 | #16 |
