TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
RGB-D Salient Object Detection	NJUD	VST	S-Measure	0.922	# 1
RGB-D Salient Object Detection	NLPR	VST	S-Measure	0.932	# 13
Thermal Image Segmentation	RGB-T-Glass-Segmentation	VST	MAE	0.044	# 7
RGB-D Salient Object Detection	SIP	VST	S-Measure	90.4	# 2
RGB-D Salient Object Detection	SIP	VST	max E-Measure	94.4	# 3
RGB-D Salient Object Detection	SIP	VST	max F-Measure	91.5	# 2
RGB-D Salient Object Detection	SIP	VST	Average MAE	0.040	# 2
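
The MAE rows in the table above report mean absolute error, which for saliency maps is commonly computed as the average per-pixel absolute difference between the predicted map and the ground-truth mask, both scaled to [0, 1]. The following is a minimal sketch of that common definition, not the benchmark's own evaluation code; the function name `mean_absolute_error` and the toy inputs are illustrative assumptions.

```python
def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between two equally sized 2-D maps.

    Both `pred` and `gt` are lists of rows with values in [0, 1]
    (hypothetical toy representation; real pipelines use image arrays).
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("pred and gt must have the same shape")
    # Sum absolute differences over every pixel, then divide by the pixel count.
    total = sum(
        abs(p - g)
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
    )
    count = sum(len(row) for row in pred)
    return total / count


# Toy 2x2 example: predicted saliency vs. binary ground truth.
pred = [[0.9, 0.1], [0.8, 0.0]]
gt = [[1.0, 0.0], [1.0, 0.0]]
mae = mean_absolute_error(pred, gt)  # approximately 0.1
```

Lower MAE is better, so the 0.040 and 0.044 entries sit close to a perfect score of 0.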

Visual Saliency Transformer

ICCV 2021 · Nian Liu, Ni Zhang, Kaiyuan Wan, Ling Shao, Junwei Han

Existing state-of-the-art saliency detection methods rely heavily on CNN-based architectures. Instead, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike the conventional architecture used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.
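
The abstract says the model "takes image patches as inputs" and treats them as a token sequence. As a rough illustration of that first step only, the sketch below splits a 2-D grid into non-overlapping square patches and flattens each one into a token. The non-overlapping split, the function name `image_to_patch_tokens`, and the toy 4x4 input are assumptions made for illustration; the abstract does not specify the patching scheme, and the official repository should be consulted for the actual one.

```python
def image_to_patch_tokens(image, patch_size):
    """Split a 2-D grid (list of rows) into non-overlapping square patches.

    Each patch is flattened row by row into one token; tokens are returned
    in row-major patch order. This is a simplified, hypothetical tokenizer.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect the pixels of one patch into a flat token.
            token = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            tokens.append(token)
    return tokens


# Toy 4x4 single-channel "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
# tokens holds 4 tokens of length 4; the first is [0, 1, 4, 5].
```

A sequence like `tokens` is what a transformer's attention layers would then operate on, letting every patch exchange information with every other patch regardless of spatial distance.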

Code

nnizhang/VST

117

Tasks

Boundary Detection

Decoder

Object Detection

RGB-D Salient Object Detection

Saliency Detection

Salient Object Detection

Thermal Image Segmentation

Datasets

PASCAL-S

DUTS

HKU-IS

NLPR

LFSD

SIP

ReDWeb-S

Results from the Paper

Ranked #1 on RGB-D Salient Object Detection on NJUD

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
