TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Saliency Detection	DHF1K	ViNet	NSS	2.87	# 1
Video Saliency Detection	DHF1K	ViNet	CC	0.51	# 1
Video Saliency Detection	DHF1K	ViNet	s-AUC	0.728	# 1
Video Saliency Detection	DHF1K	ViNet	AUC-J	0.908	# 1
Video Saliency Detection	DIEM	AViNet	CC	0.632	# 1
Video Saliency Detection	Hollywood2	ViNet	CC	0.693	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	SIM	0.627	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	CC	0.733	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	NSS	2.13	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	AUC-J	0.864	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	KLDiv	0.497	# 1
Video Saliency Detection	MSU Video Saliency Prediction	ViNet (dave)	FPS	1.10	# 14
Video Saliency Detection	UCFSports	ViNet	CC	0.673	# 1
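
The CC, NSS, SIM and KLDiv columns above are standard saliency metrics. As a rough guide to reading the numbers, the sketch below computes them for small maps given as nested lists. It follows the commonly used definitions and is not taken from the ViNet repository; the function names and the `eps` constant are assumptions made for illustration. Higher is better for CC, NSS and SIM; lower is better for KLDiv.

```python
import math


def _flatten(saliency_map):
    """Turn a 2D map (list of rows) into a flat list of floats."""
    return [float(v) for row in saliency_map for v in row]


def _standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("map is constant; standardization is undefined")
    return [(v - mean) / std for v in values]


def _to_distribution(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("map must have positive total mass")
    return [v / total for v in values]


def cc(pred, gt):
    """Pearson correlation between a predicted and a ground-truth map."""
    p, g = _standardize(_flatten(pred)), _standardize(_flatten(gt))
    return sum(a * b for a, b in zip(p, g)) / len(p)


def nss(pred, fixations):
    """Mean standardized prediction at fixated pixels (fixation map is binary)."""
    p = _standardize(_flatten(pred))
    hits = [a for a, f in zip(p, _flatten(fixations)) if f > 0]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)


def sim(pred, gt):
    """Histogram intersection of the two maps viewed as distributions."""
    p, g = _to_distribution(_flatten(pred)), _to_distribution(_flatten(gt))
    return sum(min(a, b) for a, b in zip(p, g))


def kldiv(pred, gt, eps=1e-7):
    """KL divergence of the prediction from the ground truth (eps is assumed)."""
    p, g = _to_distribution(_flatten(pred)), _to_distribution(_flatten(gt))
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, g))
```

A prediction identical to the ground truth gives CC and SIM of 1 and a KLDiv close to 0, which is a quick sanity check before scoring real model outputs.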

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avinet-diving-deep-into-audio-visual-saliency/video-saliency-detection-on-dhf1k)](https://paperswithcode.com/sota/video-saliency-detection-on-dhf1k?p=avinet-diving-deep-into-audio-visual-saliency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avinet-diving-deep-into-audio-visual-saliency/video-saliency-detection-on-diem)](https://paperswithcode.com/sota/video-saliency-detection-on-diem?p=avinet-diving-deep-into-audio-visual-saliency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avinet-diving-deep-into-audio-visual-saliency/video-saliency-detection-on-hollywood2)](https://paperswithcode.com/sota/video-saliency-detection-on-hollywood2?p=avinet-diving-deep-into-audio-visual-saliency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avinet-diving-deep-into-audio-visual-saliency/video-saliency-detection-on-msu-video)](https://paperswithcode.com/sota/video-saliency-detection-on-msu-video?p=avinet-diving-deep-into-audio-visual-saliency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avinet-diving-deep-into-audio-visual-saliency/video-saliency-detection-on-ucfsports)](https://paperswithcode.com/sota/video-saliency-detection-on-ucfsports?p=avinet-diving-deep-into-audio-visual-saliency)`

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

11 Dec 2020 · Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of the ViNet architecture that adds audio features to the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the audio it is given. Interestingly, we also observe similar behaviour in the previous state-of-the-art models for audio-visual saliency prediction (STAViS; Tsiami et al., 2020). Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations that incorporate audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
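
The audio-agnostic behaviour described in the abstract can be probed with a simple check: run the same frames through a model with several different audio inputs and compare the predicted maps. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation procedure. The `model(frames, audio)` calling convention, the helper names, and both toy models are hypothetical.

```python
def max_abs_diff(map_a, map_b):
    """Largest per-pixel absolute difference between two 2D maps."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(map_a, map_b)
        for x, y in zip(row_a, row_b)
    )


def is_audio_agnostic(model, frames, audio_variants, tol=1e-6):
    """Return True if the output is the same (within tol) for every audio input.

    `model` is any callable taking (frames, audio) and returning a 2D map;
    this interface is assumed for illustration only.
    """
    outputs = [model(frames, audio) for audio in audio_variants]
    reference = outputs[0]
    return all(max_abs_diff(reference, out) <= tol for out in outputs[1:])


# Toy stand-ins for real networks (hypothetical, for demonstration only).
def visual_only_model(frames, audio):
    # Ignores the audio entirely and returns the last frame as the "map".
    return frames[-1]


def audio_sensitive_model(frames, audio):
    # Scales the last frame by the mean audio amplitude.
    gain = sum(audio) / len(audio)
    return [[v * gain for v in row] for row in frames[-1]]


frames = [[[0.1, 0.2], [0.3, 0.4]], [[0.2, 0.1], [0.4, 0.3]]]
audio_variants = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.25], [0.9, 0.9, 0.9]]

print(is_audio_agnostic(visual_only_model, frames, audio_variants))
print(is_audio_agnostic(audio_sensitive_model, frames, audio_variants))
```

Typical audio variants for such a probe are the original track, silence, and audio taken from an unrelated clip. A trained audio-visual model that passes this check is effectively using only its visual stream.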

Code

samyak0210/ViNet official

Tasks

Action Recognition

Saliency Prediction

Video Saliency Detection

Video Saliency Prediction

Datasets

SumMe

AVE

DHF1K

MSU Video Saliency Prediction

Results from the Paper

Ranked #1 on Video Saliency Detection on MSU Video Saliency Prediction
