TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
GZSL Video Classification	ActivityNet-GZSL(main)	CJME	HM	5.12	# 7
GZSL Video Classification	ActivityNet-GZSL(main)	CJME	ZSL	5.84	# 6
GZSL Video Classification	UCF-GZSL(main)	CJME	HM	12.48	# 7
GZSL Video Classification	UCF-GZSL(main)	CJME	ZSL	8.29	# 7
GZSL Video Classification	VGGSound-GZSL(main)	CJME	HM	6.17	# 5
GZSL Video Classification	VGGSound-GZSL(main)	CJME	ZSL	5.16	# 6
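
The table above does not define HM. By the usual GZSL convention it is the harmonic mean of seen-class and unseen-class accuracy, and that reading is an assumption here, not something this page states. A minimal sketch of the computation under that assumption follows; the function name and the input values are hypothetical and are not taken from the paper's results.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Assumes the common GZSL definition HM = 2 * S * U / (S + U).
    """
    if seen_acc < 0 or unseen_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is taken as zero.
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Hypothetical accuracies (in percent), for illustration only.
hm = harmonic_mean(10.0, 5.0)
print(f"HM = {hm:.2f}")
```

The harmonic mean is pulled toward the lower of the two accuracies, so a model that does well only on seen classes still gets a low HM.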

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/coordinated-joint-multimodal-embeddings-for/gzsl-video-classification-on-vggsound-gzsl-1)](https://paperswithcode.com/sota/gzsl-video-classification-on-vggsound-gzsl-1?p=coordinated-joint-multimodal-embeddings-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/coordinated-joint-multimodal-embeddings-for/gzsl-video-classification-on-activitynet-gzsl-1)](https://paperswithcode.com/sota/gzsl-video-classification-on-activitynet-gzsl-1?p=coordinated-joint-multimodal-embeddings-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/coordinated-joint-multimodal-embeddings-for/gzsl-video-classification-on-ucf-gzsl-main)](https://paperswithcode.com/sota/gzsl-video-classification-on-ucf-gzsl-main?p=coordinated-joint-multimodal-embeddings-for)`

Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

19 Oct 2019 · Kranti Kumar Parida, Neeraj Matiyali, Tanaya Guha, Gaurav Sharma

We present an audio-visual multimodal approach for the task of zeroshot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large-scale audio event dataset. We empirically show that performance improves when the audio modality is added, for both zeroshot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qualitative validation of the modality-specific attention, which also successfully generalizes to unseen test classes.
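
To illustrate the idea of modality attention mentioned in the abstract, here is a minimal sketch of how two per-modality scores could be turned into attention weights and used to pick a dominant modality. It is not the authors' architecture. The function names, the softmax weighting, and the weighted-sum fusion are all assumptions made for illustration, and in a real system the scores would come from a learned network rather than being passed in by hand.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(
    audio_emb: List[float],
    video_emb: List[float],
    audio_score: float,
    video_score: float,
) -> Tuple[List[float], str]:
    """Weight two embeddings by attention and report the dominant modality.

    Hypothetical helper: the scores stand in for the output of an
    attention network, which this sketch does not implement.
    """
    if len(audio_emb) != len(video_emb):
        raise ValueError("embeddings must have the same dimension")
    w_audio, w_video = softmax([audio_score, video_score])
    fused = [w_audio * a + w_video * v for a, v in zip(audio_emb, video_emb)]
    # Ties are resolved in favour of audio, an arbitrary choice.
    dominant = "audio" if w_audio >= w_video else "video"
    return fused, dominant


# Toy embeddings and scores, for illustration only.
fused, dominant = fuse_modalities([1.0, 0.0], [0.0, 1.0], 2.0, 0.0)
print(dominant, [round(x, 3) for x in fused])
```

Because the weights sum to one, the fused vector stays on the same scale as the inputs, and the larger weight gives a simple readout of which modality the model relies on for a given video.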

Code

No code implementations yet.

Tasks

General Classification

GZSL Video Classification

Retrieval

Datasets

ActivityNet

Results from the Paper

Ranked #5 on GZSL Video Classification on VGGSound-GZSL(main)

Methods

Test
