Task	Dataset	Model	Metric Name	Metric Value (%)	Global Rank
Zero-Shot Action Recognition	ActivityNet	ResT	Top-1 Accuracy	32.5	#3
Zero-Shot Action Recognition	HMDB51	ResT	Top-1 Accuracy	41.1	#13
Zero-Shot Action Recognition	UCF101	ResT	Top-1 Accuracy	58.7	#13

Cross-modal Representation Learning for Zero-shot Action Recognition

CVPR 2022 · Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu

We present a cross-modal Transformer-based framework that jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline in which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, which encourages the learned visual embedding to be discriminative and more semantically consistent. For zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to compose unseen visual prototypes. Accordingly, the discriminative features in the visual structure can be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic gap, and the hubness problem. Under a rigorous zero-shot setting without pre-training on additional datasets, the experimental results show that our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
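The semantic transfer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (no code has been released). It assumes one plausible reading: each unseen class prototype is a weighted average of seen-class visual prototypes, with weights given by a softmax over the cosine similarity of class-label embeddings. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature is an assumed hyperparameter."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compose_unseen_prototype(unseen_sem, seen_sems, seen_protos, temperature=0.1):
    """Build a visual prototype for an unseen class as a convex combination
    of seen-class visual prototypes, weighted by semantic relatedness."""
    names = list(seen_sems)
    weights = softmax([cosine(unseen_sem, seen_sems[n]) for n in names], temperature)
    dim = len(seen_protos[names[0]])
    return [sum(w * seen_protos[n][d] for w, n in zip(weights, names))
            for d in range(dim)]

def classify(video_feat, unseen_protos):
    """Assign the unseen class whose composed prototype is most similar."""
    return max(unseen_protos, key=lambda name: cosine(video_feat, unseen_protos[name]))

# Toy data (hypothetical): 2-D label embeddings, 3-D visual prototypes.
seen_sems = {"run": [1.0, 0.0], "swim": [0.0, 1.0]}
seen_protos = {"run": [1.0, 0.0, 0.0], "swim": [0.0, 1.0, 0.0]}
unseen_sems = {"jog": [0.9, 0.1], "dive": [0.1, 0.9]}

unseen_protos = {
    name: compose_unseen_prototype(sem, seen_sems, seen_protos)
    for name, sem in unseen_sems.items()
}

prediction = classify([0.8, 0.2, 0.1], unseen_protos)
print(prediction)
```

Because the prototypes are composed in the visual space, classification reduces to a nearest-prototype lookup among unseen classes, which is one way to avoid projecting visual features into the semantic space at inference time.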

Tasks

Action Recognition

Representation Learning

Zero-Shot Action Recognition

Datasets

ImageNet

UCF101

Kinetics

HMDB51

ActivityNet

Sports-1M

Results from the Paper

Ranked #3 on Zero-Shot Action Recognition on ActivityNet
