TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Charades	ActionCLIP (ViT-B/16)	MAP	44.3	# 19
Action Recognition In Videos	Kinetics-400	ActionCLIP (ViT-B/16)	Top-1 Accuracy	83.8	# 2
Action Classification	Kinetics-400	ActionCLIP (CLIP-pretrained)	Acc@1	83.8	# 58
Action Classification	Kinetics-400	ActionCLIP (CLIP-pretrained)	Acc@5	97.1	# 31

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionclip-a-new-paradigm-for-video-action/action-recognition-in-videos-on-kinetics-400-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-kinetics-400-1?p=actionclip-a-new-paradigm-for-video-action)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionclip-a-new-paradigm-for-video-action/action-classification-on-charades)](https://paperswithcode.com/sota/action-classification-on-charades?p=actionclip-a-new-paradigm-for-video-action)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionclip-a-new-paradigm-for-video-action/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=actionclip-a-new-paradigm-for-video-action)`

ActionCLIP: A New Paradigm for Video Action Recognition

17 Sep 2021 · Mengmeng Wang, Jiazheng Xing, Yong Liu

The canonical approach to video action recognition treats the task as a classic 1-of-N majority-vote problem for a neural model. Such models are trained to predict a fixed set of predefined categories, which limits their ability to transfer to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model the task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with richer semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. It then makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
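The video-text matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the mean pooling over frame embeddings, the temperature value, and the toy 3-dimensional embeddings are all assumptions made for illustration, and a real system would obtain the embeddings from pre-trained video and text encoders. The sketch only shows how a label can be chosen by similarity to label texts instead of by a fixed 1-of-N output layer.

```python
import math

def build_prompts(labels, template="a video of a person {}"):
    # Hypothetical prompt template; turns each label into a sentence
    # so that label semantics are visible to a text encoder.
    return [template.format(label) for label in labels]

def mean_pool(frame_embeddings):
    # Assumption: frame-level embeddings are averaged into one video embedding.
    n = len(frame_embeddings)
    return [sum(col) / n for col in zip(*frame_embeddings)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(video_embedding, text_embeddings, labels, temperature=0.01):
    # Score the video against every label text and pick the best match.
    # No classifier weights are trained for the label set.
    sims = [cosine_similarity(video_embedding, t) for t in text_embeddings]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy data standing in for encoder outputs (hypothetical values).
labels = ["playing guitar", "riding a bike", "swimming"]
prompts = build_prompts(labels)
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embeddings = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

video_embedding = mean_pool(frame_embeddings)
predicted, probs = zero_shot_classify(video_embedding, text_embeddings, labels)
print(prompts[labels.index(predicted)])
```

Because the label set enters only through the text embeddings, swapping in a new list of labels changes the set of recognizable actions without adding parameters, which is the property the abstract refers to as zero-shot recognition.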

Code

sallymmx/actionclip official

459

towhee-io/towhee

2,972

Tasks

Action Classification

Action Recognition

Action Recognition In Videos

Prompt Engineering

Temporal Action Localization

Text Matching

Zero-Shot Action Recognition

Datasets

UCF101

Kinetics

HMDB51

Kinetics 400

Charades

Results from the Paper

Ranked #2 on Action Recognition In Videos on Kinetics-400
