TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Action Recognition	HMDB51	X-CLIP	Top-1 Accuracy	44.6	# 11
Zero-Shot Action Recognition	Kinetics	X-CLIP	Top-1 Accuracy	65.2	# 8
Zero-Shot Action Recognition	Kinetics	X-CLIP	Top-5 Accuracy	86.1	# 5
Action Classification	Kinetics-400	X-CLIP(ViT-L/14, CLIP)	Acc@1	87.7	# 26
Action Classification	Kinetics-400	X-CLIP(ViT-L/14, CLIP)	Acc@5	97.4	# 24
Action Classification	Kinetics-600	X-CLIP(ViT-L/14, CLIP)	Top-1 Accuracy	88.3	# 20
Action Classification	Kinetics-600	X-CLIP(ViT-L/14, CLIP)	Top-5 Accuracy	97.7	# 13
Zero-Shot Action Recognition	UCF101	X-CLIP	Top-1 Accuracy	72.0	# 12
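
The Top-1 and Top-5 (Acc@1, Acc@5) figures in the table are top-k accuracies: the percentage of test videos whose true class is among the k highest-scoring predictions. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, not code or data from the paper or from microsoft/VideoX.

```python
def top_k_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample (hypothetical input format).
    labels: list of integer class indices, one per sample.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy scores for 4 samples over 3 classes (made up for illustration).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
]
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-5 is never lower than Top-1 on the same predictions, which is consistent with the paired rows in the table above.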

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/expanding-language-image-pretrained-models/zero-shot-action-recognition-on-kinetics)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-kinetics?p=expanding-language-image-pretrained-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/expanding-language-image-pretrained-models/zero-shot-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-hmdb51?p=expanding-language-image-pretrained-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/expanding-language-image-pretrained-models/zero-shot-action-recognition-on-ucf101)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-ucf101?p=expanding-language-image-pretrained-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/expanding-language-image-pretrained-models/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=expanding-language-image-pretrained-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/expanding-language-image-pretrained-models/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=expanding-language-image-pretrained-models)`

Expanding Language-Image Pretrained Models for General Video Recognition

4 Aug 2022 · Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP
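
To make the idea of exchanging information across frames concrete, here is a minimal sketch of scaled dot-product attention applied along the temporal axis. It assumes each frame has already been summarized as one feature vector and uses identity query/key/value projections for brevity. These are simplifying assumptions for illustration; this is not the paper's module, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Mix per-frame vectors with attention over all frames of the same video.

    frame_tokens: list of T vectors (lists of floats), one per frame.
    Returns T vectors; each is a weighted average of all frame vectors,
    with weights from scaled dot-product similarity (identity projections,
    an assumption made to keep the sketch short).
    """
    dim = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in frame_tokens:
        # Similarity of this frame to every frame, including itself.
        logits = [scale * sum(q * k for q, k in zip(query, key)) for key in frame_tokens]
        weights = softmax(logits)
        # Weighted average of the frame vectors along the temporal axis.
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, frame_tokens)) for d in range(dim)]
        )
    return mixed


# Three toy frame vectors (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_tokens = cross_frame_attention(tokens)
```

Each output vector now depends on every frame in the clip, which is the property the abstract attributes to cross-frame attention. A trained model would add learned projections and multiple heads.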

Code

microsoft/VideoX official

930

Tasks

Action Classification

Action Recognition

Video Recognition

Zero-Shot Action Recognition

Zero-shot Generalization

Datasets

ImageNet

UCF101

Kinetics

HMDB51

Kinetics-400

Kinetics-600

Results from the Paper

Ranked #8 on Zero-Shot Action Recognition on Kinetics
