Frozen CLIP Models are Efficient Video Learners

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
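The sketch below illustrates the overall recipe the abstract describes: a frozen CLIP image encoder produces per-frame features, and only a small Transformer decoder with a learnable query token (plus a classification head) is trained on top. It is a minimal, simplified reading of the idea, not the paper's exact implementation: the class names (EVL, EVLDecoderLayer), the hidden size, the depth, the use of a depthwise temporal convolution as the "local temporal module", and the assumption that the backbone returns patch tokens of shape (frames, tokens, dim) are all illustrative choices; the official repository also exploits attention maps from adjacent frames, which this sketch omits.

```python
# Minimal PyTorch sketch of training a lightweight decoder on frozen CLIP
# frame features. All module names, dimensions, and the temporal-conv choice
# are assumptions for illustration, not the authors' exact configuration.
import torch
import torch.nn as nn


class EVLDecoderLayer(nn.Module):
    """One decoder layer: a learned query cross-attends to frozen frame features
    after a lightweight temporal mixing step over adjacent frames."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Local temporal module (one simple choice): depthwise 1D conv over time.
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # query: (B, 1, D); feats: (B, T, N, D) frozen per-frame patch tokens.
        B, T, N, D = feats.shape
        # Mix information across adjacent frames, token by token.
        f = feats.permute(0, 2, 3, 1).reshape(B * N, D, T)              # (B*N, D, T)
        feats = feats + self.temporal(f).reshape(B, N, D, T).permute(0, 3, 1, 2)
        kv = self.norm_kv(feats.reshape(B, T * N, D))                   # flatten space-time tokens
        q = self.norm_q(query)
        query = query + self.cross_attn(q, kv, kv)[0]                   # collect spatial features
        return query + self.mlp(query)


class EVL(nn.Module):
    def __init__(self, clip_visual: nn.Module, dim: int = 768, depth: int = 4, num_classes: int = 400):
        super().__init__()
        self.backbone = clip_visual
        for p in self.backbone.parameters():                            # CLIP stays frozen
            p.requires_grad = False
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)        # learnable query token
        self.layers = nn.ModuleList([EVLDecoderLayer(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W); only the decoder and head receive gradients.
        B, T = video.shape[:2]
        with torch.no_grad():
            feats = self.backbone(video.flatten(0, 1))                  # assumed (B*T, N, D) tokens
        feats = feats.reshape(B, T, *feats.shape[1:])
        q = self.query.expand(B, -1, -1)
        for layer in self.layers:
            q = layer(q, feats)
        return self.head(q.squeeze(1))
```

In this setup only the decoder layers, the query token, and the linear head appear in the optimizer, which is what keeps training cheap: the frozen backbone can run under `torch.no_grad()` (or its features can even be precomputed and cached) while gradients flow through the small decoder alone.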


Results from the Paper


Ranked #26 on Action Classification on Kinetics-400 (using extra training data)

Task: Action Classification
Dataset: Kinetics-400
Model: EVL (CLIP ViT-L/14@336px, frozen, 32 frames)
Uses Extra Training Data: Yes
Metrics: Acc@1 87.7 (Global Rank #26), Acc@5 97.8 (Global Rank #12)
