TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Action Recognition	HMDB51	MOV (ViT-B/16)	Top-1 Accuracy	60.8	# 4
Zero-Shot Action Recognition	HMDB51	MOV (ViT-L/14)	Top-1 Accuracy	64.7	# 1
Zero-Shot Action Recognition	UCF101	MOV (ViT-B/16)	Top-1 Accuracy	82.6	# 8
Zero-Shot Action Recognition	UCF101	MOV (ViT-L/14)	Top-1 Accuracy	87.1	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-open-vocabulary-video/zero-shot-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-hmdb51?p=multimodal-open-vocabulary-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-open-vocabulary-video/zero-shot-action-recognition-on-ucf101)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-ucf101?p=multimodal-open-vocabulary-video)`

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

15 Jul 2022 · Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while also generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
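
The abstract names two ingredients, a cross-modal fusion step and open-vocabulary matching against text, without giving details. The sketch below illustrates the general idea only and is not the authors' implementation. It assumes token embeddings for each modality have already been produced by some encoder, uses a single-head dot-product cross-attention with no learned projections as a stand-in for the fusion mechanism, and scores classes by cosine similarity to class-name text embeddings. All function names (`cross_attend`, `classify`) and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_tokens, context_tokens):
    """Assumed fusion step: each query token (e.g. video) attends over the
    context tokens (e.g. flow or audio) and adds the attended summary back
    as a residual. No learned weights are used in this sketch."""
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in query_tokens:
        weights = softmax([dot(q, k) * scale for k in context_tokens])
        attended = [
            sum(w * k[i] for w, k in zip(weights, context_tokens))
            for i in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def mean_pool(tokens):
    """Average a list of token vectors into one clip-level vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def classify(video_tokens, other_tokens, class_text_embeddings):
    """Open-vocabulary scoring: cosine similarity between the fused clip
    embedding and each class-name text embedding."""
    clip = l2_normalize(mean_pool(cross_attend(video_tokens, other_tokens)))
    scores = {
        name: dot(clip, l2_normalize(emb))
        for name, emb in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings (hypothetical; real ones would come from encoders).
video_tokens = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]
audio_tokens = [[0.8, 0.0, 0.1], [0.7, 0.1, 0.0]]
class_text_embeddings = {
    "playing guitar": [1.0, 0.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

label, scores = classify(video_tokens, audio_tokens, class_text_embeddings)
```

Because the class list is just a dictionary of text embeddings, new classes can be added at inference time without retraining, which is the property that makes this style of classifier open-vocabulary.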

Tasks

Optical Flow Estimation

Video Classification

Zero-Shot Action Recognition

Datasets

ImageNet

UCF101

Kinetics

HMDB51

AudioSet

VGG-Sound

Results from the Paper

Ranked #1 on Zero-Shot Action Recognition on HMDB51

Methods

BASE
