TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Charades	VicTR (ViT-L/14)	MAP	57.6	# 8
Zero-Shot Action Recognition	HMDB51	VicTR (ViT-B/16)	Top-1 Accuracy	51.0	# 10
Action Classification	Kinetics-400	VicTR (ViT-L/14)	Acc@1	87.0	# 36
Zero-Shot Action Recognition	UCF101	VicTR (ViT-B/16)	Top-1 Accuracy	72.4	# 11

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/victr-video-conditioned-text-representations/action-classification-on-charades)](https://paperswithcode.com/sota/action-classification-on-charades?p=victr-video-conditioned-text-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/victr-video-conditioned-text-representations/zero-shot-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-hmdb51?p=victr-video-conditioned-text-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/victr-video-conditioned-text-representations/zero-shot-action-recognition-on-ucf101)](https://paperswithcode.com/sota/zero-shot-action-recognition-on-ucf101?p=victr-video-conditioned-text-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/victr-video-conditioned-text-representations/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=victr-video-conditioned-text-representations)`

VicTR: Video-conditioned Text Representations for Activity Recognition

5 Apr 2023 · Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

Vision-Language models (VLMs) have excelled in the image domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely available semantic information, in the form of visually grounded auxiliary text (e.g., object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
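The abstract does not spell out how text embeddings are conditioned on video, so the following is only a rough illustration of the general idea, not the authors' architecture. It assumes a single residual cross-attention step in which class text embeddings attend over per-frame visual embeddings, mean pooling over frames for the video embedding, and a CLIP-style temperature. All function names, weight matrices, and sizes here are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condition_text_on_video(text_emb, frame_emb, w_q, w_k, w_v):
    """Assumed conditioning step (hypothetical, not from the paper).

    text_emb:  (C, D) one embedding per class prompt
    frame_emb: (T, D) one embedding per video frame
    Returns (C, D) text embeddings updated by residual cross-attention
    over the frames of this particular video.
    """
    q = text_emb @ w_q                                    # (C, D)
    k = frame_emb @ w_k                                   # (T, D)
    v = frame_emb @ w_v                                   # (T, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), -1)    # (C, T)
    return text_emb + attn @ v                            # residual update


def classify(text_emb, frame_emb, w_q, w_k, w_v, temperature=0.07):
    """Score each class by cosine similarity in the shared latent space."""
    video_emb = l2_normalize(frame_emb.mean(axis=0))      # (D,) mean-pooled
    cond_text = l2_normalize(
        condition_text_on_video(text_emb, frame_emb, w_q, w_k, w_v)
    )                                                     # (C, D)
    logits = cond_text @ video_emb / temperature          # (C,)
    return softmax(logits)


# Toy usage with random stand-ins for encoder outputs and learned weights.
rng = np.random.default_rng(0)
C, T, D = 5, 8, 16
text_emb = rng.normal(size=(C, D))
frame_emb = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
probs = classify(text_emb, frame_emb, w_q, w_k, w_v)
print(probs.shape, float(probs.sum()))
```

In this sketch the text side changes per video while the visual side is only pooled, which mirrors the abstract's emphasis on augmenting text rather than visual information. In a real system the random arrays would be replaced by outputs of pretrained image-VLM encoders and the projection weights would be learned.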

Code

No code implementations yet.

Tasks

Action Classification

Activity Recognition

Zero-Shot Action Recognition

Datasets

UCF101

Kinetics

HMDB51

Kinetics 400

Charades

NExT-QA

Results from the Paper

Ranked #8 on Action Classification on Charades

Methods

No methods listed for this paper.
