TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Action Recognition	HMDB51	OST	Top-1 Accuracy	55.9	# 8
Zero-Shot Action Recognition	Kinetics	OST	Top-1 Accuracy	75.1	# 2
Zero-Shot Action Recognition	Kinetics	OST	Top-5 Accuracy	94.6	# 1
Zero-Shot Action Recognition	UCF101	OST	Top-1 Accuracy	79.7	# 9

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

30 Nov 2023 · Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen

Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, which leads to a less distinct semantic space and potentially limits performance. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and providing a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which frames video recognition as solving for the optimal matching flow between frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
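
The abstract describes matching frame-level representations to descriptors as an optimal matching flow but does not say how that flow is computed. One common way to solve such a matching is entropy-regularized optimal transport with Sinkhorn iterations, and the sketch below assumes that formulation. The function names, the cosine-based cost, the uniform marginals, and the `eps` value are illustrative assumptions, not details taken from the paper or its repository.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate an entropy-regularized transport plan (assumed solver).

    cost: (n_frames, n_descriptors) matrix of matching costs.
    Both sides are given uniform mass, which is an assumption of this sketch.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over frames
    b = np.full(m, 1.0 / m)  # uniform mass over descriptors
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def video_class_score(frames, descriptors, eps=0.1):
    """Score one video against one class through the transport plan.

    frames: (n_frames, dim) frame embeddings.
    descriptors: (n_descriptors, dim) text embeddings of one class's descriptors.
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = f @ d.T                          # cosine similarities
    plan = sinkhorn_plan(1.0 - sim, eps)   # cost = 1 - similarity
    return float((plan * sim).sum())       # plan-weighted similarity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 32))      # stand-in for frame features
    class_a = rng.normal(size=(4, 32))     # stand-in descriptor embeddings
    class_b = rng.normal(size=(4, 32))
    scores = [video_class_score(frames, c) for c in (class_a, class_b)]
    print("predicted class index:", int(np.argmax(scores)))
```

Under these assumptions, a video would be assigned to the class whose descriptors yield the highest plan-weighted similarity. In a real pipeline the random arrays would be replaced by embeddings from the image and text encoders.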

Code

tomchen-ctj/OST official

Tasks

Descriptive

Language Modelling

Large Language Model

Video Recognition

Zero-Shot Action Recognition

Zero-Shot Action Recognition on HMDB51

Zero-Shot Action Recognition on UCF101

Datasets

ImageNet

UCF101

Kinetics

HMDB51

ActivityNet

Something-Something V2

Results from the Paper

Ranked #2 on Zero-Shot Action Recognition on Kinetics

Methods

BASE
