The proxy task is to estimate the position and size of an image patch in a sequence of video frames, given only the target bounding box in the first frame.
We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks.
Ranked #1 on Action Classification on Moments in Time (using extra training data)
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.
Ranked #1 on Action Classification on Kinetics-600 (Vid acc@1 metric)
This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples.
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #1 on Action Classification on Kinetics-600 (using extra training data)
Methods that reach state-of-the-art (SotA) accuracy usually make use of 3D convolution layers to abstract the temporal information from video frames.
Ranked #1 on Action Classification on Kinetics-400 (Flops x views metric)
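The 3D convolutions mentioned above slide a kernel over time as well as space, so each output mixes information from several consecutive frames. A minimal single-channel sketch in NumPy (the function name and "valid"-padding choice are illustrative, not from the paper; the cross-correlation form used in deep learning is shown):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution with 'valid' padding.
    video: (T, H, W) grayscale clip, kernel: (kt, kh, kw).
    Because the kernel spans kt frames, each output value
    aggregates temporal as well as spatial context."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # elementwise product over a (kt, kh, kw) space-time window
                out[t, i, j] = (video[t:t + kt, i:i + kh, j:j + kw] * kernel).sum()
    return out
```

Real networks apply many such kernels per layer over multi-channel inputs; the loop above only illustrates the space-time windowing.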
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Classification on Charades
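Online inference on streaming video, as MoViNets target, hinges on causal temporal operations that carry a small buffer of past frames instead of reprocessing the whole clip. A minimal NumPy sketch of that stream-buffer idea, with all names hypothetical and a single scalar-output kernel for brevity:

```python
import numpy as np

def causal_conv_full(x, w):
    """Causal temporal convolution over a whole clip at once.
    x: (T, C) per-frame features, w: (K, C) temporal kernel.
    Frame t only sees frames t-K+1..t (zeros before the clip)."""
    K, C = w.shape
    padded = np.concatenate([np.zeros((K - 1, C)), x], axis=0)
    return np.array([(padded[t:t + K] * w).sum() for t in range(len(x))])

class StreamingCausalConv:
    """Same computation, one frame at a time: a buffer holds the
    last K-1 frames, so arbitrarily long video is processed online
    with constant memory."""
    def __init__(self, w):
        self.w = w
        self.buf = np.zeros((w.shape[0] - 1, w.shape[1]))

    def step(self, frame):
        # slide the K-frame window forward by one frame
        window = np.concatenate([self.buf, frame[None]], axis=0)
        self.buf = window[1:]
        return (window * self.w).sum()
```

Feeding a clip frame by frame through `StreamingCausalConv.step` reproduces `causal_conv_full` exactly, which is what makes buffered causal layers a drop-in path to online inference.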
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
Ranked #22 on Image Classification on CIFAR-100
We demonstrate that our proposed motion representation model works both within a single specific domain (intra-domain action classification) and across different unseen domains (cross-domain action classification).
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ranked #1 on Action Recognition on Diving-48
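In the convolution-free approach above, every frame patch becomes a token and tokens attend to each other across both space and time. A single-head NumPy sketch of the joint space-time variant (function and weight names are illustrative, not the paper's API):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_space_time_attention(x, Wq, Wk, Wv):
    """Single-head self-attention where every patch token attends to
    all patches in all frames.
    x: (T, N, D) -- T frames, N patches per frame, D-dim embeddings.
    Wq, Wk, Wv: (D, D) projection matrices."""
    T, N, D = x.shape
    tokens = x.reshape(T * N, D)               # flatten space and time
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))       # (T*N, T*N) weights
    return (attn @ v).reshape(T, N, -1)
```

The joint form costs attention over T*N tokens at once; the paper's best-performing "divided" variant instead applies temporal attention and spatial attention as separate, cheaper steps.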