Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Libraries
Use these libraries to find Action Classification models and implementations.
Latest papers
Dual-path Adaptation from Image to Video Transformers
In this paper, we efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
Scaling Vision Transformers to 22 Billion Parameters
The scaling of Transformers has driven breakthrough capabilities for language models.
AIM: Adapting Image Models for Efficient Video Action Recognition
Recent vision transformer based video models mostly follow the "image pre-training then fine-tuning" paradigm and have achieved great success on multiple video benchmarks.
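The "image pre-training then fine-tuning" paradigm can be sketched minimally: a frozen image backbone encodes each frame, the per-frame features are pooled over time, and only a small classification head is trained. This is a hedged illustration, not AIM's actual method; the backbone is stubbed with a random projection (the real models use pre-trained ViT/Swin), and all names, shapes, and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for a frozen image backbone: maps each flattened frame to a D-dim feature.
# (Illustrative stand-in for a pre-trained ViT/Swin; weights stay frozen.)
D_FEAT, N_CLASSES = 64, 10
W_backbone = rng.normal(size=(3 * 8 * 8, D_FEAT))

def encode_frames(video):
    """video: (T, 3, 8, 8) array of frames -> (T, D_FEAT) per-frame features."""
    t = video.shape[0]
    return video.reshape(t, -1) @ W_backbone

def classify_video(video, W_head, b_head):
    """Temporal average pooling over frame features, then a trainable linear head."""
    feats = encode_frames(video)        # (T, D_FEAT)
    clip_feat = feats.mean(axis=0)      # (D_FEAT,) pooled over time
    return clip_feat @ W_head + b_head  # (N_CLASSES,) logits

# Only the head's parameters would be updated during fine-tuning.
W_head = rng.normal(size=(D_FEAT, N_CLASSES)) * 0.01
b_head = np.zeros(N_CLASSES)

video = rng.normal(size=(16, 3, 8, 8))  # a 16-frame clip
logits = classify_video(video, W_head, b_head)
```

Methods like AIM go further by inserting lightweight trainable adapters inside the frozen backbone rather than training only the head, but the frozen-backbone-plus-small-trainable-part structure is the same.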
Baseline Method for the Sport Task of MediaEval 2022 with 3D CNNs using Attention Mechanisms
We propose two types of 3D-CNN architectures to solve the two subtasks.
Fine-Grained Action Detection with RGB and Pose Information using Two Stream Convolutional Networks
As participants of the MediaEval 2022 Sport Task, we propose a two-stream network approach for the classification and detection of table tennis strokes.
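A two-stream approach of this kind is typically combined by late fusion: each stream produces class scores independently, and the scores are averaged. The sketch below uses stubbed streams with random logits and a weighted softmax average, which is a common fusion rule rather than necessarily the authors' exact scheme; the class count and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_STROKE_CLASSES = 20  # illustrative; the real stroke taxonomy differs

def rgb_stream(clip):
    """Stub for the appearance (RGB) stream; returns per-class logits."""
    return rng.normal(size=N_STROKE_CLASSES)

def pose_stream(keypoints):
    """Stub for the pose stream; returns per-class logits."""
    return rng.normal(size=N_STROKE_CLASSES)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def late_fusion(clip, keypoints, w_rgb=0.5, w_pose=0.5):
    """Weighted average of the two streams' softmax scores."""
    scores = w_rgb * softmax(rgb_stream(clip)) + w_pose * softmax(pose_stream(keypoints))
    return int(scores.argmax()), scores

pred, scores = late_fusion(clip=None, keypoints=None)
```

The fusion weights can be tuned on a validation set when one modality (e.g. pose) is more reliable than the other.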
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
Hierarchical Explanations for Video Action Recognition
To interpret deep neural networks, one main approach is to dissect the visual input and find the prototypical parts responsible for the classification.
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Learning Video Representations from Large Language Models
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
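The masked-feature-distillation objective behind such methods can be sketched as follows: tokens are randomly masked, the student predicts the teacher's features, and a regression loss is computed only at the masked positions. This is a minimal illustration, not MVD's implementation; the teacher/student outputs are stand-in arrays and the mask ratio, shapes, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T_TOKENS, D = 32, 16  # illustrative token count and feature dimension

# Frozen teacher features serve as regression targets; the student's
# predictions are stubbed as the targets plus noise.
teacher_feats = rng.normal(size=(T_TOKENS, D))
student_preds = teacher_feats + 0.1 * rng.normal(size=(T_TOKENS, D))

# Randomly mask ~75% of tokens; the loss is computed only on masked positions.
mask = rng.random(T_TOKENS) < 0.75

def masked_mse(pred, target, mask):
    """Mean squared error restricted to masked token positions."""
    diff = (pred - target)[mask]
    return float((diff ** 2).mean())

loss = masked_mse(student_preds, teacher_feats, mask)
```

Using a video teacher versus an image teacher changes what the target features encode (temporal versus spatial structure), which is the trade-off the paper's observation refers to.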