Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Latest papers
Joint Skeletal and Semantic Embedding Loss for Micro-gesture Classification
In this paper, we briefly introduce the solution of our team HFUT-VUT for the Micro-gesture Classification track in the MiGA challenge at IJCAI 2023.
What Can Simple Arithmetic Operations Do for Temporal Modeling?
We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.
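As a rough illustration of the idea (a minimal sketch with made-up names, not the paper's actual ATM implementation), a temporal module can be built purely from element-wise arithmetic over adjacent-frame features, followed by a cheap projection:

```python
import torch
import torch.nn as nn

class ArithmeticTemporalModule(nn.Module):
    """Toy temporal module: combines adjacent-frame features with
    element-wise subtraction and multiplication, then a linear projection.
    A simplified illustration only, not the authors' exact design."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) per-frame features
        prev = torch.roll(x, shifts=1, dims=1)  # previous frame (wraps at the clip boundary)
        diff = x - prev                          # subtraction captures motion-like change
        prod = x * prev                          # multiplication captures co-activation
        out = self.proj(torch.cat([diff, prod], dim=-1))
        return x + out                           # residual connection keeps the module lightweight

# Usage: 8 frames of 512-d features for a batch of 2 clips
feats = torch.randn(2, 8, 512)
atm = ArithmeticTemporalModule(512)
print(atm(feats).shape)  # torch.Size([2, 8, 512])
```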
Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
Both PAAT and PAAB surpass their respective backbone Transformers by up to 9.8% in real-world action recognition and 21.8% in multi-view robotic video alignment.
HomE: Homography-Equivariant Video Representation Learning
In this work, we propose a novel method for representation learning of multi-view videos, where we explicitly model the representation space to maintain Homography Equivariance (HomE).
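One way to read the stated goal, sketched under my own assumptions rather than the authors' implementation: warp a clip with a known homography, and penalize the representation of the warped view for deviating from a corresponding transformation (here, a hypothetical learnable linear map) of the original view's representation.

```python
import torch
import torch.nn as nn

# z      : representation of the original clip, shape (batch, dim)
# z_warp : representation of the same clip after warping its frames with a
#          known homography H (the warping step itself is not shown here)
# t_h    : transformation associated with H in representation space; using a
#          learnable linear map is an assumption made for this sketch
def homography_equivariance_loss(z: torch.Tensor,
                                 z_warp: torch.Tensor,
                                 t_h: nn.Linear) -> torch.Tensor:
    """Penalize deviation from equivariance: the representation of the warped
    view should match the transformed representation of the original view."""
    return torch.nn.functional.mse_loss(t_h(z), z_warp)

# Toy usage with random features standing in for an encoder's outputs
z, z_warp = torch.randn(4, 256), torch.randn(4, 256)
t_h = nn.Linear(256, 256)
print(homography_equivariance_loss(z, z_warp, t_h).item())
```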
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already captures the essential temporal cues without temporal attention.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
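As a rough sketch of the dual-masking idea as described in the abstract (the ratios and names below are illustrative, not the paper's settings): the encoder processes only a small fraction of video tokens, and the decoder reconstructs only a subset of the remaining masked tokens rather than all of them, which reduces memory and compute.

```python
import torch

def dual_mask(num_tokens: int, encoder_keep: float = 0.1, decoder_keep: float = 0.5):
    """Illustrative dual masking: choose visible tokens for the encoder and a
    subset of the remaining (masked) tokens as reconstruction targets for the
    decoder. Ratios here are made up for the example."""
    perm = torch.randperm(num_tokens)
    n_vis = int(num_tokens * encoder_keep)
    visible_idx = perm[:n_vis]                 # tokens the encoder processes
    masked_idx = perm[n_vis:]                  # tokens hidden from the encoder
    n_dec = int(masked_idx.numel() * decoder_keep)
    target_idx = masked_idx[torch.randperm(masked_idx.numel())[:n_dec]]
    return visible_idx, target_idx

visible, targets = dual_mask(num_tokens=1568)  # e.g. a 16x14x14 grid of video patches
print(visible.numel(), targets.numel())
```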
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.
The effectiveness of MAE pre-pretraining for billion-scale pretraining
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.