Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Libraries
Use these libraries to find Action Classification models and implementations.
Latest papers
Dual-path Adaptation from Image to Video Transformers
In this paper, we efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
Scaling Vision Transformers to 22 Billion Parameters
The scaling of Transformers has driven breakthrough capabilities for language models.
AIM: Adapting Image Models for Efficient Video Action Recognition
Recent vision transformer based video models mostly follow the "image pre-training then fine-tuning" paradigm and have achieved great success on multiple video benchmarks.
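The "image pre-training then fine-tuning" paradigm can be sketched minimally: a frozen image backbone encodes each frame, the per-frame features are pooled over time, and only a small classification head is trained. This is a hedged illustration, not AIM's actual method; the backbone is stubbed with a random projection (the real models use pre-trained ViT/Swin), and all names, shapes, and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for a frozen image backbone: maps each flattened frame to a D-dim feature.
# (Illustrative stand-in for a pre-trained ViT/Swin; weights stay frozen.)
D_FEAT, N_CLASSES = 64, 10
W_backbone = rng.normal(size=(3 * 8 * 8, D_FEAT))

def encode_frames(video):
    """video: (T, 3, 8, 8) array of frames -> (T, D_FEAT) per-frame features."""
    t = video.shape[0]
    return video.reshape(t, -1) @ W_backbone

def classify_video(video, W_head, b_head):
    """Temporal average pooling over frame features, then a trainable linear head."""
    feats = encode_frames(video)        # (T, D_FEAT)
    clip_feat = feats.mean(axis=0)      # (D_FEAT,) pooled over time
    return clip_feat @ W_head + b_head  # (N_CLASSES,) logits

# Only the head's parameters would be updated during fine-tuning.
W_head = rng.normal(size=(D_FEAT, N_CLASSES)) * 0.01
b_head = np.zeros(N_CLASSES)

video = rng.normal(size=(16, 3, 8, 8))  # a 16-frame clip
logits = classify_video(video, W_head, b_head)
```

Methods like AIM go further by inserting lightweight trainable adapters inside the frozen backbone rather than training only the head, but the frozen-backbone-plus-small-trainable-part structure is the same.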
Baseline Method for the Sport Task of MediaEval 2022 with 3D CNNs using Attention Mechanisms
We propose two types of 3D-CNN architectures to solve the two subtasks.
Fine-Grained Action Detection with RGB and Pose Information using Two Stream Convolutional Networks
As participants of the MediaEval 2022 Sport Task, we propose a two-stream network approach for the classification and detection of table tennis strokes.
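A two-stream approach of this kind is typically combined by late fusion: each stream produces class scores independently, and the scores are averaged. The sketch below uses stubbed streams with random logits and a weighted softmax average, which is a common fusion rule rather than necessarily the authors' exact scheme; the class count and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_STROKE_CLASSES = 20  # illustrative; the real stroke taxonomy differs

def rgb_stream(clip):
    """Stub for the appearance (RGB) stream; returns per-class logits."""
    return rng.normal(size=N_STROKE_CLASSES)

def pose_stream(keypoints):
    """Stub for the pose stream; returns per-class logits."""
    return rng.normal(size=N_STROKE_CLASSES)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def late_fusion(clip, keypoints, w_rgb=0.5, w_pose=0.5):
    """Weighted average of the two streams' softmax scores."""
    scores = w_rgb * softmax(rgb_stream(clip)) + w_pose * softmax(pose_stream(keypoints))
    return int(scores.argmax()), scores

pred, scores = late_fusion(clip=None, keypoints=None)
```

The fusion weights can be tuned on a validation set when one modality (e.g. pose) is more reliable than the other.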
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
Hierarchical Explanations for Video Action Recognition
To interpret deep neural networks, one main approach is to dissect the visual input and find the prototypical parts responsible for the classification.
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Learning Video Representations from Large Language Models
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
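The masked-feature-distillation objective behind such methods can be sketched as follows: tokens are randomly masked, the student predicts the teacher's features, and a regression loss is computed only at the masked positions. This is a minimal illustration, not MVD's implementation; the teacher/student outputs are stand-in arrays and the mask ratio, shapes, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T_TOKENS, D = 32, 16  # illustrative token count and feature dimension

# Frozen teacher features serve as regression targets; the student's
# predictions are stubbed as the targets plus noise.
teacher_feats = rng.normal(size=(T_TOKENS, D))
student_preds = teacher_feats + 0.1 * rng.normal(size=(T_TOKENS, D))

# Randomly mask ~75% of tokens; the loss is computed only on masked positions.
mask = rng.random(T_TOKENS) < 0.75

def masked_mse(pred, target, mask):
    """Mean squared error restricted to masked token positions."""
    diff = (pred - target)[mask]
    return float((diff ** 2).mean())

loss = masked_mse(student_preds, teacher_feats, mask)
```

Using a video teacher versus an image teacher changes what the target features encode (temporal versus spatial structure), which is the trade-off the paper's observation refers to.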