Action Recognition
881 papers with code • 49 benchmarks • 105 datasets
Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify and categorize the actions being performed in the video or image into a predefined set of action classes.
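As a rough illustration of "classifying into a predefined set of action classes", the sketch below turns per-frame class scores into a single video-level label by averaging softmax probabilities across frames. This is only one simple baseline, not a method described on this page. The label list `ACTION_CLASSES` and the function `classify_video` are hypothetical names introduced for the example, and the per-frame scores are assumed to come from some upstream model.

```python
import math

# Hypothetical label set for illustration; real benchmarks define their own classes.
ACTION_CLASSES = ["walking", "running", "jumping"]


def softmax(scores):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(frame_scores, classes=ACTION_CLASSES):
    """Average per-frame class probabilities and return (label, mean probability).

    frame_scores: list of per-frame score lists, one score per class.
    Assumption: scores are produced by some upstream per-frame model.
    """
    if not frame_scores:
        raise ValueError("need at least one frame of scores")
    probs = [softmax(scores) for scores in frame_scores]
    n_frames = len(probs)
    mean_probs = [
        sum(p[i] for p in probs) / n_frames for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda i: mean_probs[i])
    return classes[best], mean_probs[best]


if __name__ == "__main__":
    # Two frames of made-up scores; both favour the second class.
    label, confidence = classify_video([[0.1, 2.0, 0.3], [0.0, 1.5, 0.2]])
    print(label, round(confidence, 3))
```

Averaging over frames ignores temporal order, so it is a weak baseline; video models usually reason over motion as well as appearance.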
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, having on the order of 10k videos.
Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.
Libraries
Use these libraries to find Action Recognition models and implementations
Datasets
Subtasks
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few Shot Action Recognition
- Fine-grained Action Recognition
- Action Triplet Recognition
- Open Set Action Recognition
- Micro-Action Recognition
- Weakly-Supervised Action Recognition
- Atomic Action Recognition
- Animal Action Recognition
- Transportation Mode Detection
- Open Vocabulary Action Recognition
- Action Recognition In Still Images
Latest papers
CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment
However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA.
Aligning Actions and Walking to LLM-Generated Textual Descriptions
For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations.
VG4D: Vision-Language Model Goes 4D Video Recognition
By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance.
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
Our framework leverages both labeled and unlabelled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples.
TIM: A Time Interval Machine for Audio-Visual Action Recognition
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
PREGO: online mistake detection in PRocedural EGOcentric videos
We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos.
Disentangled Pre-training for Human-Object Interaction Detection
Therefore, we propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem.
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
We interact with the world with our hands and see it through our own (egocentric) perspective.
Understanding Long Videos in One Multimodal Language Model Pass
In addition to faster inference, we discover the resulting models to yield surprisingly good accuracy on long-video tasks, even with no video specific information.