Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Libraries
Use these libraries to find Action Classification models and implementations.
Latest papers with no code
Learning Correlation Structures for Vision Transformers
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.
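For context, StructSA builds on standard scaled dot-product self-attention, whose key-query correlation map it enriches with structural patterns. A minimal sketch of that baseline (not the StructSA method itself; the function name and shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Baseline self-attention; StructSA extends the q-k correlation step.

    q, k, v: arrays of shape (seq_len, dim).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # key-query correlation map
    # softmax over keys (numerically stable)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

StructSA's contribution lies in how it processes the `scores` correlation map before aggregation, rather than treating each score independently.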
Classification of Tennis Actions Using Deep Learning
Recent advances in deep learning make it possible to identify specific events in videos with greater precision.
Robustness Evaluation of Machine Learning Models for Robot Arm Action Recognition in Noisy Environments
This paper studies robot arm action recognition in noisy environments using machine learning techniques.
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts.
ST(OR)2: Spatio-Temporal Object Level Reasoning for Activity Recognition in the Operating Room
Surgical robotics holds much promise for improving patient safety and clinician experience in the Operating Room (OR).
AdaFocus: Towards End-to-end Weakly Supervised Learning for Long-Video Action Understanding
Under the weak supervision setting, action labels are provided for the whole video without precise start and end times of the action clip.
ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization
This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set.
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.
OmniVec: Learning robust representations with cross modal sharing
We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, allowing us to achieve state-of-the-art results on most of the benchmarks.
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.