Action Recognition

593 papers with code • 34 benchmarks • 84 datasets

Human action recognition has become an active research area in recent years because it plays a significant role in video understanding. In general, human actions can be recognized from multiple modalities, such as appearance, depth, optical flow, and body skeletons.

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, on the order of 10k videos.

Note that some benchmarks, e.g. Kinetics-400, may be listed under the Action Classification or Video Classification tasks instead.


Most implemented papers

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

kenshohara/3D-ResNets-PyTorch CVPR 2018

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

open-mmlab/mmaction2 CVPR 2017

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.

Non-local Neural Networks

facebookresearch/video-nonlocal-net CVPR 2018

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

yysijie/st-gcn 23 Jan 2018

Dynamics of human body skeletons convey significant information for human action recognition.

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

CIDEr: Consensus-based Image Description Evaluation

tylin/coco-caption CVPR 2015

We propose a novel paradigm for evaluating image descriptions that uses human consensus.

Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

adityac94/Grad_CAM_plus_plus 30 Oct 2017

Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

yjxiong/temporal-segment-networks 2 Aug 2016

The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of the temporal segment network.

Learning Spatiotemporal Features with 3D Convolutional Networks

open-mmlab/mmaction2 ICCV 2015

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset.

A Closer Look at Spatiotemporal Convolutions for Action Recognition

facebookresearch/R2Plus1D CVPR 2018

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.