Action Recognition

858 papers with code • 49 benchmarks • 104 datasets

Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify and categorize the actions being performed in the video or image into a predefined set of action classes.

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset, will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets has meant that most popular benchmarks for action recognition are small, having on the order of 10k videos.

Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.

Libraries

Use these libraries to find Action Recognition models and implementations
20 papers
3,805
10 papers
2,926
4 papers
547
See all 7 libraries.

Most implemented papers

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

tensorflow/tpu ICML 2019

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available.

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

open-mmlab/mmaction2 CVPR 2017

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.

Non-local Neural Networks

facebookresearch/video-nonlocal-net CVPR 2018

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.

Learning Spatiotemporal Features with 3D Convolutional Networks

facebookarchive/C3D ICCV 2015

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset.

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

kenshohara/3D-ResNets-PyTorch CVPR 2018

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels.

CIDEr: Consensus-based Image Description Evaluation

tylin/coco-caption CVPR 2015

We propose a novel paradigm for evaluating image descriptions that uses human consensus.

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

yysijie/st-gcn 23 Jan 2018

Dynamics of human body skeletons convey significant information for human action recognition.

Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

adityac94/Grad_CAM_plus_plus 30 Oct 2017

Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.

A Closer Look at Spatiotemporal Convolutions for Action Recognition

facebookresearch/R2Plus1D CVPR 2018

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.