Action Detection

233 papers with code • 11 benchmarks • 33 datasets

Action Detection aims to find both where and when an action occurs within a video clip and to classify which action is taking place. Results are typically given as action tubelets, which are action bounding boxes linked across time in the video. The task is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
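To make the tubelet idea concrete, here is a minimal sketch of one way to represent a tubelet and compare two of them. The `Tubelet` class, the helper names, and the choice of scoring overlap as temporal IoU multiplied by the mean spatial IoU over shared frames are all illustrative assumptions, not an API from any of the libraries listed on this page; individual benchmarks may define overlap differently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Tubelet:
    """Hypothetical container: one action label plus one box per frame."""
    label: str
    boxes: Dict[int, Box]  # frame index -> bounding box (assumed non-empty)

    @property
    def start(self) -> int:
        return min(self.boxes)

    @property
    def end(self) -> int:
        return max(self.boxes)


def box_iou(a: Box, b: Box) -> float:
    """Spatial IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: Tubelet, b: Tubelet) -> float:
    """IoU of the two frame ranges (inclusive start and end frames)."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


def spatiotemporal_iou(a: Tubelet, b: Tubelet) -> float:
    """Assumed convention: temporal IoU times mean box IoU on shared frames."""
    shared = a.boxes.keys() & b.boxes.keys()
    if not shared:
        return 0.0
    mean_spatial = sum(box_iou(a.boxes[f], b.boxes[f]) for f in shared) / len(shared)
    return temporal_iou(a, b) * mean_spatial
```

Temporal localization corresponds to using only `temporal_iou`, while trimmed action recognition would compare only the `label` fields.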

Benchmarks

These leaderboards are used to track progress in Action Detection.

Dataset	Best Model
J-HMDB	HIT
Charades	TTM
UCF101-24	STAR/L
Multi-THUMOS	MLAD
UCF Sports	T-CNN
THUMOS'14	MAT (Ours) Trans
TSU	PDAN
TTStroke-21 ME22	STCNN-V2 (Vote decision)
TTStroke-21 ME21	STCNN
MultiSports	HIT
MultiTHUMOS	PAT

Libraries

Use these libraries to find Action Detection models and implementations.

open-mmlab/mmaction2 • 6 papers • 3,862

alibaba-damo-academy/FunASR • 3 papers • 3,062

Frostinassiky/gtad • 3 papers • 216

towhee-io/towhee • 2 papers • 2,959

See all 6 libraries.

Subtasks

Audio-Visual Active Speaker Detection

Fine-Grained Action Detection

Action Triplet Detection

Few Shot Temporal Action Localization

Multiple Action Detection

Most implemented papers

Continuous control with deep reinforcement learning

ray-project/ray • 9 Sep 2015

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.

BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

wzmsltw/BSN-boundary-sensitive-network.pytorch • ECCV 2018

Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content.

SlowFast Networks for Video Recognition

facebookresearch/SlowFast • ICCV 2019

We present SlowFast networks for video recognition.

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

PaddlePaddle/models • ICCV 2019

To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map.
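The indexing behind a dense confidence map of this kind can be sketched as follows: each cell corresponds to one proposal, identified by its start position and duration, so the end boundary is start plus duration. This is only an illustration of the layout, not the BMN model. In the paper the confidence values are predicted by a network, whereas here the product of hypothetical per-frame start and end probabilities is used as a stand-in score, and the function names `bm_confidence_map` and `top_proposals` are invented for this example.

```python
from typing import List, Tuple


def bm_confidence_map(start_prob: List[float],
                      end_prob: List[float],
                      max_duration: int) -> List[List[float]]:
    """Fill a (duration x start) grid of proposal scores.

    Cell [d - 1][s] describes the proposal starting at position s and
    ending at position s + d. Cells whose end falls outside the video
    are left at 0.0.
    """
    num_positions = len(start_prob)
    grid = [[0.0] * num_positions for _ in range(max_duration)]
    for d in range(1, max_duration + 1):
        for s in range(num_positions):
            e = s + d
            if e < num_positions:
                # Stand-in score (assumption): start prob times end prob.
                grid[d - 1][s] = start_prob[s] * end_prob[e]
    return grid


def top_proposals(grid: List[List[float]], k: int) -> List[Tuple[int, int, float]]:
    """Return the k highest-scoring (start, end, score) proposals."""
    scored = [
        (score, s, s + d_index + 1)
        for d_index, row in enumerate(grid)
        for s, score in enumerate(row)
        if score > 0.0
    ]
    scored.sort(reverse=True)
    return [(s, e, score) for score, s, e in scored[:k]]
```

For example, with a strong start signal at position 0 and a strong end signal at position 2, the highest-scoring cell is the one for start 0 and duration 2.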

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

tensorflow/models • CVPR 2018

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Rescaling Egocentric Vision

epic-kitchens/epic-kitchens-100-annotations • 23 Jun 2020

This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS.

Temporal Action Detection with Structured Segment Networks

open-mmlab/mmaction • ICCV 2017

Detecting actions in untrimmed videos is an important yet challenging task.

CholecTriplet2021: A benchmark challenge for surgical action triplet recognition

CAMMA-public/cholectriplet2021 • 10 Apr 2022

In this paper, we present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.

HAKE: Human Activity Knowledge Engine

DirtyHarryLYL/HAKE • 13 Apr 2019

To address these and promote the activity understanding, we build a large-scale Human Activity Knowledge Engine (HAKE) based on the human body part states.

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

wei-tim/YOWO • 15 Nov 2019

YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
