Temporal Action Localization aims to detect activities in a video stream and output the beginning and end timestamps of each instance. It is closely related to Temporal Action Proposal Generation.
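For concreteness, a minimal sketch of how such outputs are commonly represented and scored: each detection is a (start, end, label, score) segment, and overlap with ground truth is measured by temporal IoU. The `Detection` type and `temporal_iou` helper below are illustrative names, not part of any particular method.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    """One localized action instance: [start, end) in seconds plus a class label and confidence."""
    start: float
    end: float
    label: str
    score: float

def temporal_iou(a: Detection, b: Detection) -> float:
    """Temporal intersection-over-union between two segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction overlapping a ground-truth "high jump" segment.
pred = Detection(12.4, 18.9, "high jump", 0.87)
gt = Detection(13.0, 19.5, "high jump", 1.0)
print(temporal_iou(pred, gt))  # ~0.83
```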
Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network.
To better represent and capture long-term spatio-temporal relationships, we propose three variants of the Self-Attention Network (SAN), namely SAN-V1, SAN-V2, and SAN-V3.
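As a rough illustration of the underlying mechanism (not the paper's SAN-V1/V2/V3 designs, whose details are not given here), a plain scaled dot-product self-attention layer over a sequence of clip features lets every temporal position attend to every other, which is what allows long-term relationships to be captured.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Minimal scaled dot-product self-attention over a sequence of clip features.
    Illustrative only; the SAN variants referenced above differ in details not shown here."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                               # x: (batch, time, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (batch, time, time)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)                      # every time step attends to all others

x = torch.randn(2, 32, 256)                             # 32 temporal segments, 256-d features
print(TemporalSelfAttention(256)(x).shape)              # torch.Size([2, 32, 256])
```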
Second, the second-order information of skeleton data, i.e., the lengths and orientations of the bones, is rarely investigated, even though it is naturally more informative and discriminative for human action recognition.
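A minimal sketch of how this second-order information can be derived from first-order joint coordinates, assuming the skeleton is given as a parent-index list; the toy 4-joint chain below is hypothetical, not the actual NTU RGB+D layout.

```python
import numpy as np

def bone_features(joints: np.ndarray, parents: list) -> np.ndarray:
    """Second-order (bone) features from first-order joint coordinates.

    joints : (num_joints, 3) array of 3D joint positions for one frame.
    parents: parent index per joint (-1 for the root); a real skeleton such as
             NTU RGB+D defines 25 joints, only a toy subset is sketched below.
    Returns (num_joints, 4): unit bone orientation (3 values) plus bone length (1 value).
    """
    feats = np.zeros((len(parents), 4))
    for j, p in enumerate(parents):
        if p < 0:
            continue                              # root joint has no incoming bone
        bone = joints[j] - joints[p]              # vector from parent joint to child joint
        length = np.linalg.norm(bone)
        feats[j, :3] = bone / (length + 1e-8)     # orientation
        feats[j, 3] = length                      # length
    return feats

# Toy 4-joint chain: root -> spine -> neck -> head (indices are illustrative).
joints = np.array([[0, 0, 0], [0, 0.3, 0], [0, 0.55, 0], [0, 0.7, 0.05]], dtype=float)
print(bone_features(joints, parents=[-1, 0, 1, 2]))
```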
Spatio-temporal action localization is a challenging yet fascinating task that aims to detect and classify human actions in video clips.
In this report, we introduce the winning method of the HACS Temporal Action Localization Challenge 2019.
This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately.
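To make this criticism concrete, here is a minimal sketch of the standard weakly supervised formulation being referred to, assuming frame-level action scores are pooled (here by a top-k mean, a common choice) into a video-level prediction with no explicit background class; the function name and the value of k are illustrative.

```python
import torch
import torch.nn.functional as F

def video_level_loss(frame_logits, video_labels, k=8):
    """Illustrative weakly-supervised formulation criticized above.

    frame_logits : (batch, time, num_action_classes) per-frame class scores;
                   note there is no explicit background class.
    video_labels : (batch, num_action_classes) multi-hot video-level labels.
    Video scores are the mean of the top-k frame scores per class, so background
    frames can only contribute by taking on action-class scores.
    """
    topk = frame_logits.topk(k, dim=1).values        # (batch, k, classes)
    video_logits = topk.mean(dim=1)                  # (batch, classes)
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)

frame_logits = torch.randn(4, 100, 20)               # 100 frames, 20 action classes
video_labels = torch.zeros(4, 20); video_labels[:, 3] = 1.0
print(video_level_loss(frame_logits, video_labels))
```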
CNNs are trained on input images from each modality to learn low-level, high-level, and complex features.
For the backbone, we propose multi-branch multi-scale graph convolution networks to extract spatial and temporal features.
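As a single-scale sketch of the basic building block (the multi-branch multi-scale backbone itself is not reproduced here), a spatial graph convolution aggregates features over the skeleton's joint adjacency before a learned projection; `SkeletonGraphConv` and the toy 3-joint graph are hypothetical.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """Single-scale spatial graph convolution over a skeleton graph.
    A hypothetical building block; a multi-branch multi-scale backbone would stack
    several such branches with different adjacency powers and temporal kernels."""
    def __init__(self, in_ch: int, out_ch: int, adjacency: torch.Tensor):
        super().__init__()
        # Row-normalized adjacency with self-loops: D^-1 (A + I)
        a = adjacency + torch.eye(adjacency.size(0))
        self.register_buffer("A", a / a.sum(dim=1, keepdim=True))
        self.lin = nn.Linear(in_ch, out_ch)

    def forward(self, x):                               # x: (batch, time, joints, in_ch)
        x = torch.einsum("vw,btwc->btvc", self.A, x)    # aggregate neighboring joints
        return self.lin(x)                              # project features

# Toy 3-joint chain graph.
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
layer = SkeletonGraphConv(3, 64, A)
print(layer(torch.randn(2, 16, 3, 3)).shape)            # torch.Size([2, 16, 3, 64])
```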
We demonstrate the method on six state-of-the-art 3D convolutional neural networks (CNNs) across three action recognition datasets (Kinetics-400, UCF-101, and HMDB-51) and two egocentric action recognition datasets (EPIC-Kitchens and EGTEA Gaze+).
The proposed representation has the advantage of combining the use of reference joints with a tree-structured skeleton.