The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #2 on Temporal Action Localization on J-HMDB-21
In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution.
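The channel recombination described above can be made concrete with a small sketch. It is plain Python for illustration only: `temporal_conv` is a textbook 1D convolution in which every output value sums over all input channels, and `channelwise_temporal_conv` is a per-channel variant included purely for contrast. Both function names and the toy inputs are assumptions, and the per-channel variant is not claimed to be the method the paper proposes.

```python
def temporal_conv(x, weight):
    """Conventional 1D temporal convolution (no padding, stride 1).

    x:      list of C_in channels, each a list of T values.
    weight: weight[o][c][k] for output channel o, input channel c, tap k.
    """
    c_in, t = len(x), len(x[0])
    k = len(weight[0][0])
    out = []
    for w_o in weight:
        row = []
        for start in range(t - k + 1):
            # The sum runs over ALL input channels, so channels are recombined.
            row.append(sum(w_o[c][j] * x[c][start + j]
                           for c in range(c_in) for j in range(k)))
        out.append(row)
    return out


def channelwise_temporal_conv(x, weight):
    """Per-channel variant (for contrast): weight[c][k], no channel mixing."""
    k = len(weight[0])
    return [[sum(w_c[j] * x_c[start + j] for j in range(k))
             for start in range(len(x_c) - k + 1)]
            for x_c, w_c in zip(x, weight)]


# Two toy channels ("latent concepts") over four time steps.
x = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
mixed = temporal_conv(x, [[[1.0, 1.0], [1.0, 1.0]]])
separate = channelwise_temporal_conv(x, [[1.0, 1.0], [1.0, 1.0]])
```

In `mixed`, each output value blends both channels, whereas `separate` keeps one output sequence per input channel. Stacking several layers of the first kind repeats that blending at every layer.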
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
Ranked #1 on Temporal Action Localization on J-HMDB-21
Then we apply the GCNs over the graph to model the relations among different proposals and learn powerful representations for the action classification and localization.
Ranked #3 on Temporal Action Localization on THUMOS’14
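As a rough illustration of applying a graph convolution over action proposals, the sketch below builds a proposal graph and runs one propagation step in plain Python. Connecting proposals by temporal IoU above a threshold, the row normalization, and the names `tiou`, `build_adjacency`, and `gcn_layer` are all assumptions made for this example. The abstract does not specify how the edges are defined.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def build_adjacency(proposals, threshold=0.5):
    """Row-normalized adjacency with self-loops (assumed edge rule: tIoU > threshold)."""
    n = len(proposals)
    adj = [[1.0 if i == j or tiou(proposals[i], proposals[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in adj]


def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(adj @ feats @ weight)."""
    n, d_in, d_out = len(feats), len(feats[0]), len(weight[0])
    # Aggregate neighbour features for each proposal.
    agg = [[sum(adj[i][j] * feats[j][d] for j in range(n)) for d in range(d_in)]
           for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(d_in)))
             for o in range(d_out)] for i in range(n)]


proposals = [(0.0, 10.0), (2.0, 11.0), (50.0, 60.0)]  # toy (start, end) in seconds
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]          # toy proposal features
identity = [[1.0, 0.0], [0.0, 1.0]]                   # stand-in for learned weights
adj = build_adjacency(proposals)
out = gcn_layer(adj, feats, identity)
```

The first two proposals overlap heavily, so their features are averaged together, while the isolated third proposal keeps its own features. In a trained model the weight matrix would be learned, and the output features would feed the classification and localization heads.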
To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes on the learned classification network to localize each action instance.
Ranked #1 on Temporal Action Localization on MEXaction2
In this report, we introduce the winning method for the HACS Temporal Action Localization Challenge 2019.
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos.
This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately.
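A toy example may help show why background frames are forced onto action classes. The sketch assumes a common weakly supervised setup in which per-frame class scores are pooled into a video-level prediction. The top-k mean pooling, the function names, and the extra background logit in the last lines are illustrative assumptions, not details taken from the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_scores(frame_logits, k):
    """Pool frame-level logits into one score per class via top-k mean (assumed)."""
    num_classes = len(frame_logits[0])
    pooled = []
    for c in range(num_classes):
        top = sorted((f[c] for f in frame_logits), reverse=True)[:k]
        pooled.append(sum(top) / len(top))
    return pooled


# Four frames, two action classes; the last two frames are background.
frame_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
video_scores = video_level_scores(frame_logits, k=2)

# With only action classes, a background frame still sums to probability 1
# over actions, so its mass necessarily lands on some action class.
background_probs = softmax(frame_logits[2])

# Hypothetical remedy: an extra background logit gives that mass somewhere to go.
with_background = softmax([0.0, 0.0, 3.0])
```

Here the background frame splits its probability evenly between the two action classes, even though it contains neither. Once a third, background logit is available, most of the mass moves there.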
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
In this work, we first identify two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation.