By maximizing the conditional class probability with respect to the attention weights, action and non-action frames become well separated.
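A minimal sketch of this idea, assuming per-frame features and a video-level label: attention-weighted temporal pooling feeds a classifier, and maximizing the video-level class probability sharpens the attention on action frames. The module names and dimensions are illustrative only.

```python
# Minimal sketch: attention-weighted temporal pooling for video-level
# classification. Maximizing the video-level class probability pushes the
# attention toward action frames and away from non-action (background) frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolClassifier(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.attention = nn.Linear(feat_dim, 1)       # per-frame attention score
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):                    # frame_feats: (T, feat_dim)
        attn = torch.softmax(self.attention(frame_feats), dim=0)  # (T, 1)
        video_feat = (attn * frame_feats).sum(dim=0)   # attention-weighted pooling
        return self.classifier(video_feat), attn

model = AttentionPoolClassifier()
feats = torch.randn(100, 1024)                         # 100 frames (illustrative)
logits, attn = model(feats)
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([3]))  # video-level label
loss.backward()                                        # gradients refine the attention
```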
We propose a classification module to generate action labels for each segment in the video, and a deep metric learning module to learn the similarity between different action instances.
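A rough sketch of such a two-module design, assuming precomputed segment features; the triplet loss stands in for the metric learning objective and all names and sizes are assumptions for illustration.

```python
# Sketch (assumed architecture): a per-segment classification head plus a
# metric learning head trained with a triplet loss, so embeddings of the same
# action class are pulled together and different classes pushed apart.
import torch
import torch.nn as nn

class SegmentModel(nn.Module):
    def __init__(self, feat_dim=1024, embed_dim=128, num_classes=20):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)   # classification module
        self.embedder = nn.Linear(feat_dim, embed_dim)        # metric learning module

    def forward(self, seg_feats):                             # seg_feats: (N, feat_dim)
        return self.classifier(seg_feats), self.embedder(seg_feats)

model = SegmentModel()
anchor, positive, negative = (torch.randn(8, 1024) for _ in range(3))
labels = torch.randint(0, 20, (8,))

logits, emb_a = model(anchor)
_, emb_p = model(positive)
_, emb_n = model(negative)

cls_loss = nn.functional.cross_entropy(logits, labels)        # segment action labels
metric_loss = nn.TripletMarginLoss(margin=1.0)(emb_a, emb_p, emb_n)
loss = cls_loss + metric_loss
```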
Annotating videos is cumbersome, expensive and not scalable.
Spatio-temporal action localization is a challenging yet fascinating task that aims to detect and classify human actions in video clips.
In this report, we introduce our winning method for the HACS Temporal Action Localization Challenge 2019.
This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately.
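To make the issue concrete, here is a small sketch of the standard weakly-supervised multiple-instance pooling it refers to, with top-k pooling as an assumed choice: only video-level labels supervise the model, so background frames have no class of their own and can only be explained through action classes.

```python
# Sketch of the usual weakly-supervised formulation: per-frame class scores are
# pooled (here top-k average, an illustrative choice) into a video-level
# prediction. Background frames are never supervised directly, so fitting the
# video-level label pushes their scores toward the action classes.
import torch
import torch.nn.functional as F

T, num_classes, k = 100, 20, 8
frame_scores = torch.randn(T, num_classes, requires_grad=True)   # per-frame class logits

topk_scores, _ = frame_scores.topk(k, dim=0)                     # (k, num_classes)
video_logits = topk_scores.mean(dim=0)                           # video-level logits
video_label = torch.zeros(num_classes)
video_label[3] = 1.0                                             # video-level action label

loss = F.binary_cross_entropy_with_logits(video_logits, video_label)
loss.backward()
```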
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
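A rough sketch in the spirit of this two-branch, single-stage design: a 3D branch extracts temporal features from the clip, a 2D branch extracts spatial features from the key frame, and a fused head predicts boxes and action probabilities in one pass. The backbones, channel sizes, and fusion below are illustrative assumptions, not the actual YOWO components.

```python
# Illustrative two-branch, single-stage detector sketch (stand-in layers only).
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    def __init__(self, num_classes=24, num_anchors=5):
        super().__init__()
        self.branch_3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)   # temporal branch (stand-in)
        self.branch_2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # spatial branch (stand-in)
        out_ch = num_anchors * (4 + 1 + num_classes)                  # box, objectness, classes
        self.head = nn.Conv2d(128, out_ch, kernel_size=1)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        temporal = self.branch_3d(clip).mean(dim=2)   # collapse time -> (B, 64, H, W)
        spatial = self.branch_2d(clip[:, :, -1])      # key frame   -> (B, 64, H, W)
        fused = torch.cat([temporal, spatial], dim=1)
        return self.head(fused)                       # dense box/action predictions

preds = TwoBranchDetector()(torch.randn(2, 3, 16, 224, 224))
print(preds.shape)                                    # (2, 145, 224, 224)
```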
We then apply GCNs over the graph to model the relations among different proposals and learn powerful representations for action classification and localization.
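A minimal sketch of a graph convolution over proposals, assuming precomputed proposal features; the similarity-based adjacency is an illustrative stand-in for however the proposal graph is actually constructed.

```python
# Proposals are nodes, edges encode their relations (here a simple
# feature-similarity adjacency as a stand-in), and one GCN layer aggregates
# neighbour features before classification/localization heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalGCNLayer(nn.Module):
    def __init__(self, in_dim=1024, out_dim=512):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                  # x: (N, in_dim), adj: (N, N)
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize
        return F.relu(self.weight(adj @ x))     # aggregate neighbours, then transform

proposals = torch.randn(30, 1024)               # 30 proposal features (illustrative)
sim = torch.relu(torch.cosine_similarity(       # assumed similarity-based edges
    proposals.unsqueeze(1), proposals.unsqueeze(0), dim=-1))
node_feats = ProposalGCNLayer()(proposals, sim)
print(node_feats.shape)                         # (30, 512)
```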
Our joint formulation has three terms: a classification term to ensure the separability of learned action features, an adapted multi-label center loss term to enhance action feature discriminability, and a counting loss term to delineate adjacent action sequences, leading to improved localization.
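A sketch of a three-term objective of this kind; the exact loss forms and weights below are assumptions for illustration, not the paper's definitions.

```python
# Illustrative three-term objective: a multi-label classification loss, a
# center-style loss pulling action features toward per-class centers, and a
# counting loss regressing the number of instances per class.
import torch
import torch.nn.functional as F

num_classes, feat_dim = 20, 128
centers = torch.randn(num_classes, feat_dim, requires_grad=True)  # learnable class centers

def total_loss(video_logits, video_labels, action_feats, feat_labels, count_pred, count_gt):
    cls_loss = F.binary_cross_entropy_with_logits(video_logits, video_labels)
    center_loss = ((action_feats - centers[feat_labels]) ** 2).sum(dim=1).mean()
    count_loss = F.mse_loss(count_pred, count_gt)
    return cls_loss + 0.5 * center_loss + 0.1 * count_loss        # weights are illustrative

loss = total_loss(
    video_logits=torch.randn(4, num_classes),
    video_labels=torch.randint(0, 2, (4, num_classes)).float(),
    action_feats=torch.randn(16, feat_dim),
    feat_labels=torch.randint(0, num_classes, (16,)),
    count_pred=torch.rand(4, num_classes),
    count_gt=torch.randint(0, 3, (4, num_classes)).float(),
)
loss.backward()  # the centers receive gradients through the center term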
SOTA for Action Classification on THUMOS’14
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
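A minimal sketch of learning a joint video-text embedding from (clip, narration) pairs with a simple symmetric contrastive objective; the encoders, feature dimensions, and temperature below are illustrative stand-ins, not the paper's exact model.

```python
# Dual-encoder sketch: project clip and narration features into a shared space
# and train with a contrastive loss where the matching narration is positive.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=1024, text_dim=300, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

model = DualEncoder()
video = torch.randn(32, 1024)                  # pooled clip features (illustrative)
text = torch.randn(32, 300)                    # narration features (illustrative)
v, t = model(video, text)

logits = v @ t.T / 0.07                         # similarity matrix with temperature
targets = torch.arange(32)                      # i-th narration matches i-th clip
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
```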
SOTA for Video Retrieval on MSR-VTT