13 papers with code • 1 benchmark • 1 dataset
Temporal Action Localization with weak supervision, where only video-level labels are given for training.
This formulation does not fully model the problem: to predict video-level labels accurately, background frames are forced to be misclassified as action classes.
In this work, we first identify two underexplored problems posed by weak supervision for temporal action localization, namely action completeness modeling and action-context separation.
Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles.
Ranked #2 on Weakly Supervised Action Localization on THUMOS'14
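To make the background-misclassification issue above concrete, here is a minimal sketch of the multiple-instance-learning (MIL) pooling commonly used in weakly supervised TAL. It assumes a PyTorch pipeline with hypothetical tensor shapes; the top-k value and the loss are illustrative, not any particular paper's settings. Because only the pooled video-level prediction is supervised, background snippets can receive gradient pressure toward action classes.

```python
import torch
import torch.nn.functional as F

def video_level_loss(snippet_scores, video_labels, k=8):
    """MIL-style pooling: aggregate snippet scores into a video-level
    prediction and apply a classification loss.

    snippet_scores: (B, T, C) raw class logits per snippet.
    video_labels:   (B, C) multi-hot video-level labels.
    """
    # Top-k mean over time: the k highest-scoring snippets per class
    # drive the video-level logit, so any snippet (even background)
    # can be pushed toward an action class to fit the video label.
    topk, _ = snippet_scores.topk(k, dim=1)          # (B, k, C)
    video_logits = topk.mean(dim=1)                  # (B, C)
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)

# Example: 2 videos, 100 snippets each, 20 action classes.
scores = torch.randn(2, 100, 20, requires_grad=True)
labels = torch.zeros(2, 20); labels[0, 3] = 1; labels[1, 7] = 1
loss = video_level_loss(scores, labels)
loss.backward()
```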
In this paper, we develop a novel weakly-supervised TAL framework, called AutoLoc, that directly predicts the temporal boundary of each action instance.
Our joint formulation has three terms: a classification term that ensures the separability of learned action features, an adapted multi-label center loss term that enhances action feature discriminability, and a counting loss term that delineates adjacent action sequences, leading to improved localization (a sketch of this objective follows below).
Ranked #1 on Action Classification on THUMOS'14
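A minimal sketch of how such a three-term objective could be combined in PyTorch. The loss weights, the learnable per-class centers, and the counting head are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

class JointLoss(torch.nn.Module):
    """Illustrative three-term objective: classification + multi-label
    center loss + counting loss."""

    def __init__(self, num_classes, feat_dim, w_center=0.1, w_count=0.1):
        super().__init__()
        # One learnable center per class for the center-loss term.
        self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.w_center, self.w_count = w_center, w_count

    def forward(self, video_feats, video_logits, count_pred, labels, counts):
        # (1) Classification term: separability of learned action features.
        cls = F.binary_cross_entropy_with_logits(video_logits, labels)
        # (2) Adapted multi-label center loss: pull each video feature
        # toward the centers of the classes present in that video.
        dists = torch.cdist(video_feats, self.centers) ** 2   # (B, C)
        center = (dists * labels).sum() / labels.sum().clamp(min=1)
        # (3) Counting term: regress the number of action instances,
        # which helps delineate adjacent action sequences.
        count = F.smooth_l1_loss(count_pred, counts)
        return cls + self.w_center * center + self.w_count * count

# Example usage with random stand-in tensors.
loss_fn = JointLoss(num_classes=20, feat_dim=1024)
feats, logits = torch.randn(4, 1024), torch.randn(4, 20)
labels = (torch.rand(4, 20) > 0.9).float()
count_pred, counts = torch.rand(4, 1) * 3, torch.randint(0, 4, (4, 1)).float()
loss = loss_fn(feats, logits, count_pred, labels, counts)
```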
Two types of triplets in the feature space are considered in our approach: one is used to learn discriminative features for each activity class, and the other is used to distinguish features where no activity occurs (i.e., background features) from activity-related features for each video.
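A hedged sketch of the two triplet types in PyTorch, using randomly generated stand-in features; the margin, the feature dimension, and the sampling of anchors, positives, and negatives are assumptions for illustration.

```python
import torch

# Hypothetical snippet features (D = 128): same-class activity pairs,
# a different-class negative, and background/activity features.
D = 128
anchor_cls, positive_cls = torch.randn(4, D), torch.randn(4, D)
negative_cls = torch.randn(4, D)
anchor_bg, positive_bg = torch.randn(4, D), torch.randn(4, D)
negative_act = torch.randn(4, D)

triplet = torch.nn.TripletMarginLoss(margin=0.5)

# Triplet type 1: pull same-class activity features together,
# push features of other classes away (class discrimination).
loss_cls = triplet(anchor_cls, positive_cls, negative_cls)

# Triplet type 2: pull background features together and push them
# away from activity features of the same video (background separation).
loss_bg = triplet(anchor_bg, positive_bg, negative_act)

loss = loss_cls + loss_bg
```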
We propose a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks.
We propose a classification module that generates action labels for each segment in the video, and a deep metric learning module that learns the similarity between different action instances (sketched below).
Ranked #1 on Temporal Action Localization on ActivityNet-1.2
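A minimal sketch of the two-module design, assuming PyTorch and hypothetical dimensions: a per-segment classifier plus an embedding head whose similarities are trained with a contrastive-style loss (the exact metric-learning loss used here is an assumption, not the paper's formulation).

```python
import torch
import torch.nn.functional as F

class WeakTALNet(torch.nn.Module):
    """Sketch of the two-module design: a per-segment classifier and
    an embedding head for deep metric learning."""

    def __init__(self, in_dim=1024, num_classes=20, emb_dim=128):
        super().__init__()
        self.classifier = torch.nn.Linear(in_dim, num_classes)
        self.embedder = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, segments):                    # (B, T, in_dim)
        logits = self.classifier(segments)          # per-segment labels
        emb = F.normalize(self.embedder(segments), dim=-1)
        return logits, emb

def similarity_loss(emb_a, emb_b, same_class, margin=0.5):
    """Pull embeddings of instances of the same action together,
    push different actions apart (contrastive-style sketch)."""
    sim = (emb_a * emb_b).sum(-1)                   # cosine similarity
    pos = (1 - sim) * same_class
    neg = F.relu(sim - margin) * (1 - same_class)
    return (pos + neg).mean()

net = WeakTALNet()
logits, emb = net(torch.randn(2, 50, 1024))
```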
Moreover, our temporal semi-soft and hard attention modules, which compute two attention scores for each video snippet, help the model focus on the less discriminative frames of an action so as to capture the full action boundary.
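One plausible reading of the semi-soft and hard attention scores, sketched in PyTorch; the drop ratio and the rule of masking the most discriminative snippets are assumptions for illustration.

```python
import torch

def semi_soft_and_hard_attention(soft_att, drop_ratio=0.2):
    """Given soft attention scores per snippet (B, T), compute two extra
    scores: 'semi-soft' zeroes out the most discriminative snippets but
    keeps soft values elsewhere; 'hard' zeroes them out and sets the rest
    to 1. Supervising video-level predictions made from these masked
    views pushes the model to attend to less discriminative frames."""
    T = soft_att.shape[1]
    k = max(1, int(T * drop_ratio))
    # Indices of the k most discriminative (highest-attention) snippets.
    topk_idx = soft_att.topk(k, dim=1).indices
    keep = torch.ones_like(soft_att)
    keep.scatter_(1, topk_idx, 0.0)                 # mask top snippets
    semi_soft = soft_att * keep
    hard = keep
    return semi_soft, hard

soft = torch.sigmoid(torch.randn(2, 100))           # example soft scores
semi_soft, hard = semi_soft_and_hard_attention(soft)
```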