15 papers with code • 5 benchmarks • 2 datasets
In this task, the training data consists of videos annotated only with the list of activities they contain, without any temporal boundary annotations. At test time, however, given a video, the algorithm must both recognize the activities in it and provide their start and end times.
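The input/output contract described above can be sketched as follows. All names here are illustrative assumptions: training supervision is video-level labels only, while test-time output is a set of (label, start, end) detections, shown here via a simple thresholding-and-merging step over per-frame class scores.

```python
# Sketch of the weakly supervised temporal action localization setup.
# Names and the thresholding scheme are illustrative, not any specific method.
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class TrainVideo:
    frames: List[list]   # per-frame features
    labels: Set[str]     # video-level activity labels only (no boundaries)

@dataclass
class Detection:
    label: str
    start: float         # seconds
    end: float

def localize(scores: List[Dict[str, float]], fps: float,
             thresh: float = 0.5) -> List[Detection]:
    """Threshold per-frame class scores and merge consecutive active
    frames into (label, start, end) segments — the required test-time
    output that training never directly supervises."""
    dets: List[Detection] = []
    for label in {k for s in scores for k in s}:
        active = [s.get(label, 0.0) >= thresh for s in scores]
        i = 0
        while i < len(active):
            if active[i]:
                j = i
                while j + 1 < len(active) and active[j + 1]:
                    j += 1
                dets.append(Detection(label, i / fps, (j + 1) / fps))
                i = j + 1
            i += 1
    return dets
```

For example, five frames at 1 fps with the "run" score above threshold on frames 1–3 yield a single detection spanning seconds 1.0 to 4.0.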
We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets THUMOS14 and ActivityNet.
Ranked #3 on Action Classification on THUMOS'14
This formulation does not fully model the problem: to predict video-level labels accurately, background frames are forced to be misclassified as action classes.
In this work, we first identify two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation.
Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles.
Ranked #2 on Weakly Supervised Action Localization on THUMOS'14
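One way such uncertainty modeling can suppress background interference is sketched below. This is an assumption about the general idea, not the paper's exact formulation: frames with small feature magnitude are treated as uncertain (likely background) and are simply excluded from the video-level aggregation, so they need not be forced into an action class.

```python
import numpy as np

def video_scores_with_uncertainty(feats: np.ndarray, W: np.ndarray,
                                  k: int = 3) -> np.ndarray:
    """Illustrative magnitude-based aggregation (a sketch, not the
    exact method): use the feature norm of each frame as an
    actionness/confidence proxy and average class logits over only
    the top-k most confident frames."""
    logits = feats @ W                    # (T, C) frame-level class logits
    mag = np.linalg.norm(feats, axis=1)   # (T,) magnitude as confidence proxy
    top = np.argsort(-mag)[:k]            # indices of the k largest magnitudes
    return logits[top].mean(axis=0)       # (C,) video-level class scores
```

With this scheme, low-magnitude background frames simply drop out of the video-level score instead of being pushed toward an action label.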
We propose 'Hide-and-Seek', a weakly-supervised framework that aims to improve object localization in images and action localization in videos.
Ranked #9 on Weakly Supervised Action Localization on THUMOS 2014
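The hiding idea can be sketched for the video case as follows. The chunking scheme and parameter names are illustrative: the frame sequence is split into chunks and each chunk is hidden at random during training, forcing the model to rely on parts beyond the single most discriminative segment.

```python
import random
from typing import List, Optional

def hide_and_seek(frames: List, grid: int = 8, p_hide: float = 0.5,
                  rng: Optional[random.Random] = None) -> List:
    """Hide-and-Seek-style augmentation (sketch): split the sequence
    into roughly `grid` chunks and blank out each chunk independently
    with probability `p_hide`."""
    rng = rng or random.Random()
    out = list(frames)
    chunk = max(1, len(frames) // grid)
    for start in range(0, len(frames), chunk):
        if rng.random() < p_hide:
            for i in range(start, min(start + chunk, len(frames))):
                out[i] = 0  # in practice, replace with the dataset mean
    return out
```

At test time no frames are hidden; the augmentation is applied only during training.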
Second, we propose an actor-based attention mechanism that enables localization of actions from action class labels and actor proposals, and is end-to-end trainable.
Our joint formulation has three terms: a classification term to ensure the separability of learned action features, an adapted multi-label center loss term to enhance action feature discriminability, and a counting loss term to delineate adjacent action sequences, leading to improved localization.
Ranked #1 on Action Classification on THUMOS'14
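The three-term objective can be sketched as below. The loss weights, shapes, and the exact form of each term are illustrative assumptions in the spirit of the description, not the paper's implementation: a multi-label classification term, a center loss pulling features of present classes toward per-class centers, and a squared-error counting term on instance counts.

```python
import numpy as np

def joint_loss(video_feats: np.ndarray, logits: np.ndarray,
               labels: np.ndarray, counts: np.ndarray,
               centers: np.ndarray, pred_counts: np.ndarray,
               lam_center: float = 0.1, lam_count: float = 0.1) -> float:
    """Sketch of a three-term weakly supervised objective:
    classification + multi-label center loss + counting loss."""
    # 1. multi-label classification term (binary cross-entropy)
    probs = 1.0 / (1.0 + np.exp(-logits))
    cls = -np.mean(labels * np.log(probs + 1e-8)
                   + (1 - labels) * np.log(1 - probs + 1e-8))
    # 2. adapted center loss: only over classes present in the video
    present = np.flatnonzero(labels)
    center = 0.0
    for c in present:
        center += np.sum((video_feats - centers[c]) ** 2)
    center /= max(1, len(present))
    # 3. counting term: match predicted to ground-truth instance counts
    count = np.mean((pred_counts - counts) ** 2)
    return cls + lam_center * center + lam_count * count
```

The counting term is what gives the model a handle on separating adjacent instances of the same action, which pure video-level classification cannot distinguish.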
Our approach considers two triplets in the feature space: one triplet is used to learn discriminative features for each activity class, and the other is used to distinguish activity-related features from features where no activity occurs (i.e., background features) in each video.
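Both triplets can be trained with a standard triplet margin loss, sketched below; the margin value and distance choice are assumptions, and the paper's exact formulation may differ. The class triplet takes anchor/positive from the same activity class and the negative from a different class; the background triplet takes anchor/positive from action regions of a video and the negative from its background frames.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 1.0) -> float:
    """Standard triplet margin loss (sketch): pull the anchor toward
    the positive and push it at least `margin` farther from the
    negative, in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, float(d_pos - d_neg + margin))
```

When the negative is already far enough away, the hinge clamps the loss to zero and the triplet contributes no gradient.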