To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance.
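As a rough illustration of the segment machinery around such a three-stage pipeline, the sketch below generates multi-scale candidate segments and applies temporal non-maximum suppression to scored proposals. The network internals are elided; the scores here would come from the 3D ConvNets, and all function names are hypothetical stand-ins, not the paper's implementation.

```python
def sliding_segments(num_frames, lengths=(16, 32, 64), stride=8):
    """Generate candidate (start, end) frame segments at multiple temporal scales."""
    segments = []
    for length in lengths:
        for start in range(0, num_frames - length + 1, stride):
            segments.append((start, start + length))
    return segments

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union

def nms(scored_segments, threshold=0.5):
    """Greedy temporal NMS: keep high-scoring segments, drop heavy overlaps."""
    kept = []
    for seg, score in sorted(scored_segments, key=lambda x: -x[1]):
        if all(temporal_iou(seg, k) < threshold for k, _ in kept):
            kept.append((seg, score))
    return kept
```

A proposal network would score every segment from `sliding_segments`, the classification/localization networks would assign per-class scores, and `nms` would then prune overlapping detections per class.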
Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it.
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos.
For evaluation, we adopt the TACoS dataset and build a new dataset for this task, called Charades-STA, by augmenting Charades with temporal sentence annotations.
To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks.
Previous methods address the problem by extracting features from video sliding windows and language queries and learning a shared subspace that encodes their correlation, an approach that ignores rich semantic cues about the activities in the videos and queries.
In this paper, we introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query.
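A minimal sketch of how such corpus-level moment retrieval can be framed: embed the query and each candidate moment in a shared space, then rank moments by similarity. The embeddings below are hypothetical stand-ins for learned video and language encoders, which this snippet does not model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_moments(query_vec, moment_index, top_k=3):
    """Rank (video_id, start, end) moments by similarity to the query vector."""
    scored = [(meta, cosine(query_vec, vec)) for meta, vec in moment_index]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
```

In a real system the index would hold moments from many untrimmed, unsegmented videos, and the encoders would be trained so that matching query/moment pairs score highest.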
Among other uses, VERA enables the localization of a shooter from just a few videos that include the sound of gunshots.