Understanding movies and their structural patterns is a crucial step in decoding the craft of video editing.
Video content creation continues to grow at a rapid pace, yet crafting engaging stories remains challenging and requires non-trivial video editing expertise.
To meet the needs of non-experts, we present Transcript-to-Video -- a weakly-supervised framework that uses text as input to automatically create video sequences from an extensive collection of shots.
To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval.
Active speaker detection requires a solid integration of multi-modal cues.
The proposed architecture relies on our fast spatial attention, a simple yet efficient modification of the popular self-attention mechanism that, by changing the order of operations, captures the same rich spatial context at a small fraction of the computational cost.
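The paper's exact formulation is not reproduced here, but the general idea of cutting attention cost by reordering operations can be sketched as follows: instead of materializing the N x N attention map in softmax(QK^T)V, a non-negative feature map phi lets one compute phi(K)^T V first, reducing the cost from O(N^2 d) to O(N d^2). The feature map below is a hypothetical stand-in, not the one used in the paper.

```python
import numpy as np

def standard_attention(Q, K, V):
    # O(N^2 * d): materializes the full N x N attention map.
    A = np.exp(Q @ K.T)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def reordered_attention(Q, K, V):
    # Illustrative linearized variant: with a non-negative feature map
    # phi (a hypothetical choice here, not the paper's), computing
    # phi(K).T @ V first costs O(N * d^2) instead of O(N^2 * d).
    phi = lambda X: np.maximum(X, 0) + 1e-6
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # d x d summary, independent of N
    Z = Qp @ Kp.sum(axis=0)          # per-row normalizer, shape (N,)
    return (Qp @ KV) / Z[:, None]
```

Both functions map (N, d) queries, keys, and values to an (N, d) output; only the order of the matrix products (and the choice of kernel) differs.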
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation.
Our results confirm the problems of the previous evaluation protocols and suggest that an IA-based protocol is better suited to the online scenario.
The problem of Online Human Behaviour Recognition in untrimmed videos, aka Online Action Detection (OAD), needs to be revisited.
RefineLoc shows competitive results with the state-of-the-art in weakly-supervised temporal localization.
In this paper, we introduce a novel active learning framework for temporal localization that aims to mitigate this data dependency issue.
The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.
Despite the recent progress in video understanding and the continuous rate of improvement in temporal action localization throughout the years, it is still unclear how far (or close?) we are to solving the problem.
The ActivityNet Large Scale Activity Recognition Challenge 2017: a summary of the results and the challenge participants' papers.
Despite the recent advances in large-scale video analysis, action detection remains one of the most challenging unsolved problems in computer vision.
To address this need, we propose the new problem of action spotting in video, which we define as finding a specific action in a video while observing a small portion of that video.
In many large-scale video analysis scenarios, one is interested in localizing and recognizing human activities that occur in short temporal intervals within long untrimmed videos.
In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize.