This paper proposes a self-supervised objective for learning representations that localize objects under occlusion, a property known as object permanence.
Our experiments demonstrate that, despite only capturing a small subset of the objects that move, this signal is enough to generalize to segment both moving and static instances of dynamic objects.
Reasoning about the future behavior of other agents is critical to safe robot navigation.
In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning.
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
To this end, we ask annotators to label objects that move at any point in the video, and to name them after the fact.
Indeed, even the majority of few-shot learning methods rely on a large set of "base classes" for pretraining.
Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.
To address this concern, we propose two new benchmarks for generic, moving object detection, and show that our model matches top-down methods on common categories, while significantly outperforming both top-down and bottom-up methods on never-before-seen categories.
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.
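The three cues above can be combined as complementary training signals. A minimal sketch, assuming each cue yields a per-pixel target mask and the cues are mixed as weighted mean-squared penalties (the function name, weights, and loss form are illustrative, not the paper's actual objective):

```python
import numpy as np

def segmentation_loss(motion_seg, appearance_seg, prev_seg, pred_seg,
                      w_motion=1.0, w_app=1.0, w_temp=0.5):
    """Hypothetical mix of the three cues as per-pixel penalties.

    motion_seg:     mask from independent object motion between two frames (cue i)
    appearance_seg: mask from an appearance-based model (cue ii)
    prev_seg:       mask predicted for the previous frame (cue iii)
    pred_seg:       current predicted mask, values in [0, 1]
    All arrays share the same HxW shape.
    """
    l_motion = np.mean((pred_seg - motion_seg) ** 2)    # motion complements recognition
    l_app = np.mean((pred_seg - appearance_seg) ** 2)   # appearance corrects motion errors
    l_temp = np.mean((pred_seg - prev_seg) ** 2)        # temporal consistency constraint
    return w_motion * l_motion + w_app * l_app + w_temp * l_temp
```

The weighting lets the appearance and temporal terms regularize the noisier motion signal.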
The module that builds a "visual memory" of the video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.
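A convolutional recurrent unit of this kind can be sketched as a single-channel convolutional GRU that folds the frame sequence into one memory map. The kernels below are random placeholders standing in for learned weights; the class and method names are illustrative, not from the paper:

```python
import numpy as np

def conv3x3(x, k):
    # 'same'-padded 3x3 convolution on a single-channel HxW map
    xp = np.pad(x, 1)
    H, W = x.shape
    out = np.zeros((H, W), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += k[di, dj] * xp[di:di + H, dj:dj + W]
    return out

class ConvGRUCell:
    """Minimal single-channel convolutional GRU: a sketch of how a recurrent
    unit can aggregate frames into a joint 'visual memory' representation."""
    def __init__(self, rng=None):
        rng = rng or np.random.default_rng(0)
        self.kz_x, self.kz_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
        self.kr_x, self.kr_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
        self.kh_x, self.kh_h = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

    def step(self, x, h):
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        z = sig(conv3x3(x, self.kz_x) + conv3x3(h, self.kz_h))  # update gate
        r = sig(conv3x3(x, self.kr_x) + conv3x3(h, self.kr_h))  # reset gate
        h_tilde = np.tanh(conv3x3(x, self.kh_x) + conv3x3(r * h, self.kh_h))
        return (1 - z) * h + z * h_tilde

    def run(self, frames):
        h = np.zeros_like(frames[0], dtype=float)  # empty memory
        for x in frames:                           # fold the video into the memory
            h = self.step(x, h)
        return h
```

In practice the cell would operate on multi-channel feature maps, but the gating logic is the same.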
The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved.
We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.
A relational linear program (RLP) is a declarative LP template defining the objective and the constraints through the logical concepts of objects, relations, and quantified variables.
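Once the template's quantified variables are grounded over a concrete set of objects and relations, the RLP becomes an ordinary LP. A minimal sketch, assuming a made-up template ("cover every `edge(o, p)` relation at minimum total cost") and using `scipy.optimize.linprog` as the solver:

```python
from scipy.optimize import linprog

# Hypothetical RLP template:
#   minimize   sum_{o in Object} cost(o) * x(o)
#   subject to x(o) + x(p) >= 1   for each edge(o, p)
#              0 <= x(o) <= 1
# Grounding the quantified template over concrete objects/relations
# produces a plain LP in one variable per object.
objects = ["a", "b", "c"]
cost = {"a": 1.0, "b": 2.0, "c": 1.5}
edges = [("a", "b"), ("b", "c")]

idx = {o: i for i, o in enumerate(objects)}
c = [cost[o] for o in objects]

# linprog expects A_ub @ x <= b_ub, so negate each >= constraint
A_ub, b_ub = [], []
for o, p in edges:
    row = [0.0] * len(objects)
    row[idx[o]] = -1.0
    row[idx[p]] = -1.0
    A_ub.append(row)
    b_ub.append(-1.0)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(objects))
```

Here the grounded LP attains its optimum (total cost 2.0) by selecting object `b`, which covers both edge relations; the declarative template itself never mentions the individual constraints.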