This paper introduces Click to Move (C2M), a novel framework for video generation in which the user controls the motion of the synthesized video through mouse clicks that specify simple trajectories for the key objects in the scene.
We formulate the entropy of a quantized artificial neural network as a differentiable function that can be plugged as a regularization term into the cost function minimized by gradient descent.
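One way to make the entropy of quantized weights differentiable is to replace hard binning with a soft (softmax) assignment of each weight to the quantization levels, yielding a soft histogram whose Shannon entropy admits gradients. The sketch below is a minimal illustration of this idea in pure Python; the function name `soft_entropy`, the squared-distance scoring, and the `temperature` parameter are assumptions for illustration, not the paper's exact formulation.

```python
import math

def soft_entropy(weights, levels, temperature=0.1):
    """Differentiable estimate of the entropy of `weights` quantized to `levels`.

    Each weight contributes a soft assignment (softmax over negative squared
    distances) to every quantization level; summing these assignments gives a
    soft histogram whose Shannon entropy can serve as a regularization term.
    A smaller `temperature` makes the assignment closer to hard quantization.
    """
    counts = [0.0] * len(levels)
    for w in weights:
        scores = [-(w - l) ** 2 / temperature for l in levels]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for i, e in enumerate(exps):
            counts[i] += e / z
    total = sum(counts)
    probs = [c / total for c in counts]
    # Shannon entropy in bits of the soft histogram
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Because every step is smooth, the same computation expressed with autodiff tensors can be added to a training loss, pushing the weight distribution toward low entropy (and hence better compressibility) during gradient descent.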
In this work we propose a novel deep learning approach for ultra-low bitrate video compression for video conferencing applications.
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
In this work, we tackle the problem of estimating a camera's capability to preserve fine texture details under a given lighting condition.
While unsupervised domain adaptation methods based on deep architectures have achieved remarkable success in many computer vision tasks, they rely on a strong assumption, i.e., that labeled source data is available.
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences.
To achieve this, we decouple appearance and motion information using a self-supervised formulation.
Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model, which is competitive with other unsupervised deep learning methods for depth prediction.
To implement this idea, we derive specialized deep models for each domain by adapting a pre-trained architecture but, unlike other methods, we propose a novel strategy to automatically adjust the computational complexity of the network.
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images.
Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa.
Our proposal is evaluated on the well-established KITTI dataset, where we show that our online method is competitive with state-of-the-art algorithms trained in a batch setting.
Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth.
In this paper we address the problems of detecting objects of interest in a video and of estimating their locations, solely from the gaze directions of people present in the video.
This is achieved through a deep architecture that decouples appearance and motion information.
In this paper, we address the problem of how to robustly train a ConvNet for regression, or deep robust regression.
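Robust regression typically replaces the squared-error loss with one that bounds the influence of outliers. As a concrete (and standard) example of such a loss, the sketch below implements the Huber loss, which is quadratic for small residuals and linear for large ones; this is a commonly used robust loss, not necessarily the one proposed in the paper.

```python
def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for residuals up to `delta`, linear beyond it.

    Large residuals (outliers) thus contribute a bounded gradient of
    magnitude `delta`, instead of the unbounded gradient of squared error,
    which stabilizes ConvNet training for regression.
    """
    r = pred - target
    if abs(r) <= delta:
        return 0.5 * r * r          # quadratic region
    return delta * (abs(r) - 0.5 * delta)  # linear region
```

For example, a residual of 0.5 gives `0.5 * 0.25 = 0.125`, while a residual of 3.0 gives only `1.0 * (3.0 - 0.5) = 2.5` rather than the `4.5` that squared error would assign.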
Deep learning revolutionized data science, and recently its popularity has grown exponentially, as has the number of papers employing deep networks.
Our approach enables a robot to learn and adapt its gaze control strategy for human-robot interaction without the use of external sensors or human supervision.