This paper introduces Click to Move (C2M), a novel framework for video generation in which the user controls the motion of the synthesized video through mouse clicks specifying simple trajectories for the key objects in the scene.
This paper presents solo-learn, a library of self-supervised methods for visual representation learning.
In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes.
In this paper, we study the task of source-free domain adaptation (SFDA), where the source data are not available during target adaptation.
In this paper we address multi-target domain adaptation (MTDA), where given one labeled source dataset and multiple unlabeled target datasets that differ in data distributions, the task is to learn a robust predictor for all the target domains.
Most deep UDA approaches operate in a single-source, single-target scenario, i.e., they assume that the source and the target samples arise from a single distribution.
In this work, we provide a general formulation of binary mask based models for multi-domain learning by affine transformations of the original network parameters.
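To make this formulation concrete, below is a minimal PyTorch sketch (not the exact parametrization used in the paper) of a convolutional layer whose frozen pretrained kernel is specialized to a new domain through a learned binary mask and a pair of affine coefficients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAffineConv2d(nn.Module):
    """Illustrative sketch: a conv layer whose pretrained weights are reused across
    domains through a learned binary mask and affine coefficients (the paper's exact
    parametrization may differ)."""

    def __init__(self, base_conv: nn.Conv2d):
        super().__init__()
        self.base_conv = base_conv
        for p in self.base_conv.parameters():
            p.requires_grad = False                              # keep the pretrained kernel frozen
        self.mask_scores = nn.Parameter(torch.zeros_like(base_conv.weight))  # real-valued mask logits
        self.k0 = nn.Parameter(torch.ones(1))                    # scale of the original kernel
        self.k1 = nn.Parameter(torch.zeros(1))                   # scale of the masked kernel

    def forward(self, x):
        # Straight-through binarization: hard 0/1 mask in the forward pass, identity gradient backward.
        hard_mask = (self.mask_scores > 0).float()
        mask = hard_mask + self.mask_scores - self.mask_scores.detach()
        # Domain-specific kernel as an affine transformation of the original parameters.
        weight = self.k0 * self.base_conv.weight + self.k1 * mask * self.base_conv.weight
        return F.conv2d(x, weight, self.base_conv.bias,
                        stride=self.base_conv.stride, padding=self.base_conv.padding)
```

In this sketch only the mask logits and the two scalars are learned per domain, so a new domain adds a small fraction of the original parameter count.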
While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework, leading to Variational STructured Attention networks (VISTA-Net).
In contrast to previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, and simply fusing the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner.
State-of-the-art performances in dense pixel-wise prediction tasks are obtained with specifically designed convolutional networks.
Pseudo-LiDAR-based methods for monocular 3D object detection have received considerable attention in the community due to the performance gains exhibited on the KITTI3D benchmark, in particular on the commonly reported validation split.
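For reference, the core operation behind pseudo-LiDAR pipelines is the back-projection of a predicted depth map into a 3D point cloud using the camera intrinsics; a minimal sketch (assuming intrinsics fx, fy, cx, cy from the calibration) is:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W), e.g. predicted by a monocular network,
    into an (N, 3) point cloud in camera coordinates -- the step that turns an
    image-based depth estimate into 'pseudo-LiDAR' input for 3D detectors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep pixels with valid (positive) depth
```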
Inspired by recent works on image inpainting, our proposed method leverages semantic segmentation to model the content and structure of the image, and learns the best shape and location of the object to insert.
In the case of LiDAR, in fact, domain shift is not only due to changes in the environment and in the object appearances, as for visual data from RGB cameras, but is also related to the geometry of the point clouds (e.g., point density variations).
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data.
While unsupervised domain adaptation methods based on deep architectures have achieved remarkable success in many computer vision tasks, they rely on a strong assumption, i.e., labeled source data must be available.
Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g., semantic segmentation, depth estimation).
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences.
The key idea of CuMix is to simulate the test-time domain and semantic shift using images and features from unseen domains and categories generated by mixing up the multiple source domains and categories available during training.
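A minimal sketch of this mixing idea follows (simplified to image-level mixing only, whereas the full method also mixes features and applies a curriculum on the mixing weights):

```python
import torch
import torch.nn.functional as F

def cross_domain_mixup_loss(model, images, labels, alpha=1.0):
    """Illustrative sketch: samples from different source domains (and classes) inside
    the same batch are interpolated, and the classifier is trained on the convex
    combination of their labels, simulating domain and semantic shift at training time."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))                 # random pairing across domains/classes
    mixed = lam * images + (1.0 - lam) * images[perm]
    logits = model(mixed)
    return lam * F.cross_entropy(logits, labels) + (1.0 - lam) * F.cross_entropy(logits, labels[perm])
```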
While convolutional neural networks have brought significant advances in robot vision, their ability is often limited to closed world scenarios, where the number of semantic concepts to be recognized is determined by the available training set.
In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks.
To achieve this, we decouple appearance and motion information using a self-supervised formulation.
Current strategies fail on this task because they do not consider a peculiar aspect of semantic segmentation: since each training step provides annotation only for a subset of all possible classes, pixels of the background class (i.e., pixels that do not belong to any other class) exhibit a semantic distribution shift.
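One way to account for this shift, sketched below under the assumption that channel 0 is the background class and the old classes occupy the first channels, is to let the background label absorb the probability mass of previously learned classes when computing the cross-entropy (a simplification of the unbiased losses used for incremental segmentation):

```python
import torch
import torch.nn.functional as F

def background_aware_ce(logits, targets, num_old_classes):
    """Sketch: at an incremental step, pixels of old classes are annotated as background,
    so the log-probability of the background label is taken as the total probability of
    background plus all previously learned classes."""
    log_probs = F.log_softmax(logits, dim=1)                                   # (B, C, H, W)
    # Log-probability mass of {background} ∪ {old classes}, computed stably.
    bkg_and_old = torch.logsumexp(log_probs[:, :num_old_classes + 1], dim=1)   # (B, H, W)
    adjusted = log_probs.clone()
    adjusted[:, 0] = bkg_and_old                                               # channel 0 = background
    return F.nll_loss(adjusted, targets)
```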
While expensive LiDAR and stereo camera rigs have enabled the development of successful 3D object detection methods, monocular RGB-only approaches lag much behind.
Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model, which is competitive with other unsupervised deep learning methods for depth prediction.
While today's robots are able to perform sophisticated tasks, they can only act on objects they have been trained to recognize.
To implement this idea we derive specialized deep models for each domain by adapting a pre-trained architecture but, unlike other methods, we propose a novel strategy to automatically adjust the computational complexity of the network.
Our proposal is evaluated on the well-established KITTI dataset, where we show that our online method is competitive with state-of-the-art algorithms trained in a batch setting.
Deep networks have brought significant advances in robot perception, improving the capabilities of robots in several visual tasks, ranging from object detection and recognition to pose estimation, semantic scene segmentation and many others.
The ability to categorize is a cornerstone of visual intelligence, and a key functionality for artificial, autonomous visual machines.
Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth data.
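A minimal sketch of this reconstruction formulation for a rectified stereo pair is shown below (L1 photometric term only; actual systems add SSIM, smoothness and left-right consistency terms, and the warping sign convention depends on the rectification):

```python
import torch
import torch.nn.functional as F

def photometric_reconstruction_loss(left, right, disparity):
    """Sketch: instead of regressing ground-truth depth, the network predicts a disparity
    map for the left image and is supervised by how well the right image, warped with
    that disparity, reconstructs the left image."""
    b, _, h, w = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as expected by grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).to(left.device)
    # Shift the horizontal coordinate by the predicted disparity (given in pixels).
    shifted = grid.clone()
    shifted[..., 0] = grid[..., 0] - 2.0 * disparity.squeeze(1) / (w - 1)
    warped = F.grid_sample(right, shifted, align_corners=True)
    return (left - warped).abs().mean()
```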
A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift.
This is achieved through a deep architecture that decouples appearance and motion information.
Our approach takes as input a natural image and exploits recent models for deep style transfer and generative adversarial networks to change its style in order to modify a specific high-level attribute.
The proposed architecture consists of two generative sub-networks jointly trained with adversarial learning for reconstructing the disparity map, organized in a cycle so as to provide mutual constraints and supervision to each other.
This novel dataset allows for testing the robustness of robot visual recognition algorithms to a series of different domain shifts both in isolation and unified.
A long-standing problem in visual object categorization is the ability of algorithms to generalize across different testing conditions.
Our method develops from the intuition that, given a set of different classification models associated to known domains (e.g., corresponding to multiple environments or robots), the best model for a new sample in the novel domain can be computed directly at test time by optimally combining the known models.
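A minimal sketch of this test-time combination follows, where `domain_scores` is a hypothetical module scoring how much a sample resembles each known domain (the optimal weighting scheme is the paper's contribution and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def combine_known_models(models, domain_scores, x):
    """Illustrative sketch: each classifier trained on a known domain contributes to the
    prediction for a sample of the unseen domain, weighted by the sample's soft
    membership to that domain."""
    weights = F.softmax(domain_scores(x), dim=-1)                           # (B, D) soft domain memberships
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models], dim=1)   # (B, D, num_classes)
    return (weights.unsqueeze(-1) * probs).sum(dim=1)                       # weighted ensemble prediction
```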
Our approach is based on the introduction of two main components, which can be embedded into any existing CNN architecture: (i) a side branch that automatically computes the assignment of a source sample to a latent domain and (ii) novel layers that exploit domain membership information to appropriately align the distribution of the CNN internal feature representations to a reference distribution.
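An illustrative sketch (not the exact layer proposed in the paper) of how such an alignment layer could use the soft domain assignments from the side branch to standardize features with per-latent-domain statistics, mapping all latent domains to a common zero-mean, unit-variance reference:

```python
import torch
import torch.nn as nn

class LatentDomainAlignment(nn.Module):
    """Sketch: features of each sample are standardized with a convex combination of
    per-latent-domain batch statistics, weighted by the soft domain-assignment
    probabilities produced by the side branch."""

    def __init__(self, num_channels, num_latent_domains, eps=1e-5):
        super().__init__()
        self.num_domains = num_latent_domains
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x, domain_probs):
        # x: (B, C, H, W) features; domain_probs: (B, D) soft assignments from the side branch.
        b, c, h, w = x.shape
        mean = torch.zeros(b, c, 1, 1, device=x.device)
        var = torch.zeros(b, c, 1, 1, device=x.device)
        for d in range(self.num_domains):
            w_d = domain_probs[:, d].view(b, 1, 1, 1)                        # per-sample weight for domain d
            norm = w_d.sum() * h * w + self.eps
            mu_d = (w_d * x).sum(dim=(0, 2, 3), keepdim=True) / norm        # weighted domain-d mean
            var_d = (w_d * (x - mu_d) ** 2).sum(dim=(0, 2, 3), keepdim=True) / norm
            mean = mean + w_d * mu_d
            var = var + w_d * var_d
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma.view(1, c, 1, 1) * x_hat + self.beta.view(1, c, 1, 1)
```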
Recent works have shown the benefit of integrating Conditional Random Fields (CRFs) models into deep architectures for improving pixel-level prediction tasks.
In this paper we address the problem of learning robust cross-domain representations for sketch-based image retrieval (SBIR).
Depth cues have proved very useful in various computer vision and robotic tasks.
Recent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection.
Here we take a different route, proposing to align the learned representations by embedding in any given network specific Domain Alignment Layers, designed to match the source and target feature distributions to a reference one.
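A minimal sketch in this spirit, with separate source and target batch statistics and shared affine parameters (simplified with respect to the actual layers):

```python
import torch
import torch.nn as nn

class DomainAlignmentBN(nn.Module):
    """Sketch: source and target activations are whitened with their own batch statistics,
    so both domains are mapped to the same zero-mean, unit-variance reference, while the
    learnable affine parameters are shared across domains."""

    def __init__(self, num_channels):
        super().__init__()
        self.bn_source = nn.BatchNorm2d(num_channels, affine=False)
        self.bn_target = nn.BatchNorm2d(num_channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, is_target: bool):
        x_hat = self.bn_target(x) if is_target else self.bn_source(x)
        return self.gamma * x_hat + self.beta
```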
Then, the learned feature representations are transferred to a second deep network, which receives as input an RGB image and outputs the detection results.
This paper addresses the problem of depth estimation from a single still image.
In this work, we show that it is possible to automatically retrieve the best style seeds for a given image, thus remarkably reducing the number of human attempts needed to find a good match.
In our overly-connected world, the automatic recognition of virality - the quality of an image or video to be rapidly and widely spread in social networks - is of crucial importance, and has recently awakened the interest of the computer vision community.
This paper presents an approach for semantic place categorization using data obtained from RGB cameras.
The empirical fact that classifiers trained on given data collections perform poorly when tested on data acquired in different settings is theoretically explained in domain adaptation through a shift between the source and target distributions.
A very popular approach for transductive multi-label recognition under linear classification settings is matrix completion.
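For context, here is a minimal NumPy sketch of generic matrix completion by iterative soft-thresholding of singular values (a standard nuclear-norm heuristic, not the specific formulation discussed in the paper):

```python
import numpy as np

def soft_impute(Y, observed_mask, tau=1.0, n_iters=100):
    """Sketch: complete a partially observed matrix Y (observed_mask is 1 where an
    entry is known) by alternating between filling unknown entries with the current
    estimate and shrinking the singular values of the result."""
    X = np.zeros_like(Y)
    for _ in range(n_iters):
        # Fill unknown entries with the current estimate, keep observed ones fixed.
        filled = observed_mask * Y + (1 - observed_mask) * X
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thresh = np.maximum(s - tau, 0.0)              # shrink singular values
        X = (U * s_thresh) @ Vt
    return X
```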
Recent studies in computer vision have shown that, while practically invisible to a human observer, skin color changes due to blood flow can be captured on face videos and, surprisingly, be used to estimate the heart rate (HR).
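For context, a very basic remote-PPG baseline (much simpler than the methods discussed in these works) averages the green channel of the face region over time and reads the heart rate from the dominant frequency in a physiologically plausible band:

```python
import numpy as np

def estimate_heart_rate(green_trace, fps):
    """Sketch: green_trace is the mean green-channel intensity of the face region per
    frame; the heart rate is taken from the strongest spectral peak between 0.7 and
    4 Hz (42-240 beats per minute)."""
    signal = np.asarray(green_trace, dtype=float)
    signal = signal - signal.mean()                        # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)                 # physiologically plausible range
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_freq                                # beats per minute
```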
We present a novel approach for jointly estimating targets' head and body orientations and conversational groups, called F-formations, from a distant social scene (e.g., a cocktail party captured by surveillance cameras).
We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes.
Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., a cocktail party) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels.