In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation.
Visual Transformers (VTs) are emerging as an architectural alternative to Convolutional Neural Networks (CNNs).
Controllable person image generation aims to produce realistic human images with desirable attributes (e.g., a given pose, clothing texture, or hair style).
In this paper we address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose.
Continual Learning (CL) aims to develop agents that emulate the human ability to sequentially learn new tasks while retaining knowledge from past experience.
Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives").
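The instance-discrimination setup described above can be made concrete with a minimal InfoNCE-style contrastive loss: the anchor embedding should score high against its positive (another augmentation of the same image) and low against the negatives. This is a generic sketch of the standard loss, not the specific objective of any one paper; the function name and shapes are illustrative.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss for a single anchor.

    anchor, positive: 1-D embeddings of two augmented views of the same
    image ("positives"); negatives: 2-D array with one embedding per row,
    taken from other images ("negatives").
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of the anchor to the positive (index 0) and each negative.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    logits = np.array(logits)
    # Softmax cross-entropy with the positive as the correct class.
    return float(-logits[0] + np.log(np.exp(logits).sum()))
```

The loss shrinks when the positive pair is aligned and the negatives are far, which is exactly the pressure that makes augmented views of the same instance cluster in embedding space.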
In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks.
In this paper, we propose to alleviate these problems by means of a novel gaze redirection framework which exploits both a numerical and a pictorial direction guidance, jointly with a coarse-to-fine learning strategy.
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images.
Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa.
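A common way to feed such a generator is to stack the source image xa with per-joint Gaussian heatmaps encoding the target pose P(xb). The sketch below shows only this conditioning step under that assumption; the function name, shapes, and heatmap encoding are illustrative and not the paper's exact architecture.

```python
import numpy as np

def make_generator_input(x_a, pose_b_keypoints, h, w, sigma=2.0):
    """Stack a source image with Gaussian heatmaps of target-pose joints.

    x_a: (H, W, 3) source appearance image.
    pose_b_keypoints: list of (row, col) joint locations extracted from
    a different image x_b. Returns an (H, W, 3 + num_joints) tensor, a
    typical input format for pose-guided generators.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = [np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
            for (r, c) in pose_b_keypoints]
    return np.concatenate([x_a] + [m[..., None] for m in maps], axis=-1)
```

Each heatmap channel peaks at its joint location, so the generator sees both "who" (the RGB channels of xa) and "where" (the pose channels from xb).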
Hashing methods have recently been found very effective for retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed.
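The fast-search step behind hashing-based retrieval is simply a Hamming-distance scan over compact binary codes. A minimal sketch, assuming 0/1 code vectors and a `hamming_search` helper name that is illustrative rather than any library's API:

```python
import numpy as np

def hamming_search(query_code, db_codes, k=5):
    """Return indices of the k database images whose binary hash codes
    are closest to the query in Hamming distance.

    query_code: (n_bits,) array of 0/1; db_codes: (n_images, n_bits).
    Counting differing bits is cheap, which is what makes hashing
    attractive for large RS archives.
    """
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    # Stable sort keeps database order among ties.
    return np.argsort(dists, kind="stable")[:k]
```

Real systems pack the bits into machine words and use popcount, but the retrieval logic is the same.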
A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift.
Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose.
In this paper we address the abnormality detection problem in crowded scenes.
Abnormal crowd behaviour detection attracts significant interest due to its importance in video surveillance scenarios.
In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities.
In this paper, we show that keeping track of the changes in the CNN feature across time can facilitate capturing the local abnormality.
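The core idea, tracking how CNN features change across time, can be sketched as a temporal-difference score over per-frame feature maps: regions whose features vary strongly between consecutive frames are flagged as locally abnormal. This is an illustration of the idea, not the paper's exact measure.

```python
import numpy as np

def temporal_change_score(feature_maps):
    """Score local abnormality as the magnitude of CNN-feature change
    over time.

    feature_maps: (T, H, W, C) per-frame feature tensors. Returns an
    (H, W) map that is high where features vary strongly across frames
    and zero where the scene is static.
    """
    diffs = np.diff(feature_maps, axis=0)          # (T-1, H, W, C) frame-to-frame changes
    return np.linalg.norm(diffs, axis=-1).mean(0)  # average change magnitude per location
```

Thresholding this map (or feeding it to a downstream classifier) localizes where in the frame the abnormality occurs, not just whether one occurred.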
The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training.
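The iterative select-then-train loop described above can be sketched as a self-paced procedure: score all candidates with the current model, keep only the most reliable fraction, and retrain on that subset. All names (`score_fn`, `train_fn`, `keep_frac`) are illustrative placeholders, not the paper's code.

```python
def self_paced_selection(samples, score_fn, train_fn, rounds=3, keep_frac=0.5):
    """Iteratively keep the most reliable samples and retrain on them.

    samples: candidate images/boxes; score_fn(model, sample) returns a
    reliability score (higher is better); train_fn(model, kept) trains
    on the kept subset and returns the updated model.
    """
    model = None
    for _ in range(rounds):
        # Rank all candidates by reliability under the current model.
        scored = sorted(samples, key=lambda s: score_fn(model, s), reverse=True)
        # Keep only the most reliable fraction (at least one sample).
        kept = scored[: max(1, int(len(scored) * keep_frac))]
        model = train_fn(model, kept)
    return model
```

As the model improves round by round, its reliability scores improve too, so later rounds can recover samples the first pass would have rejected.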
The combination of appearance-based static "objectness" (Selective Search), motion information (Dense Trajectories) and transductive learning (detectors are forced to "overfit" on the unsupervised data used for training) makes the proposed approach extremely robust.