Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pre-trained ViT models are commonly adapted to new tasks via fine-tuning.
Concretely, we propose "Test-Time Adaptation for Class-Incremental Learning" (TTACIL), which first fine-tunes the Layer Norm parameters of the PTM on each test instance to learn task-specific features, and then resets them to the base model to preserve stability.
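As an illustration, the following is a minimal PyTorch sketch of this adapt-then-reset loop; the per-instance objective (entropy minimization), step count, and learning rate are assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: per-instance test-time adaptation of LayerNorm parameters,
# assuming a ViT-style model returning logits and an entropy-minimization
# objective (illustrative assumptions, not TTACIL's exact settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

def ln_params(model: nn.Module):
    # Collect only the LayerNorm affine parameters.
    return [p for m in model.modules() if isinstance(m, nn.LayerNorm)
            for p in m.parameters()]

def adapt_and_predict(model: nn.Module, x: torch.Tensor, steps: int = 1, lr: float = 1e-3):
    params = ln_params(model)
    backup = [p.detach().clone() for p in params]   # snapshot base-model LN weights
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):                          # fine-tune LN on this test instance
        probs = F.softmax(model(x), dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        pred = model(x).argmax(-1)
        for p, b in zip(params, backup):            # reset LN to the base model
            p.copy_(b)
    return pred
```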
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset without accessing the actual source data.
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and allow for an intuitive and powerful user interface to drive the image generation process.
In this paper, we address the well-known image quality assessment problem, but in contrast to existing approaches that predict image quality independently for each image, we propose to jointly model different images depicting the same content to improve the precision of quality estimation.
The class affinity matrix is introduced as the first layer of the source model to make it compatible with the target label maps, and the source model is then further fine-tuned for the target domain.
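A minimal sketch of how such an affinity matrix could be realized, assuming it acts as a learnable 1x1 convolution mapping the source model's per-pixel class scores onto the target label space (the class counts and wiring are illustrative assumptions):

```python
# Sketch: a class affinity matrix attached to a frozen or trainable source
# segmentation model, assumed here to remap source-class scores to target
# classes via a 1x1 convolution (names and shapes are illustrative).
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    def __init__(self, source_model: nn.Module, n_src_classes: int, n_tgt_classes: int):
        super().__init__()
        self.source_model = source_model
        # C_tgt x C_src affinity, realized as a 1x1 convolution over class scores.
        self.affinity = nn.Conv2d(n_src_classes, n_tgt_classes, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        src_scores = self.source_model(x)   # (B, C_src, H, W)
        return self.affinity(src_scores)    # (B, C_tgt, H, W)
```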
Departing from the common notion of transferring only the target "texture" information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a synthetic target dataset with photo-realistic images that not only faithfully depict the style of the target domain, but are also characterized by novel scenes in diverse contexts.
The key to learning such game AI is a large and diverse text corpus, collected in this work, that describes detailed in-game actions and is used to train our animation model.
To allow for more control, image synthesis can be conditioned on semantic segmentation maps that tell the generator where to place objects in the image.
In this work, we address the task of unconditional head motion generation to animate still human faces in a low-dimensional semantic space from a single reference pose.
In this work we address multi-target domain adaptation (MTDA) in semantic segmentation, which consists in adapting a single model from an annotated source dataset to multiple unannotated target datasets that differ in their underlying data distributions.
In this work, we propose a novel architecture for face age editing that can produce structural modifications while maintaining relevant details present in the original image.
Our experiments show the effectiveness of our segmentation approach on thousands of real-world point clouds.
Therefore, we present a new yet practical online setting for Unsupervised Domain Adaptation for person Re-ID with two main constraints: Online Adaptation and Privacy Protection.
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
This paper introduces Click to Move (C2M), a novel framework for video generation where the user can control the motion of the synthesized video through mouse clicks that specify simple trajectories for the key objects in the scene.
We formulate the entropy of a quantized artificial neural network as a differentiable function that can be plugged as a regularization term into the cost function minimized by gradient descent.
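A minimal sketch of one way to make that entropy differentiable, assuming weights are softly assigned to quantization levels through a temperature-controlled softmax (the paper's exact estimator may differ):

```python
# Sketch: differentiable entropy of quantized weights. Each weight is softly
# assigned to the quantization levels, level probabilities are estimated from
# the soft assignments, and their entropy (in bits) can be added to the loss.
import torch

def soft_entropy(weights: torch.Tensor, levels: torch.Tensor, temperature: float = 10.0):
    w = weights.reshape(-1, 1)                        # (N, 1)
    dist = (w - levels.reshape(1, -1)).abs()          # (N, L) distance to each level
    assign = torch.softmax(-temperature * dist, dim=1)  # soft one-hot assignment
    p = assign.mean(0).clamp_min(1e-8)                # empirical level probabilities
    return -(p * p.log2()).sum()                      # differentiable entropy estimate

# illustrative usage: loss = task_loss + lambda_reg * soft_entropy(layer.weight, levels)
```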
Therefore, we propose to increase the network capacity by using an adaptive graph structure.
In this work we propose a novel deep learning approach for ultra-low bitrate video compression for video conferencing applications.
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
In this work, we tackle the problem of estimating a camera's capability to preserve fine texture details under a given lighting condition.
While unsupervised domain adaptation methods based on deep architectures have achieved remarkable success in many computer vision tasks, they rely on a strong assumption, i.e., labeled source data must be available.
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences.
To overcome this limitation, we propose a self-supervised deep learning method for co-part segmentation.
To achieve this, we decouple appearance and motion information using a self-supervised formulation.
Extensive experiments on the publicly available KITTI, Cityscapes, and ApolloScape datasets demonstrate the effectiveness of the proposed model, which is competitive with other unsupervised deep learning methods for depth prediction.
To implement this idea we derive specialized deep models for each domain by adapting a pre-trained architecture but, unlike other methods, we propose a novel strategy to automatically adjust the computational complexity of the network.
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images.
Specifically, given an image x_a of a person and a target pose P(x_b) extracted from a different image x_b, we synthesize a new image of that person in pose P(x_b) while preserving the visual details in x_a.
Our proposal is evaluated on the well-established KITTI dataset, where we show that our online method is competitive with state-of-the-art algorithms trained in a batch setting.
Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth.
In this paper we address the problems of detecting objects of interest in a video and of estimating their locations, solely from the gaze directions of people present in the video.
This is achieved through a deep architecture that decouples appearance and motion information.
In this paper, we address the problem of how to robustly train a ConvNet for regression, or deep robust regression.
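One common route to such robustness is swapping the L2 objective for a loss with bounded influence on outliers; below is a minimal PyTorch sketch using the Huber loss (the loss actually used in the paper may differ, and the network here is a toy stand-in):

```python
# Sketch: training a small ConvNet regressor with a robust (Huber) loss,
# which is quadratic near zero and linear for large residuals, so outlier
# targets contribute bounded gradients (all components are illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
criterion = nn.HuberLoss(delta=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 3, 32, 32), torch.randn(8, 1)   # toy batch
loss = criterion(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```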
Deep learning revolutionized data science, and recently its popularity has grown exponentially, as has the number of papers employing deep networks.
Our approach enables a robot to learn and adapt its gaze control strategy for human-robot interaction without using external sensors or human supervision.