Firstly, randomly masked face images are used to train the reconstruction module in FaceMAE.
We propose a unified look at jointly learning multiple vision tasks and visual domains through universal representations, a single deep neural network.
In this work, we address two key limitations of such representations, in failing to capture local 3D geometric fine details, and to learn from and generalize to shapes with unseen 3D transformations.
Dataset condensation aims at reducing the network training effort through condensing a cumbersome training set into a compact synthetic one.
Despite the recent advances in multi-task learning of dense prediction problems, most methods rely on expensive labelled datasets.
Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects, which is essential for full scene understanding.
Computational cost of training state-of-the-art deep models in many learning problems is rapidly increasing due to more sophisticated models and larger datasets.
In this paper, we look at the problem of cross-domain few-shot classification that aims to learn a classifier from previously unseen classes and domains with few labeled samples.
Deep learning approaches heavily rely on high-quality human supervision which is nonetheless expensive, time-consuming, and error-prone, especially for image segmentation task.
In this paper, we look at the problem of few-shot classification that aims to learn a classifier for previously unseen classes and domains from few labeled samples.
In many machine learning problems, large-scale datasets have become the de-facto standard to train state-of-the-art deep networks at the price of heavy computation load.
Understanding the 3D world without supervision is currently a major challenge in computer vision as the annotations required to supervise deep networks for tasks in this domain are expensive to obtain on a large scale.
Deep anomaly detection is a difficult task since, in high dimensions, it is hard to completely characterize a notion of "differentness" when given only examples of normality.
As the state-of-the-art machine learning methods in many fields rely on larger datasets, storing datasets and training models on them become significantly more expensive.
With the explosion of digital data in recent years, continuously learning new tasks from a stream of data without forgetting previously acquired knowledge has become increasingly important.
While recent techniques in domain adaptation and multi-domain learning enable the learning of more domain-agnostic features, their success relies on the presence of domain labels, typically requiring manual annotation and careful curation of datasets.
Recent semi-supervised learning methods have shown to achieve comparable results to their supervised counterparts while using only a small portion of labels in image classification tasks thanks to their regularization strategies.
Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them.
In this paper, we are rather interested by the locations of an image that contribute to the model's training.
Image deconvolution is the process of recovering convolutional degraded images, which is always a hard inverse problem because of its mathematically ill-posed property.
Equivariance to random image transformations is an effective method to learn landmarks of object categories, such as the eyes and the nose in faces, without manual supervision.
Ranked #1 on Unsupervised Facial Landmark Detection on 300W
The use of computational methods to evaluate aesthetics in photography has gained interest in recent years due to the popularization of convolutional neural networks and the availability of new annotated datasets.
We propose KeypointGAN, a new method for recognizing the pose of objects from a single image that for learning uses only unlabelled videos and a weak empirical prior on the object poses.
Detecting temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision including frame-level labels.
We propose a new approach to model and learn, without manual supervision, the symmetries of natural objects, such as faces or flowers, given only images as input.
We propose a method for learning landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision.
Ranked #1 on Unsupervised Facial Landmark Detection on MAFL
A practical limitation of deep neural networks is their high degree of specialization to a single task and visual domain.
One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations.
Ranked #3 on Unsupervised Facial Landmark Detection on AFLW-MTFL
There is a growing interest in learning data representations that work well for many different types of problems and data.
Learning automatically the structure of object categories remains an important open problem in computer vision.
Ranked #2 on Unsupervised Facial Landmark Detection on AFLW-MTFL
With the advent of large labelled datasets and high-capacity models, the performance of machine vision systems has been improving rapidly.
Ranked #14 on Continual Learning on visual domain decathlon (10 tasks)
This is a powerful idea because it allows to convert any video to an image so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos.
On action classification, our method obtains 60. 3\% on the UCF101 dataset using only UCF101 data for training which is approximately 10% better than current state-of-the-art self-supervised learning methods.
Ranked #39 on Self-Supervised Action Recognition on UCF101
Modern discriminative predictors have been shown to match natural intelligences in specific perceptual tasks in image classification, object and part detection, boundary extraction, etc.
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used.
Ranked #58 on Action Recognition on HMDB-51
Weakly supervised learning of object detection is an important problem in image understanding that still does not have a satisfactory solution.
Ranked #3 on Weakly Supervised Object Detection on HICO-DET
However, as learning appearance and localization are two interconnected tasks, the optimization is not convex and the procedure can easily get stuck in a poor local minimum, the algorithm "misses" the object in some images.
In classification of objects substantial work has gone into improving the low level representation of an image by considering various aspects such as different features, a number of feature pooling and coding techniques and considering different kernels.