We propose a soft-label sorting network alongside the counting network, which sorts the given images by their crowd counts.
In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, thereby incorporating a self-supervised pretext task into standard supervised learning.
Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues.
Ranked #2 on Fine-Grained Image Classification on Stanford Dogs
However, simply removing the PEs may not only harm the convergence and accuracy of ViTs but also place the model at more severe privacy risk.
In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which assigns a different learning rate to each layer of a neural network to create a data-free curriculum during the initial training epochs.
Ranked #1 on Speech Emotion Recognition on CREMA-D
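For the LeRaC entry above, a minimal PyTorch sketch of the general idea, assuming a simple geometric schedule: deeper layers start with smaller learning rates, and every per-layer rate is annealed toward the shared base rate over the first epochs. The layer grouping, decay factor 0.1, and warmup length are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

# Toy model; only modules with trainable parameters get their own LR group.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
base_lr, warmup_epochs = 1e-3, 5

layers = [m for m in model if any(p.requires_grad for p in m.parameters())]
param_groups = [
    {"params": m.parameters(), "lr": base_lr * (0.1 ** i)}  # deeper -> smaller start
    for i, m in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)

def update_curriculum(epoch: int) -> None:
    """Geometrically interpolate each layer's LR from its start value to base_lr."""
    t = min(epoch / warmup_epochs, 1.0)
    for i, group in enumerate(optimizer.param_groups):
        start_lr = base_lr * (0.1 ** i)
        group["lr"] = start_lr * (base_lr / start_lr) ** t  # t=0: start, t=1: base
```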
Inspired by the way the human visual system mines local patterns, we propose a new framework called RK-Net to jointly learn the discriminative Representation and detect salient Keypoints with a single Network.
Ranked #2 on Image-Based Localization on cvusa
Inspired by human memory, we propose to represent history using only the important changes in the environment and, in our approach, to obtain this representation automatically via self-supervision.
Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning.
In this paper, we tackle the problem of synthesizing a ground-view panorama image conditioned on a top-view aerial image, which is challenging due to the large gap between the two image domains with different viewpoints.
Following this line of work, we propose a new hyperbolic-based model for metric learning.
Ranked #1 on Metric Learning on CUB-200-2011
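As a hedged illustration of the hyperbolic setting, the sketch below implements the standard geodesic distance on the Poincaré ball, the metric that hyperbolic metric-learning models typically optimize over; the paper's specific architecture and loss are not reproduced here.

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance on the Poincare ball (assumes ||u||, ||v|| < 1).
    Points near the boundary get exponentially more 'room', which is why
    hyperbolic spaces suit hierarchical metric learning."""
    sq_dist = torch.sum((u - v) ** 2, dim=-1)
    u_fac = torch.clamp(1 - torch.sum(u ** 2, dim=-1), min=eps)
    v_fac = torch.clamp(1 - torch.sum(v ** 2, dim=-1), min=eps)
    return torch.acosh(1 + 2 * sq_dist / (u_fac * v_fac))
```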
During local training, the DFS are used to synthesize novel domain statistics via the proposed domain hallucination, which is achieved by re-weighting the DFS with random weights.
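A minimal sketch of this idea, assuming the DFS are stored per-domain channel means and standard deviations: random convex (Dirichlet) weights mix the stored statistics into novel ones, which then re-normalize features AdaIN-style. The exact sampling scheme in the paper may differ.

```python
import torch

def hallucinate_statistics(dfs_mean: torch.Tensor, dfs_std: torch.Tensor):
    """Mix stored per-domain statistics, shape (num_domains, C), into novel
    ones using random convex weights (a hedged reading of 'random weights')."""
    k = dfs_mean.size(0)
    w = torch.distributions.Dirichlet(torch.ones(k)).sample()  # sums to 1
    return (w[:, None] * dfs_mean).sum(0), (w[:, None] * dfs_std).sum(0)

def restylize(feat: torch.Tensor, new_mean: torch.Tensor, new_std: torch.Tensor,
              eps: float = 1e-5) -> torch.Tensor:
    """AdaIN-style re-normalization of (N, C, H, W) features to the
    hallucinated statistics."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    return (feat - mu) / sigma * new_std[None, :, None, None] + new_mean[None, :, None, None]
```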
In this manner, the model will focus on reducing the inter-modality discrepancy while paying less attention to intra-identity variations, leading to a more effective modality alignment.
To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module.
Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and exploit this relation for object feature refinement and better SGG.
Specifically, the detail modeling focuses on capturing object edges under the supervision of an explicitly decomposed detail label, which consists of the pixels on and near the edge.
Computing the matrix square root and its inverse in a differentiable manner is important in a variety of computer vision tasks.
Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution.
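For reference, a compact PyTorch sketch of the coupled Newton-Schulz iteration mentioned above, which approximates both the square root and inverse square root of an SPD matrix in a fully differentiable way; the pre-normalization step keeps the iteration in its convergence region. This is the textbook formulation, not the paper's specific method.

```python
import torch

def sqrt_newton_schulz(A: torch.Tensor, num_iters: int = 10):
    """Differentiable matrix square root of an SPD matrix A via coupled
    Newton-Schulz iterations; returns (A^{1/2}, A^{-1/2})."""
    I = torch.eye(A.size(0), dtype=A.dtype, device=A.device)
    norm = torch.linalg.norm(A)  # Frobenius norm for pre-normalization
    Y, Z = A / norm, I.clone()
    for _ in range(num_iters):
        T = 0.5 * (3 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * norm.sqrt(), Z / norm.sqrt()
```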
Specifically, we propose a novel geometry-contrastive Transformer with an efficient ability to perceive the 3D structure of global geometric inconsistencies across the given meshes.
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes.
However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of great interest for many computer graphics applications.
We propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation that requires little manual annotation while being applicable to a wide variety of manipulations.
Specifically, in the image translation stage, Bi-Mix leverages the knowledge of day-night image pairs to improve the quality of nighttime image relighting.
The global alignment network aims to transfer the input image from the source domain to the target domain.
Instead, we introduce AniFormer, a novel Transformer-based architecture that generates animated 3D sequences by directly taking the raw driving sequences and arbitrary same-type target meshes as inputs.
To ease this problem, we propose a novel two-stage framework with a new Cascaded Cross MLP-Mixer (CrossMLP) sub-network in the first stage and a refined pixel-level loss in the second stage.
Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle significantly improves model performance on unseen real domains and achieves the state of the art.
The ISF manipulates the semantics of an input latent code so that the image generated from it lies in the desired visual domain.
In this paper, we address the task of layout-to-image translation, which aims to translate an input semantic layout to a realistic image.
With the strength of deep generative models, 3D pose transfer has regained intensive research interest in recent years.
This paper presents solo-learn, a library of self-supervised methods for visual representation learning.
Both generators are mutually connected and trained in an end-to-end fashion and explicitly form three cycled subnets, i.e., one image generation cycle and two guidance generation cycles.
In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes.
In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation.
This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data are scarce.
Second, CPSS can reduce the influence of noisy pseudo-labels and also avoid the model overfitting to the target domain during self-supervised learning, consistently boosting the performance on the target and open domains.
Controllable person image generation aims to produce realistic human images with desirable attributes (e.g., the given pose, cloth textures or hair style).
In this paper, we study the task of source-free domain adaptation (SFDA), where the source data are not available during target adaptation.
2D image-based virtual try-on has attracted increased attention from the multimedia and computer vision communities.
In this paper we address multi-target domain adaptation (MTDA), where given one labeled source dataset and multiple unlabeled target datasets that differ in data distributions, the task is to learn a robust predictor for all the target domains.
Ranked #1 on Multi-target Domain Adaptation on Office-Home
While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation.
Ranked #4 on Depth Estimation on NYU-Depth V2
This paper considers the problem of unsupervised person re-identification (re-ID), which aims to learn discriminative models with unlabeled data.
Moreover, our method also achieves competitive performance compared with recent works on existing vehicle ReID datasets including VehicleID, VeRi-776 and VERI-Wild.
Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs.
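A minimal sketch of easy-to-hard ordering, assuming per-sample loss under a reference model as the difficulty proxy and a simple linear pacing function; both choices are illustrative, since curriculum methods differ in how they score difficulty and grow the training pool.

```python
import torch

def curriculum_order(difficulty: torch.Tensor) -> torch.Tensor:
    """Indices sorted easy-to-hard; `difficulty` could be per-sample loss
    under a pretrained reference model (one common heuristic)."""
    return torch.argsort(difficulty)  # ascending: easiest first

difficulty = torch.rand(1000)            # stand-in difficulty scores
order = curriculum_order(difficulty)
for epoch in range(5):
    frac = min(1.0, 0.4 + 0.15 * epoch)  # linearly grow the pool (illustrative pacing)
    pool = order[: int(frac * len(order))]
    # ... draw mini-batches from `pool` and train as usual ...
```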
In contrast to previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, and simply fusing the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner.
Specifically, we propose a task-driven similarity metric based on samples' mutual enhancement, referred to as co-fine-tune similarity, which can find a more efficient subset of data for training the expert network.
We propose the Semantically-Adaptive UpSampling (SA-UpSample), a general and highly effective upsampling method for the layout-to-image translation task.
In this paper, we investigate the usage of CNNs that are designed to work directly with the DCT coefficients available in JPEG compressed images, proposing handcrafted and data-driven techniques for reducing the computational complexity and the number of parameters of these models, in order to keep their computational cost similar to that of their RGB baselines.
In this paper, we study the problem of multi-source domain generalization in ReID, which aims to learn a model that can perform well on unseen domains given only a few labeled source domains.
By providing real image samples with traffic context to the network, the model learns to detect and classify elements of interest, such as pedestrians, traffic signs, and traffic lights.
In the case of LiDAR, in fact, domain shift is not only due to changes in the environment and in the object appearances, as for visual data from RGB cameras, but is also related to the geometry of the point clouds (e.g., point density variations).
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM), to capture semantic structure attention in the spatial and channel dimensions, respectively.
In this paper we propose the use of an image retrieval system to assist the image-to-image translation task.
Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation.
We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the challenging person image generation task.
Ranked #1 on Pose Transfer on Market-1501 (PCKh metric)
In this paper we address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose.
The method does not aim to outperform training with real data, but to serve as a compatible alternative when real data are not available.
We propose a novel Generative Adversarial Network (XingGAN or CrossingGAN) for person image generation tasks, i.e., translating the pose of a given person to a desired one.
Ranked #1 on Pose Transfer on Market-1501 (IS metric)
Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives").
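A compact sketch of the contrastive objective described above, in its standard InfoNCE form: matching augmentations sit on the diagonal of a similarity matrix and all other rows serve as negatives. The temperature and the single-direction simplification are illustrative choices.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """One-directional InfoNCE: z1[i] and z2[i] embed two augmentations of
    the same image (positives); every other row of z2 is a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                           # (N, N) scaled cosine sims
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```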
Then the activated dictionary atoms are assembled and passed to the compound dictionary learning and coding layers.
We demonstrate that the proposed method is able to boost the performance of existing pose estimation pipelines on our HiEve dataset.
In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks.
In this paper, a fully automatic technique for labelling an image-based gaze behavior dataset for driver gaze zone estimation is proposed.
In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes.
In this paper, we propose to alleviate these problems by means of a novel gaze redirection framework which exploits both a numerical and a pictorial direction guidance, jointly with a coarse-to-fine learning strategy.
The binary neural network, which largely saves storage and computation, serves as a promising technique for deploying deep models on resource-limited devices.
To tackle the first challenge, we propose to use the edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module.
Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images.
To achieve this, we decouple appearance and motion information using a self-supervised formulation.
Ranked #1 on Video Reconstruction on Tai-Chi-HD
In the first stage, the input image and the conditional semantic guidance are fed into a cycled semantic-guided generation network to produce initial coarse results.
We show that a standard neuron followed by the novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
Ranked #5 on Speech Emotion Recognition on CREMA-D
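To make the XOR claim concrete, here is a small self-contained experiment with a single linear unit followed by a non-monotonic, bump-shaped activation. The Gaussian bump below is a stand-in for ADA (the exact ADA formula is given in the paper); the point is that a non-monotonic activation lets one neuron separate XOR, which no monotonic activation can.

```python
import torch
import torch.nn as nn

def bump(z: torch.Tensor) -> torch.Tensor:
    # Non-monotonic stand-in activation: peaks at z = 1, decays on both sides.
    return torch.exp(-(z - 1) ** 2)

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])
w, b = nn.Parameter(torch.randn(2, 1)), nn.Parameter(torch.zeros(1))
opt = torch.optim.Adam([w, b], lr=0.1)

for _ in range(2000):
    opt.zero_grad()
    out = bump(x @ w + b)
    loss = nn.functional.mse_loss(out, y)
    loss.backward()
    opt.step()
# With e.g. w = (2, 2), b = -1 the pre-activations are -1, 1, 1, 3, so only
# the two XOR-positive inputs land near the bump's peak; training usually
# recovers such a solution (rerun on an unlucky initialization).
print((out > 0.5).float().squeeze())
```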
The deep learning revolution happened thanks to the availability of massive amounts of labelled data, which have contributed to the development of models with extraordinary inference capabilities.
To tackle this issue, in this work we consider learning the scene generation in a local context and correspondingly design a local class-specific generative network guided by semantic maps, which separately constructs and learns sub-generators concentrating on the generation of different classes and is able to provide more scene details.
In this paper, we analyze the limitations of existing symmetric GAN models in asymmetric translation tasks, and propose an AsymmetricGAN model with translation and reconstruction generators of unequal sizes and a different parameter-sharing strategy, to adapt to the asymmetric need in both unsupervised and supervised image-to-image translation tasks.
The proposed model consists of a single generator and a discriminator taking a conditional image and the target controllable structure as input.
Ranked #1 on Cross-View Image-to-Image Translation on cvusa
State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data.
Ranked #1 on Facial Expression Translation on CelebA
To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g., by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN.
To address this problem, a possible solution is to provide the agent with information about past observations.
Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model which is competitive with other unsupervised deep learning methods for depth prediction.
Inspired by the success of adversarial learning, we propose a new end-to-end unsupervised deep learning framework for monocular depth estimation consisting of two Generative Adversarial Networks (GAN), deeply coupled with a structured Conditional Random Field (CRF) model.
In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C²GAN) for the task of keypoint-guided image generation.
Deep learning has been successfully applied to several problems related to autonomous driving.
In this work, we present a method for training a car detection system with annotated data from a source domain (day images) without requiring image annotations from the target domain (night images).
In this work, we propose a novel GAN architecture that decouples the required annotations into a category label - that specifies the gesture type - and a simple-to-draw category-independent conditional map - that expresses the location, rotation and size of the hand gesture.
In this paper, we propose a novel Pattern-Affinitive Propagation (PAP) framework to jointly predict depth, surface normal and semantic segmentation.
Ranked #18 on Monocular Depth Estimation on NYU-Depth V2
Gaze correction aims to redirect the person's gaze into the camera by manipulating the eye region, and it can be considered as a specific image resynthesis problem.
Although facial landmark localization (FLL) approaches are becoming increasingly accurate for characterizing facial regions, one question remains unanswered: what is the impact of these approaches on subsequent related tasks?
To implement this idea we derive specialized deep models for each domain by adapting a pre-trained architecture but, differently from other methods, we propose a novel strategy to automatically adjust the computational complexity of the network.
Although very effective, evolutionary algorithms rely heavily on having a large population of individuals (i.e., network architectures) and are therefore memory-expensive.
In this paper, we focus on the facial expression translation task and propose a novel Expression Conditional GAN (ECGAN) which can learn the mapping from one image domain to another one based on an additional expression attribute.
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images.
Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa.
Our proposal is evaluated on the well-established KITTI dataset, where we show that our online method is competitive with state-of-the-art algorithms trained in a batch setting.
In this paper, we propose a novel approach named Multi-Channel Attention Selection GAN (SelectionGAN) that makes it possible to generate images of natural scenes from arbitrary viewpoints, based on an image of the scene and a novel semantic map.
Hashing methods have been recently found very effective in retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed.
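The speed argument comes from ranking by Hamming distance over binary codes rather than by floating-point similarity. A minimal NumPy sketch, with toy 8-bit codes as stand-ins for learned hash codes:

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance to the query; codes are
    binary arrays of shape (n_bits,) and (n_items, n_bits)."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)  # fewest differing bits first

db = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
               [1, 1, 1, 0, 1, 0, 0, 1],
               [0, 0, 0, 0, 0, 0, 0, 0],
               [1, 0, 0, 1, 0, 1, 1, 0]], dtype=np.uint8)
q = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)
print(hamming_rank(q, db))  # -> [0 1 2 3], distances 0, 1, 4, 8 bits
```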
To handle this limitation, in this paper we propose a novel Attention-Guided Generative Adversarial Network (AGGAN), which can detect the most discriminative semantic object and minimize changes to unwanted parts for semantic manipulation problems, without using extra data and models.
Ranked #1 on Facial Expression Translation on Bu3dfe
Therefore, recent works have proposed deep architectures that address the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth.
A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift.
To this end, we propose a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) which is an end-to-end framework and contains two pairs of generators and discriminators, one of which is used to generate faces with attributes while the other one is employed for image-to-sketch translation.
Gesture recognition is a hot topic in computer vision and pattern recognition, playing a vitally important role in natural human-computer interaction.
Ranked #1 on Hand Gesture Recognition on Cambridge
State-of-the-art methods for image-to-image translation with Generative Adversarial Networks (GANs) can learn a mapping from one domain to another domain using unpaired image data.
Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits.
We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the group cohesiveness scale.
This is achieved through a deep architecture that decouples appearance and motion information.
Our approach takes as input a natural image and exploits recent models for deep style transfer and generative adversarial networks to change its style in order to modify a specific high-level attribute.
Therefore, this task requires a high-level understanding of the mapping between the input source gesture and the output target gesture.
Ranked #1 on Gesture-to-Gesture Translation on NTU Hand Digit
The proposed architecture consists of two generative sub-networks, jointly trained with adversarial learning for reconstructing the disparity map and organized in a cycle so as to provide mutual constraints and supervision to each other.
Extensive experiments demonstrate the effectiveness of our model that combines DNN and CRF for learning robust multi-scale local similarities.
Depth estimation and scene parsing are two particularly important tasks in visual scene understanding.
Ranked #10 on Depth Estimation on NYU-Depth V2
Recent works have shown the benefit of integrating Conditional Random Fields (CRFs) models into deep architectures for improving pixel-level prediction tasks.
In this paper we address the problem of learning robust cross-domain representations for sketch-based image retrieval (SBIR).
Depth cues have proven very useful in various computer vision and robotic tasks.
Recent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection.
Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose.
Ranked #5 on Gesture-to-Gesture Translation on NTU Hand Digit
In this AVEC challenge we explore different modalities (speech, language and visual features extracted from face) to design and develop automatic methods for the detection of depression.
In this paper we address the abnormality detection problem in crowded scenes.
Ranked #3 on Abnormal Event Detection In Video on UCSD Ped2
We analyze the effectiveness of four families of visual features and we discuss some human interpretable patterns that explain the personality traits of the individuals.
The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video.
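As a hedged sketch of the aggregation problem, the snippet below pools per-frame CNN features into one fixed-length descriptor with average and max pooling; the paper's actual aggregation may be more elaborate.

```python
import torch

def video_descriptor(frame_feats: torch.Tensor) -> torch.Tensor:
    """Collapse per-frame CNN features (T, D) into a single fixed-length
    video descriptor via average and max pooling over time."""
    return torch.cat([frame_feats.mean(dim=0), frame_feats.max(dim=0).values])  # (2D,)
```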
Abnormal crowd behaviour detection attracts considerable interest due to its importance in video surveillance scenarios.
Then, the learned feature representations are transferred to a second deep network, which receives as input an RGB image and outputs the detection results.
This paper addresses the problem of depth estimation from a single still image.
Ranked #9 on Depth Estimation on NYU-Depth V2
In this work, we show that it is possible to automatically retrieve the best style seeds for a given image, thus remarkably reducing the number of human attempts needed to find a good match.
In our overly-connected world, the automatic recognition of virality - the quality of an image or video to be rapidly and widely spread in social networks - is of crucial importance, and has recently awakened the interest of the computer vision community.
In this paper, we show that keeping track of the changes in the CNN feature across time can facilitate capturing the local abnormality.
But in a world where the preference for safe looking neighborhoods is small, the connection between the perception of safety and liveliness will be either weak or nonexistent.
We aim to publish the dataset with the article, to be used as a benchmark for the community.
A very popular approach for transductive multi-label recognition under linear classification settings is matrix completion.
Recent studies in computer vision have shown that, while practically invisible to a human observer, skin color changes due to blood flow can be captured on face videos and, surprisingly, be used to estimate the heart rate (HR).
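A minimal sketch of the basic rPPG pipeline behind such methods, under simple assumptions (mean green-channel trace from a face region, dominant-frequency estimate in the physiological band); published methods add signal separation and tracking on top of this.

```python
import numpy as np

def estimate_hr(green_trace: np.ndarray, fps: float) -> float:
    """Estimate heart rate (bpm) from the mean green-channel intensity of a
    face region over time: remove the mean, then pick the dominant frequency
    inside the plausible physiological band (~0.7-4 Hz, i.e. 42-240 bpm)."""
    x = green_trace - green_trace.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(power[band])]
```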
In this paper, we present a comprehensive survey of learning to hash algorithms, categorize them according to how they preserve similarities into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization, and discuss their relations.
The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training.
Ranked #26 on Weakly Supervised Object Detection on PASCAL VOC 2007
This is mainly because it is hard to collect data about "city life".
The main contribution of our paper is that we use a 3D model reconstructed by a short video as the query to realize 3D-to-3D localization under a multi-task point retrieval framework.
The combination of appearance-based static "objectness" (Selective Search), motion information (Dense Trajectories) and transductive learning (detectors are forced to "overfit" on the unsupervised data used for training) makes the proposed approach extremely robust.
To support the ability of our method to reliably reconstruct 3D shapes, we introduce a simple method for head pose estimation using a single image that reaches higher accuracy than the state of the art.
We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes.
Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., cocktail party) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels.
In the dataset, a massive annotation effort has been carried out, focusing on the spectators at different levels of detail: at a higher level, people have been labeled according to the team they are supporting and whether they know the people close to them; at the lower levels, standard pose information has been considered (regarding the head and the body), along with fine-grained actions such as hands on hips, clapping hands, etc.
In multimedia annotation, due to the time constraints and the tediousness of manual tagging, it is quite common to utilize both tagged and untagged data to improve the performance of supervised learning when only limited tagged training data are available.
It has been shown that such object regions can be used to focus computer vision techniques on the parts of an image that matter most, leading to significant improvements in both object localisation and semantic segmentation in recent years.
Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events.