To explore the effects of age on facial images, we propose a Disentangled Adversarial Autoencoder (DAAE) that disentangles facial images into three independent factors: age, identity and extraneous information.
Extensive experiments on 16 IR tasks and 26 benchmarks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance.
Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task.
We present a novel task and human-annotated dataset for evaluating the ability of vision-language models to generate captions and summaries for real-world video clips, which we call Video-CSR (Captioning, Summarization and Retrieval).
These analogous problems are related to the input one, with reusable solutions and problem-solving strategies.
To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes.
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping.
Visible-Infrared person re-identification (VI-ReID) is an important and challenging task in intelligent video surveillance.
The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks.
Thereafter, we fine-tune CLIP with off-the-shelf methods by combining labeled and synthesized features.
This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies.
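The thumbnail-layout idea can be illustrated as tiling a clip's frames into one grid image so that spatial content and temporal order are both preserved; this is a minimal sketch under assumed grid shape and frame count, not the authors' implementation:

```python
import numpy as np

def thumbnail_layout(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile (rows*cols, H, W, C) frames into one (rows*H, cols*W, C) image,
    preserving temporal order left-to-right, top-to-bottom."""
    t, h, w, c = frames.shape
    assert t == rows * cols, "frame count must fill the grid exactly"
    grid = frames.reshape(rows, cols, h, w, c)   # (rows, cols, H, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)         # (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)

# 8 frames of a 4x4 RGB clip tiled into a 2x4 thumbnail layout
clip = np.arange(8 * 4 * 4 * 3).reshape(8, 4, 4, 3)
tall = thumbnail_layout(clip, rows=2, cols=4)
print(tall.shape)  # (8, 16, 3)
```

A 2D backbone can then process `tall` as a single image while still seeing all frames.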
To implement this benchmark, we have developed a unified framework in PyTorch, which allows for consistent evaluation and comparison of the TTA methods across the different datasets and network architectures.
When decreasing the number of sampling steps (i.e., the number of line segments used to fit the path), the ease of fitting straight lines compared to curves allows us to generate higher-quality samples from random noise with fewer iterations.
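The intuition that straight trajectories need fewer integration steps than curved ones can be shown with a toy ODE sampler; the constant-velocity field below is an illustrative stand-in for a learned model:

```python
import numpy as np

def euler_sample(x1, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    x, dt = x1.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

x0 = np.array([3.0, -1.0])        # "data" endpoint
x1 = np.array([0.5, 0.5])         # "noise" endpoint
v_straight = lambda x, t: x1 - x0  # constant velocity => straight path

# A straight trajectory is fit exactly by line segments: even ONE Euler
# step lands on the data endpoint, so fewer sampling iterations suffice.
print(euler_sample(x1, v_straight, steps=1))
```

A curved trajectory, by contrast, accrues discretization error at every segment, which is why straightening pays off at low step counts.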
This paper proposes a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information as well as the bidirectional interaction between them in context-aware ways.
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR) by accessing both the source and target data.
In CloFormer, the combination of AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables the model to perceive both high-frequency and low-frequency information.
Ranked #547 on Image Classification on ImageNet
Test-time adaptation (TTA), an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Out-of-distribution (OOD) detection is a crucial aspect of deploying machine learning models in open-world applications.
To address this issue, we propose a model preprocessing framework, named AdaptGuard, to improve the security of model adaptation algorithms.
Briefly, MODIFY first trains a generative model in the target domain and then translates a source input to the target domain via the provided style model.
A relation learning module masks partial correlations between regions to reduce redundancy and then propagates the relational information across regions to capture the irregularity from a global view of the graph.
Existing cross-domain keypoint detection methods always require accessing the source data during adaptation, which may violate the data privacy law and pose serious security concerns.
In particular, self-attention with cross-scale matching and convolution filters with different kernel sizes are designed to exploit the multi-scale features in images.
STA decomposes vanilla global attention into multiplications of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies.
Then, we optimize the augmented samples by minimizing the norms of the data scores, i.e., the gradients of the log-density functions.
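Minimizing the score norm pushes samples toward high-density modes. A toy sketch with a 1-D Gaussian, whose score is known in closed form, replaces the paper's learned score model (step size and iteration count are arbitrary here):

```python
# Toy score function for a 1-D Gaussian N(mu, sigma^2):
# score(x) = d/dx log p(x) = -(x - mu) / sigma**2
mu, sigma = 2.0, 1.5
score = lambda x: -(x - mu) / sigma**2

def refine(x, lr=0.5, iters=200):
    """Gradient descent on the squared score norm ||score(x)||^2.
    d/dx ||s(x)||^2 = 2 * s(x) * s'(x), with s'(x) = -1/sigma^2 here."""
    for _ in range(iters):
        grad = 2.0 * score(x) * (-1.0 / sigma**2)
        x = x - lr * grad
    return x

x_aug = 7.0                     # an augmented sample far in the tail
print(round(refine(x_aug), 4))  # converges toward the mode mu = 2.0
```

Since the score vanishes exactly at the mode, driving its norm to zero relocates out-of-distribution augmentations onto the data manifold (here, the single point mu).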
To reduce the training time, we further develop SAC-m that selects CutMix Augmented samples as model inputs, without the need for training the surrogate models or generating adversarial examples.
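CutMix itself, the augmentation this variant builds on, can be sketched as pasting a rectangular patch from one image into another; the fixed patch position below replaces the usual random sampling for reproducibility:

```python
import numpy as np

def cutmix(img_a, img_b, top, left, size):
    """Paste a (size x size) patch of img_b into img_a (CutMix-style).
    Returns the mixed image and the label-mixing ratio lambda."""
    mixed = img_a.copy()
    mixed[top:top + size, left:left + size] = img_b[top:top + size, left:left + size]
    h, w = img_a.shape[:2]
    lam = 1.0 - (size * size) / (h * w)   # fraction of img_a kept
    return mixed, lam

a = np.zeros((8, 8), dtype=int)
b = np.ones((8, 8), dtype=int)
mixed, lam = cutmix(a, b, top=2, left=2, size=4)
print(int(mixed.sum()), lam)  # 16 patch pixels from b, lambda = 0.75
```

Labels are mixed with the same `lam`, so the target reflects how much of each source image survives.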
To alleviate the above problems, we propose a simple but effective method with Parallel Augmentation and Dual Enhancement (PADE) that is robust on both occluded and non-occluded data, and does not require any auxiliary clues.
Extensive experiments on both node-level and graph-level benchmarks show that the proposed DPS achieves impressive performance on various graph domain generalization tasks.
First of all, to avoid additional parameters and explore the information in the source model, ProxyMix defines the weights of the classifier as the class prototypes and then constructs a class-balanced proxy source domain by the nearest neighbors of the prototypes to bridge the unseen source domain and the target domain.
no code implementations • 20 May 2022 • Bingzhe Wu, Jintang Li, Junchi Yu, Yatao Bian, Hengtong Zhang, Chaochao Chen, Chengbin Hou, Guoji Fu, Liang Chen, Tingyang Xu, Yu Rong, Xiaolin Zheng, Junzhou Huang, Ran He, Baoyuan Wu, Guangyu Sun, Peng Cui, Zibin Zheng, Zhe Liu, Peilin Zhao
Deep graph learning has achieved remarkable progress in both business and scientific areas, ranging from finance and e-commerce to drug and advanced material discovery.
Deep neural networks have achieved impressive performance in a variety of tasks over the last decade, such as autonomous driving, face recognition, and medical diagnosis.
In this work, we address these key challenges and propose IFEXPLAINER, which generates a necessary and sufficient explanation for GNNs.
To achieve bilateral adaptation in the target domain, we further maximize localized mutual information to align known samples with the source classifier, and employ an entropic loss to push unknown samples far away from the source classification boundary.
Ranked #2 on Universal Domain Adaptation on VisDA2017
To make further efforts on accurate and reliable iris segmentation, we propose a bilateral self-attention module and design Bilateral Transformer (BiTrans) with hierarchical architecture by exploring spatial and visual relationships.
Human face synthesis involves transferring knowledge about the identity and identity-dependent face shape (IDFS) of a human face to target face images where the context (e.g., facial expressions, head poses, and other background factors) may change dramatically.
To address this issue, we propose a memory-oriented semi-supervised (MOSS) method which enables the network to explore and exploit the properties of rain streaks from both synthetic and real data.
We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in the video.
In this work, we propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model.
In this work, we propose a novel two-stage framework named FaceInpainter to implement controllable Identity-Guided Face Inpainting (IGFI) under heterogeneous domains.
In this paper, we address the makeup transfer and removal tasks simultaneously, which aim to transfer the makeup from a reference image to a source image and remove the makeup from the with-makeup image respectively.
To ease the burden of labeling, unsupervised domain adaptation (UDA) aims to transfer knowledge in previous and related labeled datasets (sources) to a new unlabeled dataset (target).
We interpolate training samples at the feature level and propose a novel content loss based on the perceptual relations among samples.
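Feature-level interpolation can be sketched mixup-style as a convex combination of two feature vectors; the perceptual-relation content loss itself is not reproduced, and the mixing ratio is an arbitrary example:

```python
import numpy as np

def interpolate_features(z1, z2, lam):
    """Linear interpolation of two samples at the feature level."""
    return lam * z1 + (1.0 - lam) * z2

z1 = np.array([1.0, 0.0, 4.0])
z2 = np.array([3.0, 2.0, 0.0])
z_mix = interpolate_features(z1, z2, lam=0.25)
print(z_mix)  # [2.5 1.5 1. ]
```

The interpolated feature inherits a controllable share of each endpoint, which is what lets a relation-based loss compare the mixture against its sources.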
The emergence of Graph Convolutional Network (GCN) has greatly boosted the progress of graph learning.
Visible-Infrared person re-identification (VI-ReID) aims to match cross-modality pedestrian images, breaking through the limitation of single-modality person ReID in dark environment.
Furthermore, we propose a new labeling transfer strategy, which separates the target data into two splits based on the confidence of predictions (labeling information), and then employs semi-supervised learning to improve the accuracy of less-confident predictions in the target domain.
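The confidence-based split can be sketched as thresholding the maximum predicted probability per sample; the threshold value is an assumed hyperparameter, not one taken from the paper:

```python
def split_by_confidence(preds, threshold=0.9):
    """Split target samples into confident / less-confident index lists by
    the maximum predicted class probability (the 'labeling information')."""
    confident, uncertain = [], []
    for i, probs in enumerate(preds):
        (confident if max(probs) >= threshold else uncertain).append(i)
    return confident, uncertain

preds = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.60, 0.40]]
print(split_by_confidence(preds))  # ([0, 2], [1, 3])
```

The confident split then serves as pseudo-labeled data, while semi-supervised learning refines predictions on the uncertain split.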
Considering the intuitive artifacts in the existing methods, we propose a contrastive style loss for style rendering to enforce the similarity between the style of rendered photo and the caricature, and simultaneously enhance its discrepancy to the photos.
However, due to the lack of Deepfake datasets with large variance in appearance, which can hardly be produced by recent identity swapping methods, detection algorithms may fail in this situation.
We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
It is difficult for encoders to capture such powerful representations under this complex situation.
We propose a refinement stage for the pyramid features to further boost the accuracy of our network.
In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning.
As a consequence, massive new diverse paired heterogeneous images with the same identity can be generated from noise.
In this paper, we rethink three freedoms of differentiable NAS, i.e., operation-level, depth-level and width-level, and propose a novel method, named Three-Freedom NAS (TF-NAS), to achieve both good classification accuracy and a precise latency constraint.
The LGR module utilizes body skeleton knowledge to construct a layout graph that connects all relevant part features, where a graph reasoning mechanism is used to propagate information among part nodes to mine their relations.
Finally, we propose a differentiable automatic data augmentation method to further improve estimation accuracy.
On one hand, negative transfer results in misclassification of target samples to the classes only present in the source domain.
Ranked #2 on Partial Domain Adaptation on DomainNet
The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio.
Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully.
The performance of multi-domain image-to-image translation has been significantly improved by recent progress in deep generative models.
A spectral conditional attention module is introduced to reduce the domain gap between NIR and VIS data and then improve the performance of NIR-VIS heterogeneous face recognition on various databases including the LAMP-HQ.
Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images.
In this paper, we address the makeup transfer task, which aims to transfer the makeup from a reference image to a source image.
Conversion of raw data into insights and knowledge requires substantial amounts of effort from data scientists.
The cross-sensor gap is one of the challenges that have aroused much research interests in Heterogeneous Face Recognition (HFR).
Recent studies have shown remarkable success in face manipulation tasks with the advance of GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity.
In particular, our network is end-to-end trained and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color).
In this work, we present a novel training framework for GANs, namely biphasic learning, to achieve image-to-image translation in multiple visual domains at $1024^2$ resolution.
With the rapid development of deep convolutional neural network, face detection has made great progress in recent years.
UVA is the first attempt to address facial age analysis tasks, including age translation, age generation and age estimation, in a universal framework.
In this paper, a new large-scale Multi-yaw Multi-pitch high-quality database is proposed for Facial Pose Analysis (M2FPA), including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition.
Furthermore, due to the lack of high-resolution face manipulation databases to verify the effectiveness of our method, we collect a new high-quality Multi-View Face (MVF-HQ) database.
Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space.
Ranked #1 on Face Verification on CASIA NIR-VIS 2.0
This paper models high-resolution heterogeneous face synthesis as a complementary combination of two components: a texture inpainting component and a pose correction component.
In this paper, we propose a deep multi-task learning framework, named as IrisParseNet, to exploit the inherent correlations between pupil, iris and sclera to boost up the performance of iris segmentation and localization in a unified model.
Ranked #1 on Iris Segmentation on CASIA
Deep learning based facial attribute analysis consists of two basic sub-issues: facial attribute estimation (FAE), which recognizes whether facial attributes are present in given images, and facial attribute manipulation (FAM), which synthesizes or removes desired facial attributes.
Talking face generation aims to synthesize a face video with precise lip synchronization as well as a smooth transition of facial motion over the entire video via the given speech clip and facial image.
Inspired by the biological evolutionary mechanism, we propose a Coupled Evolutionary Network (CEN) with two concurrent evolutionary processes: evolutionary label distribution learning and evolutionary slack regression.
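Label distribution learning for age typically replaces a one-hot age label with a distribution peaked at the true age, so neighboring ages share supervision. A common Gaussian construction is sketched below; the age range and sigma are assumed hyperparameters, not values from the paper:

```python
import math

def age_label_distribution(true_age, ages, sigma=2.0):
    """Soft label distribution over candidate ages, peaked at the true age
    (a standard construction in label distribution learning)."""
    weights = [math.exp(-(a - true_age) ** 2 / (2 * sigma ** 2)) for a in ages]
    total = sum(weights)
    return [w / total for w in weights]

ages = list(range(20, 31))
dist = age_label_distribution(25, ages)
# Normalized distribution with its peak at the true age
print(round(sum(dist), 6), ages[dist.index(max(dist))])
```

The network is then trained to match this distribution (e.g., with KL divergence) rather than a single hard label, which suits the gradual nature of aging.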
Moreover, to achieve accurate age generation while preserving identity information, an age estimation network and a face verification network are employed.
Extensive experimental results qualitatively and quantitatively demonstrate that our network is able to generate visually pleasing face completion results and edit face attributes as well.
It generates hashing bits by the output neurons of a deep hashing network.
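Deriving hashing bits from output neurons can be sketched as sign thresholding, with Hamming distance used for retrieval; the zero threshold is an assumption, not a detail from the paper:

```python
def hash_bits(outputs, threshold=0.0):
    """Binarize the output neurons of a deep hashing network into bits."""
    return [1 if o > threshold else 0 for o in outputs]

def hamming(b1, b2):
    """Hamming distance between two bit vectors, used to rank candidates."""
    return sum(x != y for x, y in zip(b1, b2))

code = hash_bits([0.7, -0.2, 1.3, -0.9])
print(code, hamming(code, [1, 1, 1, 1]))  # [1, 0, 1, 0] 2
```

Because the codes are short binary strings, matching reduces to cheap bitwise comparisons instead of floating-point distance computations.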
Visible (VIS) to near-infrared (NIR) face matching is a challenging problem due to the significant discrepancy between the two domains and a lack of sufficient data for training cross-modal matching algorithms.
Ranked #2 on Face Verification on BUAA-VisNir
On the other hand, the inference model is encouraged to discriminate between the generated and real samples, while the generator tries to fool it, as in GANs.
Variational capsules model an image as a composition of entities in a probabilistic model.
We decompose the prerequisite of warping into dense correspondence field estimation and facial texture map recovering, which are both well addressed by deep networks.
To utilize both global and local facial information, we propose a Global and Local Consistent Age Generative Adversarial Network (GLCA-GAN).
We treat the face completion and corruption as disentangling and fusing processes of clean faces and occlusions, and propose a jointly disentangling and fusing Generative Adversarial Network (DF-GAN).
An expression-invariant face recognition experiment is also performed to further show the advantages of our proposed method.
Most modern face super-resolution methods resort to convolutional neural networks (CNNs) to infer high-resolution (HR) face images.
Ranked #3 on Face Hallucination on FFHQ 512 x 512 - 16x upscaling
This paper introduces an Adversarial Occlusion-aware Face Detector (AOFD) by simultaneously detecting occluded faces and segmenting occluded areas.
Ranked #2 on Occluded Face Detection on MAFA
This paper proposes a learning from generation approach for makeup-invariant face verification by introducing a bi-level adversarial network (BLAN).
This framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network.
In this method, we use the subspace representations of different views to adaptively learn a consensus similarity matrix, uncovering the subspace structure and avoiding noisy nature of original data.
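The consensus idea can be sketched as a weighted combination of per-view similarity matrices; the weights are fixed in this sketch, whereas the method learns them adaptively from the subspace representations:

```python
import numpy as np

def consensus_similarity(view_sims, weights):
    """Weighted combination of per-view similarity matrices into one
    consensus matrix (weights fixed here for illustration)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize to sum to 1
    return sum(w * s for w, s in zip(weights, view_sims))

s1 = np.array([[1.0, 0.8], [0.8, 1.0]])  # similarity from view 1
s2 = np.array([[1.0, 0.2], [0.2, 1.0]])  # similarity from view 2
consensus = consensus_similarity([s1, s2], weights=[3, 1])
print(round(float(consensus[0, 1]), 2))  # 0.65
```

Spectral clustering on the consensus matrix then recovers the shared subspace structure while noisy views contribute less.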
To avoid the over-fitting problem on small-scale heterogeneous face data, a correlation prior is introduced on the fully-connected layers of the WCNN network to reduce the parameter space.
Ranked #3 on Face Verification on BUAA-VisNir
In multi-view clustering, different views may have different confidence levels when learning a consensus representation.
This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details.
In this paper, we propose a novel Attention-Set based Metric Learning (ASML) method to measure the statistical characteristics of image sets.
CDL seeks a shared feature space in which the heterogeneous face matching problem can be approximately treated as a homogeneous face matching problem.
The occlusions incurred by random meshes severely degenerate the performance of face verification systems, which raises the MeshFace verification problem between MeshFace and daily photos.
In this paper, we focus on the minimizer function and study a group of new regularizers, named self-paced implicit regularizers, that are deduced from robust loss functions.
Human beings often assess the aesthetic quality of an image coupled with the identification of the image's semantic content.
Ranked #5 on Aesthetics Quality Assessment on AVA
A new method called locally imposing function (LIF) is proposed to provide a local correction to the GCNN prediction function, which therefore falls within Locally Imposing Scheme (LIS).
This paper presents a Light CNN framework to learn a compact embedding on the large-scale face data with massive noisy labels.
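Light CNN's compactness relies on the Max-Feature-Map (MFM) activation, which halves the channel width by taking an element-wise maximum over channel pairs; the flat channel layout below is a simplification of the convolutional setting:

```python
import numpy as np

def max_feature_map(x):
    """Max-Feature-Map (MFM) activation: split the channels in half and
    take the element-wise maximum, halving width and suppressing noisy
    activations."""
    c = x.shape[0]
    assert c % 2 == 0, "MFM needs an even number of channels"
    return np.maximum(x[: c // 2], x[c // 2:])

feat = np.array([1.0, -2.0, 0.5, 3.0])  # 4 channels -> 2 channels
print(max_feature_map(feat))  # [1. 3.]
```

Acting as a competitive selection rather than a fixed nonlinearity like ReLU, MFM keeps only the stronger of each feature pair, which helps tolerate the massive label noise in the training data.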
Ranked #2 on Age-Invariant Face Recognition on CAFR
This paper presents a structured ordinal measure method for video-based face recognition that simultaneously learns ordinal filters and structured ordinal features.
For unsupervised learning, we propose a cross-modal subspace clustering method to learn a common structure for different modalities.
This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs).