Furthermore, to regularize the unseen target views, we constrain the rendered colors and depths from different input views to be the same.
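This cross-view agreement constraint can be sketched as a simple penalty. The function name, the L1 form, and the depth weighting below are illustrative assumptions, not the paper's actual loss:

```python
import numpy as np

# Hypothetical sketch of a cross-view consistency regularizer: colors and
# depths rendered for the same target pixels from two different input views
# are pushed to agree via an L1 penalty.
def consistency_loss(color_a, depth_a, color_b, depth_b, w_depth=0.1):
    """L1 agreement between renderings from two input views."""
    color_term = np.abs(color_a - color_b).mean()
    depth_term = np.abs(depth_a - depth_b).mean()
    return color_term + w_depth * depth_term

rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3))
depth = rng.random((4, 4))

# Identical renderings incur zero penalty; disagreeing ones a positive penalty.
loss_same = consistency_loss(rgb, depth, rgb, depth)
loss_diff = consistency_loss(rgb, depth, 1.0 - rgb, depth + 0.5)
```

Minimizing such a term over sampled target rays pushes all input views toward one consistent target-view rendering.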
Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization).
The complexity of psychological principles underscores a significant societal challenge, given the vast social implications of psychological problems.
Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image.
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training.
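Sharpness in this sense can be measured with power iteration. A minimal sketch on an explicit dense Hessian; in practice one would use Hessian-vector products through autodiff rather than materializing the matrix:

```python
import numpy as np

# Power iteration for the top eigenvalue of a symmetric matrix H
# (the sharpness, when H is the loss Hessian).
def top_eigenvalue(H, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return float(v @ H @ v)  # Rayleigh quotient at convergence

# Toy symmetric "Hessian" with known top eigenvalue 3.0.
H = np.diag([3.0, 1.0, 0.5])
sharpness = top_eigenvalue(H)
```

Tracking this quantity during training is how phenomena such as progressive sharpening and the edge of stability are observed.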
To augment the richness of 3D facial animation, we construct a new 3D dataset with detailed shapes and learn to synthesize facial details in line with speech content.
Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large.
For developers and amateurs alike, it is very difficult to master all of these tasks to satisfy their requirements in music processing, especially given the large differences among tasks in music data representations and in model applicability across platforms.
In this paper, we present a novel framework (EMoG) to tackle the above challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotion clues to guide the generation process, making the generation much easier; 2) To model joint correlation, we propose to decompose the difficult gesture generation into two sub-problems: joint correlation modeling and temporal dynamics modeling.
3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image.
More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details.
Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL.
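For context, the baseline aggregation whose convergence non-IID data hampers is federated averaging. A minimal sketch; the function name and shapes are illustrative assumptions, not any specific paper's API:

```python
import numpy as np

# FedAvg-style aggregation: the server averages client parameters,
# weighting each client by its local dataset size.
def fedavg(client_params, client_sizes):
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return sum(w * p for w, p in zip(weights, client_params))

# Two clients with different data volumes and divergent local updates;
# under non-IID data such divergence is what slows global convergence.
p1 = np.array([1.0, 1.0])
p2 = np.array([3.0, 5.0])
global_params = fedavg([p1, p2], client_sizes=[100, 300])
```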
Learning a feature representation that is both general (for AI tasks) and compact (for compression) is pivotal to the success of such a scheme.

Traditional media coding schemes typically encode image/video into a semantic-unknown binary stream, which fails to directly support downstream intelligent tasks at the bitstream level.
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
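The recurrence structure of partial Jacobians can be illustrated numerically. This toy uses a deep *linear* network for transparency (the paper's analysis covers nonlinear nets with LayerNorm and residual connections), where the input-to-layer-l Jacobian obeys J_l = W_l J_{l-1}:

```python
import numpy as np

# Track the partial Jacobian from the input to each layer of a deep
# linear network via the recurrence J_l = W_l @ J_{l-1}.
rng = np.random.default_rng(0)
depth, width = 4, 8
# 1/sqrt(width) scaling keeps the layerwise maps roughly norm-preserving.
Ws = [rng.standard_normal((width, width)) / np.sqrt(width)
      for _ in range(depth)]

J = np.eye(width)  # Jacobian of the input w.r.t. itself
norms = []
for W in Ws:
    J = W @ J                            # recurrence step
    norms.append(np.linalg.norm(J, 2))   # spectral norm at this depth
```

At a critical initialization these norms stay O(1) with depth rather than exploding or vanishing, which is the behavior the recurrence relations are used to diagnose.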
Unsupervised Person Re-identification (U-ReID) with pseudo labeling, built on modern clustering algorithms, has recently reached performance competitive with fully-supervised ReID methods.
Low-light image enhancement is inherently ill-posed, as a given image may have many valid enhanced versions, yet recent studies focus on building a deterministic mapping from the input to a single enhanced version.
In this paper, we propose an embarrassingly simple yet highly effective adversarial domain adaptation (ADA) method for effectively training models for alignment.
Driven by the success of deep learning, the last decade has seen rapid advances in person re-identification (re-ID).
Specifically, we introduce gait recognition as an auxiliary task to drive the image ReID model to learn cloth-agnostic representations by leveraging a person's unique, cloth-independent gait information. We name this framework GI-ReID.
The CNN encoder is responsible for efficiently extracting discriminative spatial features while the DI decoder is designed to densely model spatial-temporal inherent interaction across frames.
In this paper, we address this issue by disentangling surveillance video into a global spatio-temporal feature (memory) for each Group of Pictures (GoP) and a skeleton for each frame (clue).
Neural machine translation on low-resource language is challenging due to the lack of bilingual sentence pairs.
Unsupervised domain translation has recently achieved impressive performance with Generative Adversarial Network (GAN) and sufficient (unpaired) training data.
Dual learning has attracted much attention in machine learning, computer vision and natural language processing communities.
The experimental results verify the framework's efficiency, demonstrating bitrate savings of 71.41%, 48.28% and 52.67% over JPEG2000, WebP and neural-network-based codecs, respectively, under the same face verification accuracy-distortion metric.
Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures.
One key challenge to learning-based video compression is that motion predictive coding, a very effective tool for video compression, can hardly be trained into a neural network.
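Motion predictive coding itself is easy to sketch outside a neural network. The global integer-pixel motion model and function name below are illustrative assumptions:

```python
import numpy as np

# Predict the current frame by shifting the previous frame with a motion
# vector, then code only the residual.
def motion_compensate(prev_frame, mv):
    """Predict the current frame by a global integer-pixel shift (dy, dx)."""
    return np.roll(prev_frame, shift=mv, axis=(0, 1))

prev = np.zeros((8, 8))
prev[2:4, 2:4] = 1.0                              # a small bright block
curr = np.roll(prev, shift=(1, 1), axis=(0, 1))   # the block moved by (1, 1)

pred = motion_compensate(prev, (1, 1))
residual = curr - pred            # what actually needs to be coded
residual_energy = float((residual ** 2).sum())
```

With an accurate motion vector the residual energy collapses to zero; the difficulty the snippet refers to is making this search-and-shift operation differentiable enough to train end to end.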