Audio-Visual Synchronization
8 papers with code • 0 benchmarks • 3 datasets
Latest papers with no code
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative for the corresponding text.
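A learnable-query encoder of this kind can be sketched with plain cross-attention: a small set of trainable query vectors attends over a sequence of audiovisual frame features and pools them into a fixed number of outputs. The dimensions and the single-head attention below are illustrative assumptions, not CoAVT's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16           # hidden dimension (illustrative)
num_queries = 4  # number of learnable query embeddings
T = 10           # number of audiovisual feature frames

# Hypothetical learnable queries and extracted audiovisual frame features
queries = rng.standard_normal((num_queries, d))
av_feats = rng.standard_normal((T, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query attends over all audiovisual frames
attn = softmax(queries @ av_feats.T / np.sqrt(d))  # (num_queries, T)
extracted = attn @ av_feats                        # (num_queries, d)
print(extracted.shape)  # (4, 16)
```

In the full model the queries would be trained jointly with a text objective so that they learn to pull out text-relevant audiovisual content; here they are just random vectors showing the mechanics.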
Comparative Analysis of Deep-Fake Algorithms
We examine the various deep learning-based approaches used for creating deepfakes, as well as the techniques used for detecting them.
Audio-driven Talking Face Generation by Overcoming Unintended Information Flow
Specifically, this involves unintended flow of lip, pose and other information from the reference to the generated image, as well as instabilities during model training.
On the Audio-visual Synchronization for Lip-to-Speech Synthesis
Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are perfectly synchronized.
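When that assumption fails, a common first step is to estimate the audio-video offset by correlating a per-frame audio signal against a per-frame visual signal and picking the shift with maximum agreement. The signals and window size below are hypothetical stand-ins for real features, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
mouth_open = rng.random(T)                       # per-frame mouth-opening signal (hypothetical)
true_offset = 3
audio_energy = np.roll(mouth_open, true_offset)  # audio lags video by 3 frames

def estimate_offset(audio, video, max_shift=10):
    # Try shifts of the audio track and keep the one that best correlates with video
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [np.dot(np.roll(audio, -s), video) for s in shifts]
    return shifts[int(np.argmax(scores))]

print(estimate_offset(audio_energy, mouth_open))  # 3
```

Shifting the audio back by the estimated offset would then resynchronize the pair before training or evaluation.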
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
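The retrieval step of such an audio-lip memory can be sketched as soft key-value lookup: audio features act as keys, aligned lip-motion features as values, and an audio query returns a similarity-weighted blend of stored lip features. All dimensions and the dot-product similarity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_audio, d_lip, N = 8, 12, 50

# Memory built from training pairs: audio features key the aligned lip-motion features
keys = rng.standard_normal((N, d_audio))   # audio-feature keys (hypothetical)
values = rng.standard_normal((N, d_lip))   # lip-motion values from ground-truth frames

def retrieve(audio_query, keys, values):
    # Soft retrieval: similarity-weighted sum of the stored lip-motion features
    sims = keys @ audio_query / np.sqrt(len(audio_query))
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return w @ values

lip_feat = retrieve(keys[7], keys, values)  # query with a known audio key
print(lip_feat.shape)  # (12,)
```

At inference time only audio is available, so the retrieved lip-motion feature substitutes for the ground-truth visual cue when generating the face.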
Rethinking Audio-visual Synchronization for Active Speaker Detection
This clarified definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking.
Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing.
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice accompanying lip motion or the music of playing instruments.
Identity-Preserving Realistic Talking Face Generation
The necessary attributes of a realistic face animation are (1) audio-visual synchronization, (2) identity preservation of the target individual, (3) plausible mouth movements, and (4) presence of natural eye blinks.
Realistic Speech-Driven Facial Animation with GANs
We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.