Audio-Visual Synchronization

8 papers with code • 0 benchmarks • 3 datasets

Audio-visual synchronization is the task of temporally aligning the audio and visual streams of a video, for example by predicting the time offset between them or detecting whether they are in sync.

Most implemented papers

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

v-iashin/sparsesync 13 Oct 2022

This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
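Papers in this area (including this one and Synchformer below) commonly frame synchronization as predicting the temporal offset between the audio and visual streams. As a minimal illustrative sketch, not the method of any particular paper, an offset can be scored by comparing per-frame audio and video embeddings at each candidate shift; the function name `estimate_offset` is hypothetical:

```python
import numpy as np

def estimate_offset(audio_feats, video_feats, max_offset):
    """Estimate the audio-video offset (in frames) by scoring each
    candidate shift with the mean dot product of aligned embeddings.

    audio_feats, video_feats: (T, D) arrays of per-frame embeddings.
    Assumes max_offset is smaller than the number of frames.
    A positive result means the audio stream lags the video stream.
    """
    T = min(len(audio_feats), len(video_feats))
    best_offset, best_score = 0, -np.inf
    for k in range(-max_offset, max_offset + 1):
        # Align video frame t with audio frame t + k.
        a_lo, a_hi = max(k, 0), min(T, T + k)
        v_lo, v_hi = max(-k, 0), min(T, T - k)
        score = np.mean(
            np.sum(audio_feats[a_lo:a_hi] * video_feats[v_lo:v_hi], axis=1)
        )
        if score > best_score:
            best_offset, best_score = k, score
    return best_offset
```

Real systems replace the raw dot product with learned audio and visual encoders and predict the offset as a classification over candidate shifts, but the alignment-and-score structure is the same.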

Multimodal Transformer Distillation for Audio-Visual Synchronization

vskadandale/vocalist 27 Oct 2022

This paper proposes MTDVocaLiST, a model trained with the authors' multimodal Transformer distillation (MTD) loss.
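The MTD loss distills knowledge from a large teacher (VocaLiST) into a smaller student. As a generic illustration of distillation rather than the paper's specific loss (which operates on Transformer internals), response-based distillation matches the student's output distribution to a temperature-softened teacher distribution; `distillation_loss` is a hypothetical helper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL divergence between temperature-softened teacher and
    student distributions, one distribution per row of logits."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return float(np.sum(t * (np.log(t) - np.log(s))) / len(student_logits))
```

In practice this term is usually combined with the ordinary task loss (and often scaled by the squared temperature), so the student learns from both the labels and the teacher's soft predictions.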

Synchformer: Efficient Synchronization from Sparse Cues

v-iashin/synchformer 29 Jan 2024

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse.

Solos: A Dataset for Audio-Visual Music Analysis

JuanFMontesinos/Solos 14 Jun 2020

In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks, such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task.

Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet

maxrmorrison/clpcnet 5 Oct 2021

Modifying the pitch and timing of an audio signal is a fundamental audio editing operation with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis.
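The simplest way to change a signal's speed is to resample it, but this couples pitch and timing: playing faster also raises the pitch. That coupling is exactly what dedicated pitch-shifting and time-stretching methods (such as the controllable neural vocoder above) aim to remove. A naive resampler, sketched here for illustration with the hypothetical name `resample_stretch`:

```python
import numpy as np

def resample_stretch(signal, rate):
    """Change playback speed by linear-interpolation resampling.

    rate > 1 shortens the signal and raises pitch; rate < 1 lengthens
    it and lowers pitch. Pitch and duration change together, which is
    the limitation that controllable pitch/time methods are built to
    overcome.
    """
    n_out = int(len(signal) / rate)
    # Sample positions in the original signal for each output sample.
    positions = np.arange(n_out) * rate
    return np.interp(positions, np.arange(len(signal)), signal)
```

Controllable methods instead modify pitch and duration independently, e.g. by operating on a vocoder's acoustic features rather than the raw waveform.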

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

vskadandale/vocalist 5 Apr 2022

Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end.

Target Active Speaker Detection with Audio-visual Cues

jiang-yidi/ts-talknet 22 May 2023

To benefit from both facial cue and reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking.

PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

amazon-science/avgen-eval-toolkit 10 Apr 2024

Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks.