no code implementations • 1 Jun 2023 • Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen
Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds.
1 code implementation • 5 Apr 2022 • Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro
Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end.
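A minimal sketch of this idea in PyTorch, assuming a pretrained lip-synchronisation visual encoder is available; the module names, layer sizes, and fusion scheme here are hypothetical illustrations, not the released code:

```python
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Separation network that reuses a frozen lip-sync visual encoder (sketch)."""
    def __init__(self, visual_encoder: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.visual_encoder = visual_encoder
        # Freeze the visual features learned during lip synchronisation.
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        # Audio branch and mask head are placeholders for illustration.
        self.audio_net = nn.Conv2d(1, feat_dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, mixture_spec, lip_frames):
        with torch.no_grad():                    # frozen branch: no gradients flow
            v = self.visual_encoder(lip_frames)  # assumed shape (B, feat_dim)
        a = self.audio_net(mixture_spec)         # (B, feat_dim, F, T)
        # Broadcast the visual embedding over the time-frequency grid and fuse.
        v = v[:, :, None, None].expand_as(a)
        mask = torch.sigmoid(self.mask_head(a * v))
        return mask * mixture_spec               # estimated target spectrogram
```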
1 code implementation • 8 Mar 2022 • Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro
In a second stage, the predominant voice is enhanced with an audio-only network.
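A sketch of the two-stage design under stated assumptions: `av_separator` and `audio_enhancer` are hypothetical stand-ins for the audio-visual separation network and the audio-only refinement network, respectively.

```python
def separate_voice(mixture_spec, video_frames, av_separator, audio_enhancer):
    # Stage 1: an audio-visual network isolates the predominant voice.
    coarse_voice = av_separator(mixture_spec, video_frames)
    # Stage 2: an audio-only network enhances that coarse estimate.
    enhanced_voice = audio_enhancer(coarse_voice)
    return enhanced_voice
```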
2 code implementations • 20 Apr 2021 • Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro
The task of isolating a target singing voice in music videos has useful applications.
1 code implementation • 14 Jun 2020 • Juan F. Montesinos, Olga Slizovskaia, Gloria Haro
In this paper, we present a new dataset of music performance videos which can be used to train machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task (see the loading sketch below).
Tasks: Audio Source Separation, Audio-Visual Synchronization, +1 · Categories: Audio and Speech Processing, Databases, Sound
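As referenced above, a hypothetical loading sketch for such an audio-visual dataset; the directory layout and file format are assumptions for illustration, not the dataset's actual structure:

```python
import os
from torch.utils.data import Dataset
import torchvision

class MusicPerformanceClips(Dataset):
    """Yields synchronized (frames, waveform) pairs from performance videos (sketch)."""
    def __init__(self, root: str):
        # Assumes one .mp4 file per performance clip directly under `root`.
        self.video_paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".mp4")
        )

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # A synchronized audio-visual pair is the basic unit for separation,
        # localization, correspondence, and generation tasks alike.
        frames, waveform, _info = torchvision.io.read_video(
            self.video_paths[idx], pts_unit="sec"
        )
        return frames, waveform
```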
2 code implementations • 23 Mar 2020 • Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro, Emilia Gómez
However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models.
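A minimal sketch of a FiLM-style control mechanism of the kind used to condition a single separation model on the target source; the layer sizes and where the conditioning is applied are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Convolutional block modulated by a one-hot source-selection vector (sketch)."""
    def __init__(self, channels: int, n_sources: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Control network maps the source label to per-channel scale and shift.
        self.control = nn.Linear(n_sources, 2 * channels)

    def forward(self, x, source_onehot):
        gamma, beta = self.control(source_onehot).chunk(2, dim=-1)
        x = self.conv(x)
        # Feature-wise modulation: the same weights separate any source,
        # selected purely by the control vector.
        return gamma[:, :, None, None] * x + beta[:, :, None, None]

# Switching the control vector switches the target source, e.g.:
#   block(spec, torch.tensor([[1., 0., 0., 0.]]))  # first source
#   block(spec, torch.tensor([[0., 1., 0., 0.]]))  # second source
```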