Speaker-Specific Lip to Speech Synthesis

3 papers with code • 7 benchmarks • 2 datasets

How accurately can we infer an individual’s speech style and content from his/her lip movements? [1]

In this task, the model is trained on a single specific speaker or a very small set of speakers, allowing it to learn that speaker's individual voice and speaking style.
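
As a concrete illustration (directory layout and names are hypothetical, not taken from any of the repositories below), the training split for this task is filtered by speaker identity rather than mixed across a whole corpus:

```python
from pathlib import Path

# Hypothetical layout: one directory per speaker, each silent face-crop
# video paired with the audio track it was separated from.
DATASET_ROOT = Path("data/lip2speech")

def speaker_specific_split(speaker_id: str) -> list[tuple[Path, Path]]:
    """Collect (video, audio) pairs for a single speaker.

    Speaker-specific lip-to-speech trains and evaluates on one speaker
    (or a handful), so clips are selected by identity up front.
    """
    clips = sorted((DATASET_ROOT / speaker_id).glob("*.mp4"))
    return [(clip, clip.with_suffix(".wav")) for clip in clips]

train_pairs = speaker_specific_split("speaker_01")
```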

[1] Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, CVPR 2020.

Most implemented papers

Densely Connected Convolutional Networks

liuzhuang13/DenseNet CVPR 2017

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output.
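The "shorter connections" are dense connectivity: within a dense block, every layer receives the concatenation of the block input and all earlier layers' outputs. A minimal PyTorch sketch of that pattern (omitting the paper's 1x1 bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: BN -> ReLU -> 3x3 conv, producing
    growth_rate new feature maps from everything seen so far."""

    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(torch.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Each layer's input is the concatenation of the block input and the
    outputs of all previous layers -- the short paths between early and
    late layers that the abstract refers to."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```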

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Rudrabha/Lip2Wav CVPR 2020

In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
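
The model follows the familiar sequence-to-sequence pattern: a spatio-temporal encoder over face crops feeds an attention decoder that predicts mel-spectrogram frames, which a vocoder then turns into a waveform. The sketch below shows only that general pattern; the layer choices and sizes are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Lip2SpeechSketch(nn.Module):
    """Minimal sketch of the lip-to-speech encoder-decoder pattern:
    a spatio-temporal encoder over face crops, and a teacher-forced
    decoder with attention that predicts mel-spectrogram frames.
    All sizes here are illustrative, not the paper's exact ones."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Conv3d(3, hidden, kernel_size=(5, 7, 7),
                                 stride=(1, 2, 2), padding=(2, 3, 3))
        self.squeeze = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.prenet = nn.GRU(n_mels, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, video: torch.Tensor, mels: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T_video, H, W) face crops
        # mels:  (B, T_mel, n_mels) ground-truth mel frames (teacher forcing)
        enc = self.squeeze(self.encoder(video))           # (B, hidden, T_video, 1, 1)
        enc = enc.flatten(2).transpose(1, 2)              # (B, T_video, hidden)
        # Condition each decoding step on the previous mel frame.
        prev = torch.cat([torch.zeros_like(mels[:, :1]), mels[:, :-1]], dim=1)
        queries, _ = self.prenet(prev)                    # (B, T_mel, hidden)
        context, _ = self.attn(queries, enc, enc)         # attend over video features
        return self.proj(torch.cat([queries, context], dim=-1))  # (B, T_mel, n_mels)
```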

Speech Reconstruction with Reminiscent Sound via Visual Voice Memory

joannahong/Speech-Reconstruction-with-Reminiscent-Sound-via-Visual-Voice-Memory IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021

Our key contributions are: (1) proposing the Visual Voice memory, which brings rich audio information that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen-speaker training by memorizing auditory features and the corresponding visual features.
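
The Visual Voice memory can be read as a key-value store addressed by visual features: lip features are matched against learned keys, and the retrieved values carry the associated audio information. A hedged sketch of that addressing step (slot count, dimensions, and the surrounding encoders are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualVoiceMemorySketch(nn.Module):
    """Key-value memory in the spirit of the paper: visual features
    address the slots, and the retrieved values hold audio features,
    so voice information can be recalled from silent video."""

    def __init__(self, num_slots: int = 128, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim))    # visual side
        self.values = nn.Parameter(torch.randn(num_slots, dim))  # audio side

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, T, dim) features from a lip/face encoder
        scores = visual_feat @ self.keys.t()       # (B, T, num_slots)
        weights = F.softmax(scores, dim=-1)        # soft slot addressing
        return weights @ self.values               # recalled audio features
```

During training, features from an encoder over the ground-truth audio would supervise the recalled values (e.g., via a reconstruction loss), so that at inference the memory can supply voice cues from the silent video alone.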