Given a video of a person speaking in a source language, generate a video of the same person speaking in a target language.
We introduce a data-driven approach for unsupervised video retargeting that translates content from one domain to another while preserving the style native to a domain, i. e., if contents of John Oliver's speech were to be transferred to Stephen Colbert, then the generated content/speech should be in Stephen Colbert's style.
To show this is effective, we incorporate the triple consistency loss into the training of a new landmark-guided face to face synthesis, where, contrary to previous works, the generated images can simultaneously undergo a large transformation in both expression and pose.
In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation".
SOTA for Talking Face Generation on LRW (using extra training data)