Textless speech-to-speech translation systems are rapidly advancing, thanks to the integration of self-supervised learning techniques.
This framework takes speech in the source language as input and generates speech in the target language, without the need for text transcriptions in either language.
We propose a method for emotion-preserving speech-to-speech translation that operates at the level of discrete speech units.
Our approach relies on an external model trained to generate a sequence of vector representations from text.
The x-vector architecture has recently achieved state-of-the-art results on the speaker verification task.
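To make the notion of discrete speech units concrete, the following sketch shows one common recipe (not necessarily the one used here): continuous frame-level features from a self-supervised encoder are quantized to unit IDs by nearest-centroid lookup in a learned codebook, and consecutive duplicate units are collapsed before translation. All values and function names below are hypothetical toy stand-ins.

```python
import numpy as np

def quantize_frames(features, codebook):
    """Assign each frame vector to the ID of its nearest codebook centroid."""
    # (T, D) features vs (K, D) codebook -> (T, K) squared distances
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def deduplicate(units):
    """Collapse runs of repeated units, a common step before translation."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(int(u))
    return out

# Toy example: 2-D "features" and a 3-entry codebook (values are illustrative).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
features = np.array([[0.1, -0.1], [0.2, 0.0], [1.1, 0.9], [1.9, 2.1]])
units = quantize_frames(features, codebook)  # [0, 0, 1, 2]
print(deduplicate(units))                    # [0, 1, 2]
```

In practice the codebook is typically obtained by clustering (e.g. k-means) features from a pretrained self-supervised speech encoder, and the resulting unit sequences serve as the discrete targets for the translation model.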