Speech-to-Speech Translation

27 papers with code • 3 benchmarks • 5 datasets

Speech-to-speech translation (S2ST) consists on translating speech from one language to speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. Recently, works on S2ST without relying on intermediate text representation is emerging.

Benchmarks

Add a Result

These leaderboards are used to track progress in Speech-to-Speech Translation

Dataset	Best Model	Compare
TAT	Hokkien→En (Two-pass decoding)	See all
FLEURS X-eng	SeamlessM4T Large	See all
CVSS	SeamlessM4T Large	See all

Libraries

Use these libraries to find Speech-to-Speech Translation models and implementations

facebookresearch/fairseq

2 papers

29,237

espnet/espnet

2 papers

7,871

rongjiehuang/transpeech

2 papers

157

Datasets

Most implemented papers

Most implemented Social Latest No code

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

facebookresearch/seamless_communication • • 22 Aug 2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages?

Paper
Code

Direct speech-to-speech translation with a sequence-to-sequence model

sam2125/translatotron • • 12 Apr 2019

We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.

Paper
Code

Towards Automatic Face-to-Face Translation

Rudrabha/LipGAN • ACM Multimedia, 2019 2019

As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization.

Paper
Code

ESPnet-ST: All-in-One Speech Translation Toolkit

espnet/espnet • • ACL 2020

We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.

Paper
Code

Direct speech-to-speech translation with discrete units

rongjiehuang/transpeech • • ACL 2022

When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.

Paper
Code

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

facebookresearch/LASER • • NeurIPS 2021

Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl.

Paper
Code

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

google-research-datasets/cvss • LREC 2022

In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech.

Paper
Code

LibriS2S: A German-English Speech-to-Speech Translation Corpus

pedrodke/libris2s • LREC 2022

In contrast, the activities in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier.

Paper
Code

Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

fengpeng-yue/speech-to-speech-translation • • 18 May 2022

Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently.

Paper
Code

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

rongjiehuang/transpeech • • 25 May 2022

Specifically, a sequence of discrete representations derived in a self-supervised manner are predicted from the model and passed to a vocoder for speech reconstruction, while still facing the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with same content could be indeterministic due to the acoustic property (e. g., rhythm, pitch, and energy), which causes deterioration of translation accuracy; 2) high latency: current S2ST systems utilize autoregressive models which predict each unit conditioned on the sequence previously generated, failing to take full advantage of parallelism.

Paper
Code

Speech-to-Speech Translation

Benchmarks Add a Result

Libraries

Datasets

Most implemented papers

Content

Benchmarks

Add a Result