Speech-to-Speech Translation

27 papers with code • 3 benchmarks • 5 datasets

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a text-centric cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis subsystems. More recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
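The cascaded approach described above can be sketched as three composed stages. This is only an illustrative sketch: the `CascadeS2ST` class, the `Audio` alias, and the toy `toy_asr`, `toy_mt`, and `toy_tts` functions are hypothetical names introduced here, not part of any listed library, and a real system would plug trained models into each slot.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: audio is modelled as a plain list of float samples.
Audio = List[float]


@dataclass
class CascadeS2ST:
    """Text-centric cascade: ASR -> MT -> TTS (hypothetical wrapper)."""
    asr: Callable[[Audio], str]   # source speech -> source-language text
    mt: Callable[[str], str]      # source-language text -> target-language text
    tts: Callable[[str], Audio]   # target-language text -> target speech

    def translate(self, source_audio: Audio) -> Audio:
        source_text = self.asr(source_audio)   # stage 1: recognition
        target_text = self.mt(source_text)     # stage 2: text translation
        return self.tts(target_text)           # stage 3: synthesis


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_asr(audio: Audio) -> str:
    return "hola" if audio else ""


def toy_mt(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # One fake "sample" per character of the target text.
    return [float(ord(c)) for c in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
output_audio = pipeline.translate([0.1, 0.2, 0.3])
```

Because every stage communicates through text, recognition errors propagate into translation and synthesis, and information that text does not carry (such as prosody) is lost between stages. A direct S2ST model would instead replace the three callables with a single speech-to-speech mapping.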

Benchmarks

These leaderboards are used to track progress in Speech-to-Speech Translation.

Dataset	Best Model
TAT	Hokkien→En (Two-pass decoding)
FLEURS X-eng	SeamlessM4T Large
CVSS	SeamlessM4T Large

Libraries

Use these libraries to find Speech-to-Speech Translation models and implementations.

facebookresearch/fairseq • 2 papers • 29,293

espnet/espnet • 2 papers • 7,898

rongjiehuang/transpeech • 2 papers • 158

Datasets

Most implemented papers

Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models

liamdugan/speech-to-speech • 1 Jun 2023

Recent work in speech-to-speech translation (S2ST) has focused primarily on offline settings, where the full input utterance is available before any output is given.

Towards cross-language prosody transfer for dialog

joneavila/dral • 9 Jul 2023

Speech-to-speech translation systems today do not adequately support use for dialog purposes.

Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

choijeongsoo/utut • 3 Aug 2023

A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

ictnlp/daspeech • NeurIPS 2023

However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models.

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

choijeongsoo/av2av • 5 Dec 2023

To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A.

EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

facebookresearch/emphassess • 21 Dec 2023

We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis.
