Speech-to-Speech Translation

27 papers with code • 3 benchmarks • 5 datasets

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. Recently, work on S2ST that does not rely on intermediate text representations has been emerging.
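The text-centric cascade described above can be sketched as a simple function composition. The three component functions below are hypothetical stubs standing in for trained models (not part of any real toolkit's API), used only to show how text flows between the stages:

```python
# Minimal sketch of a cascade S2ST pipeline: ASR -> MT -> TTS.
# All three components are hypothetical stubs; in practice each
# would wrap a trained model (e.g. from a speech toolkit).

def asr(source_audio: bytes) -> str:
    # Hypothetical ASR stub: transcribe source-language speech to text.
    return "hello world"

def mt(source_text: str) -> str:
    # Hypothetical MT stub: translate source-language text
    # into the target language (toy lookup for illustration).
    return {"hello world": "hallo welt"}.get(source_text, source_text)

def tts(target_text: str) -> bytes:
    # Hypothetical TTS stub: synthesize target-language speech
    # (here just the text bytes, as a placeholder waveform).
    return target_text.encode("utf-8")

def cascade_s2st(source_audio: bytes) -> bytes:
    """Text-centric cascade: each stage consumes the previous stage's
    intermediate text output."""
    return tts(mt(asr(source_audio)))

translated_audio = cascade_s2st(b"\x00\x01")  # placeholder input audio
```

Direct (textless) S2ST replaces this chain with a single model mapping source speech to target speech (or to discrete acoustic units), avoiding error propagation through the intermediate transcripts.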

Most implemented papers

Speech-to-speech translation for a real-world unwritten language

facebookresearch/fairseq arXiv 2022

We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution spanning training data collection and modeling choices through to benchmark dataset release.

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

facebookresearch/fairseq arXiv 2022

We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings.

A Textless Metric for Speech-to-Speech Comparison

besacier/textless-metric 21 Oct 2022

In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts.

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

microsoft/speecht5 31 Oct 2022

However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.

Dialogs Re-enacted Across Languages

joneavila/dral 18 Nov 2022

To support machine learning of cross-language prosodic mappings and other ways to improve speech-to-speech translation, we present a protocol for collecting closely matched pairs of utterances across languages, a description of the resulting data collection and its public release, and some observations and musings.

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

facebookresearch/fairseq 15 Dec 2022

We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.

BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

facebookresearch/stopes 16 Dec 2022

In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

plachtaa/vall-e-x 7 Mar 2023

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

espnet/espnet 10 Apr 2023

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community.

Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

ajd12342/unit-speech-translation 24 May 2023

We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data.