Speech-to-Speech Translation
27 papers with code • 3 benchmarks • 5 datasets
Speech-to-speech translation (S2ST) is the task of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems; such a pipeline is text-centric. Recently, work on direct S2ST that does not rely on an intermediate text representation has been emerging.
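The cascaded pipeline above can be sketched as three composed stages. This is only an illustrative sketch: the stage functions (`asr`, `mt`, `tts`) are hypothetical stand-ins, not the API of any particular toolkit, and the toy implementations below just demonstrate the data flow (speech → source text → target text → speech).

```python
def cascaded_s2st(audio, asr, mt, tts):
    """Text-centric cascade: chain ASR, MT, and TTS stages."""
    source_text = asr(audio)        # source speech -> source-language text
    target_text = mt(source_text)   # source text   -> target-language text
    return tts(target_text)         # target text   -> target-language speech

# Toy stand-ins for the three sub-systems (hypothetical, for illustration only).
toy_asr = lambda audio: "hello world"
toy_mt = {"hello world": "hallo Welt"}.get
toy_tts = lambda text: f"<waveform:{text}>"

print(cascaded_s2st(b"\x00\x01", toy_asr, toy_mt, toy_tts))
```

A direct S2ST model would replace all three stages with a single speech-to-speech network, avoiding the intermediate text and its error propagation.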
Most implemented papers
Speech-to-speech translation for a real-world unwritten language
We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings.
A Textless Metric for Speech-to-Speech Comparison
In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts.
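The idea of comparing utterances without transcripts can be illustrated with a toy similarity over fixed-size speech embeddings. This sketch is not the paper's actual metric: the speech encoder that would produce the embeddings is assumed to exist elsewhere, and the cosine-similarity scorer below only shows the general shape of a textless comparison.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def textless_score(emb_a, emb_b):
    """Compare two utterances via their embeddings, with no transcript.

    emb_a and emb_b would come from a (hypothetical) pretrained speech
    encoder applied to the raw waveforms.
    """
    return cosine(emb_a, emb_b)

print(textless_score([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))
```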
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
Direct S2ST suffers from the data scarcity problem because parallel corpora pairing source-language speech with target-language speech are very rare.
Dialogs Re-enacted Across Languages
To support machine learning of cross-language prosodic mappings and other ways to improve speech-to-speech translation, we present a protocol for collecting closely matched pairs of utterances across languages, a description of the resulting data collection and its public release, and some observations and musings.
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community.
Textless Low-Resource Speech-to-Speech Translation With Unit Language Models
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data.