Speech-to-Speech Translation

39 papers with code • 3 benchmarks • 5 datasets

Speech-to-speech translation (S2ST) is the task of translating speech in one language into speech in another language. It can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which makes the approach text-centric. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
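The cascade approach can be sketched as a simple composition of the three stages. The stage functions below are hypothetical stubs standing in for real ASR, MT, and TTS models; only the pipeline structure is the point.

```python
def asr(audio: bytes) -> str:
    """Automatic speech recognition: source audio -> source-language text.
    Stub: a real system would run an ASR model (e.g. Whisper) here."""
    return "hello world"

def mt(text: str, src: str, tgt: str) -> str:
    """Text-to-text machine translation. Stub lookup for illustration."""
    return {"hello world": "hallo welt"}.get(text, text)

def tts(text: str) -> bytes:
    """Text-to-speech synthesis: target text -> target-language audio.
    Stub: returns placeholder bytes instead of a synthesized waveform."""
    return text.encode("utf-8")

def cascade_s2st(audio: bytes, src: str, tgt: str) -> bytes:
    """Cascade S2ST: ASR -> MT -> TTS, with text as the intermediate
    representation (this is what makes the cascade text-centric)."""
    source_text = asr(audio)
    target_text = mt(source_text, src, tgt)
    return tts(target_text)
```

Direct (textless) S2ST systems replace this pipeline with a single model that maps source speech to target speech without producing the intermediate `source_text`/`target_text` strings.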

Most implemented papers

Robust Speech Recognition via Large-Scale Weak Supervision

openai/whisper Preprint 2022

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.

AudioLM: a Language Modeling Approach to Audio Generation

suno-ai/bark 7 Sep 2022

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

facebookresearch/seamless_communication 22 Aug 2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages?

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

funaudiollm/cosyvoice 4 Jul 2024

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).

Textless Speech-to-Speech Translation With Limited Parallel Data

ajd12342/textless-s2st 24 May 2023

We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective.

Direct speech-to-speech translation with a sequence-to-sequence model

sam2125/translatotron 12 Apr 2019

We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.

Towards Automatic Face-to-Face Translation

Rudrabha/LipGAN ACM Multimedia 2019

As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization.

ESPnet-ST: All-in-One Speech Translation Toolkit

espnet/espnet ACL 2020

We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.

Direct speech-to-speech translation with discrete units

rongjiehuang/transpeech ACL 2022

When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

facebookresearch/LASER NeurIPS 2021

Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl.