Speech-to-Speech Translation
39 papers with code • 3 benchmarks • 5 datasets
Speech-to-speech translation (S2ST) consists on translating speech from one language to speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. Recently, works on S2ST without relying on intermediate text representation is emerging.
Libraries
Use these libraries to find Speech-to-Speech Translation models and implementationsMost implemented papers
Robust Speech Recognition via Large-Scale Weak Supervision
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
AudioLM: a Language Modeling Approach to Audio Generation
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages?
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
Textless Speech-to-Speech Translation With Limited Parallel Data
We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
Towards Automatic Face-to-Face Translation
As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization.
ESPnet-ST: All-in-One Speech Translation Toolkit
We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.
Direct speech-to-speech translation with discrete units
When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl.