What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. By filtering this corpus and combining it with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. In robustness tests, our system handles background noise and speaker variation in speech-to-text tasks better than the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Speech-to-Text Translation | CoVoST 2 eng-X | SeamlessM4T Large | BLEU | 30.6 | # 1 |
| Speech-to-Text Translation | CoVoST 2 eng-X | SeamlessM4T Medium | BLEU | 26.6 | # 2 |
| Speech-to-Text Translation | CoVoST 2 X-eng | SeamlessM4T Large | BLEU | 34.1 | # 1 |
| Speech-to-Text Translation | CoVoST 2 X-eng | SeamlessM4T Medium | BLEU | 29.8 | # 2 |
| Speech-to-Speech Translation | CVSS | SeamlessM4T Large | ASR-BLEU | 36.5 | # 1 |
| Speech-to-Speech Translation | CVSS | SeamlessM4T Large | Parameters | 2.3B | # 1 |
| Speech-to-Speech Translation | CVSS | SeamlessM4T Medium | ASR-BLEU | 28.1 | # 2 |
| Speech-to-Speech Translation | CVSS | SeamlessM4T Medium | Parameters | 1.2B | # 1 |
| Automatic Speech Recognition | FLEURS | SeamlessM4T Large | Parameters | 2.3B | # 1 |
| Automatic Speech Recognition | FLEURS | SeamlessM4T Large | Word Error Rate (WER) | 23.1 | # 2 |
| Automatic Speech Recognition | FLEURS | SeamlessM4T Medium | Parameters | 1.2B | # 1 |
| Automatic Speech Recognition | FLEURS | SeamlessM4T Medium | Word Error Rate (WER) | 21.9 | # 1 |
| Automatic Speech Recognition | FLEURS-54 | SeamlessM4T Large | Word Error Rate (WER) | 23.7 | # 2 |
| Automatic Speech Recognition | FLEURS-54 | SeamlessM4T Medium | Word Error Rate (WER) | 22 | # 1 |
| Speech-to-Text Translation | FLEURS eng-X | SeamlessM4T Large | BLEU | 21.5 | # 1 |
| Speech-to-Text Translation | FLEURS eng-X | SeamlessM4T Medium | BLEU | 19.2 | # 2 |
| Speech-to-Text Translation | FLEURS X-eng | SeamlessM4T Medium | BLEU | 20.9 | # 2 |
| Speech-to-Speech Translation | FLEURS X-eng | SeamlessM4T Large | ASR-BLEU | 25.8 | # 1 |
| Speech-to-Speech Translation | FLEURS X-eng | SeamlessM4T Medium | ASR-BLEU | 20.4 | # 2 |
| Speech-to-Text Translation | FLEURS X-eng | SeamlessM4T Large | BLEU | 24.0 | # 1 |
| Machine Translation | flores95-devtest eng-X | SeamlessM4T Large | ChrF++ | 50.9 | # 1 |
| Machine Translation | flores95-devtest eng-X | SeamlessM4T-NLLB-1.3B | ChrF++ | 49.6 | # 2 |
| Machine Translation | flores95-devtest eng-X | SeamlessM4T Medium | ChrF++ | 48.4 | # 3 |
| Machine Translation | flores95-devtest X-eng | SeamlessM4T-NLLB-1.3B | ChrF++ | 60.7 | # 2 |
| Machine Translation | flores95-devtest X-eng | SeamlessM4T Medium | ChrF++ | 55.4 | # 3 |
| Machine Translation | flores95-devtest X-eng | SeamlessM4T Large | ChrF++ | 60.8 | # 1 |
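
For readers unfamiliar with the Word Error Rate (WER) metric reported in the Automatic Speech Recognition rows, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; this is not the evaluation code behind the figures above, which may apply text normalization not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration only: tokens are split on
    whitespace, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    # Benchmark tables usually report WER as a percentage.
    print(f"WER: {100 * wer:.1f}%")
```

Lower is better, and because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.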
