1 code implementation • NAACL (ACL) 2022 • Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff
Due to the growing popularity of SSL, and the shared mission of the areas in bringing speech and language technologies to more use cases with better quality and scaling the technologies for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievement in speech processing.
no code implementations • 6 Oct 2024 • Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, Hiba Zayed, Mohamedou cheikh tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mohamed, Abdellah El Mekki, El Moatez Billah Nagoudi, Benelhadj Djelloul Mama Saadia, Hamzah A. Alsayadi, Walid Al-Dhabyani, Sara Shatnawi, Yasir Ech-Chammakhy, Amal Makouar, Yousra Berrachedi, Mustafa Jarrar, Shady Shehata, Ismail Berrada, Muhammad Abdul-Mageed
In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered.
no code implementations • 29 Sep 2024 • Xinyue Zhang, Jiaqi Yang, Xiangting Meng, Abdelrahman Mohamed, Laurent Kneip
In the realm of computer vision, the perception and reconstruction of the 3D world through vision signals heavily rely on camera intrinsic parameters, which have long been a subject of intense research within the community.
1 code implementation • 15 Apr 2024 • Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-Yi Lee
In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech.
1 code implementation • 25 Mar 2024 • Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
1 code implementation • 1 Mar 2024 • Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed
Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension.
no code implementations • 24 Jan 2024 • Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
However, the real-world problem of Open-domain SQA (openSQA), in which the machine needs to first retrieve passages that possibly contain the answer from a spoken archive in addition, was never considered.
no code implementations • 15 Nov 2023 • Abdelrahman Mohamed, Fakhraddin Alwajih, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed
We also manually prepare a new dataset for evaluation.
1 code implementation • 16 Oct 2023 • Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing.
no code implementations • 16 Oct 2023 • Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli
Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks.
no code implementations • 9 Oct 2023 • Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.
no code implementations • 29 Sep 2023 • Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-Yi Lee, Abdelrahman Mohamed
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks.
1 code implementation • 19 Sep 2023 • Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information.
2 code implementations • 19 May 2023 • Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
no code implementations • 18 May 2023 • Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.
no code implementations • 14 Dec 2022 • Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Abdelrahman Mohamed
With the development of hardware for machine learning, newer models often come at the cost of both increased sizes and computational complexity.
no code implementations • 2 Dec 2022 • Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
no code implementations • 10 Nov 2022 • Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer
Later, we use our optimal tokenization strategy to train multiple embedding and output model to further improve our result.
no code implementations • 4 Nov 2022 • Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland
Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 21 Oct 2022 • Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala K. Anumanchipalli
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks.
no code implementations • 16 Oct 2022 • Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-Yi Lee
We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency.
1 code implementation • 25 Jul 2022 • Abdelrahman Mohamed, Rushali Grandhe, K J Joseph, Salman Khan, Fahad Khan
In contrast to a recent ViT based CIL approach, our $\textrm{D}^3\textrm{Former}$ does not dynamically expand its architecture when new tasks are learned and remains suitable for a large number of incremental tasks.
1 code implementation • 29 Jun 2022 • Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed
Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+4
no code implementations • 22 Jun 2022 • Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words.
no code implementations • 7 Jun 2022 • Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed
We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for any fine-tuning.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+5
no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe
Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+4
1 code implementation • 15 May 2022 • Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.
no code implementations • ACL 2022 • Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino
Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference.
2 code implementations • 8 Apr 2022 • Krishna Pillutla, Kshitiz Malik, Abdelrahman Mohamed, Michael Rabbat, Maziar Sanjabi, Lin Xiao
We consider two federated learning algorithms for training partially personalized models, where the shared and personal parameters are updated either simultaneously or alternately on the devices.
no code implementations • 30 Mar 2022 • Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues.
1 code implementation • ACL 2022 • Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee
In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB.
1 code implementation • 9 Mar 2022 • Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
We empirically showed that DUAL yields results comparable to those obtained by cascading ASR and text QA model and robust to real-world data.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • NAACL (ACL) 2022 • Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources.
no code implementations • 21 Jan 2022 • Hashmat Shadab Malik, Ikboljon Sobirov, Abdelrahman Mohamed
In this work, we investigate the impact of Faster R-CNN for aerial object detection and explore numerous strategies to improve its performance for aerial images.
2 code implementations • ICLR 2022 • Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
The lip-reading WER is further reduced to 26. 9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Ranked #3 on
Speech Recognition
on LRS3-TED
(using extra training data)
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+5
1 code implementation • 5 Jan 2022 • Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
Ranked #4 on
Audio-Visual Speech Recognition
on LRS3-TED
(using extra training data)
Audio-Visual Speech Recognition
Automatic Speech Recognition
+6
no code implementations • 14 Nov 2021 • Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
no code implementations • arXiv 2021 • Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
no code implementations • 10 Nov 2021 • Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed
With 4. 5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • ACL 2022 • Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu
Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences.
Ranked #8 on
Language Modelling
on SALMon
(using extra training data)
no code implementations • 14 Jun 2021 • Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR).
10 code implementations • 14 Jun 2021 • Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.
Ranked #5 on
Speech Recognition
on LibriSpeech test-other
6 code implementations • 3 May 2021 • Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.
2 code implementations • 1 Apr 2021 • Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
We propose using self-supervised discrete representations for the task of speech resynthesis.
no code implementations • 9 Mar 2021 • Alex Xiao, Christian Fuegen, Abdelrahman Mohamed
Pseudo-labeling is the most adopted method for pre-training automatic speech recognition (ASR) models.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 31 Dec 2020 • Abdelrahman Mohamed, Dimiter Vassilev
We show that a compact quaternionic contact manifold of dimension seven that satisfies a Lichnerowicz-type lower Ricci-type bound and has the $P$-function of any eigenfunction of the sub-Laplacian non-negative achieves its smallest possible eigenvalue only if the structure is qc-Einstein.
Differential Geometry Analysis of PDEs
no code implementations • 27 Oct 2019 • Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data.
4 code implementations • 26 Apr 2019 • Abdelrahman Mohamed, Dmytro Okhonko, Luke Zettlemoyer
The recent success of transformer networks for neural machine translation and other NLP tasks has led to a surge in research work trying to apply it for speech recognition.