no code implementations • 19 Dec 2024 • Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, KaiFeng Chen, Abhijit Ogale, Jonathon Shlens
One can observe this limitation in representations learned through captions or contrastive learning, where the model treats an image essentially as a bag of words.
1 code implementation • 22 May 2024 • Ying Ma, Owen Burns, Mingqiu Wang, Gang Li, Nan Du, Laurent El Shafey, Liqiang Wang, Izhak Shafran, Hagen Soltau
To alleviate the distributional mismatch issue in general self-supervised RL (SSRL), in our supervised learning (SL) stage, the agent selects actions based on the policy network and learns from generated labels; this self-generation of labels is the intuition behind the name self-supervised.
no code implementations • 2 Feb 2024 • Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM.
no code implementations • 16 Oct 2023 • Hagen Soltau, Izhak Shafran, Alex Ottenwess, Joseph R. JR Duffy, Rene L. Utianski, Leland R. Barnard, John L. Stricker, Daniela Wiepert, David T. Jones, Hugo Botha
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
Automatic Speech Recognition (ASR)
no code implementations • 30 Sep 2023 • Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
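The abstract describes fusing pretrained foundational speech and language models. A common pattern for this kind of fusion (a hypothetical sketch, not necessarily SLM's exact design) is to project the frozen speech encoder's outputs into the LLM's embedding space with a small trainable adapter and prepend them to the text token embeddings; all dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: frozen speech encoder (d_speech) and frozen LLM (d_llm).
d_speech, d_llm, n_frames, n_tokens = 512, 1024, 50, 8

speech_features = rng.normal(size=(n_frames, d_speech))  # from the frozen speech model
token_embeddings = rng.normal(size=(n_tokens, d_llm))    # from the frozen LLM embedding table

# Trainable adapter: a single linear projection into the LLM embedding space.
adapter = rng.normal(size=(d_speech, d_llm)) / np.sqrt(d_speech)

speech_prefix = speech_features @ adapter                # (n_frames, d_llm)
# The LLM then attends over the speech prefix followed by the text tokens.
llm_input = np.concatenate([speech_prefix, token_embeddings], axis=0)
print(llm_input.shape)  # (58, 1024)
```

Keeping both backbone models frozen and training only the adapter is what preserves the LLM's in-context learning ability in this setup.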
no code implementations • 13 Jun 2023 • Nanxin Chen, Izhak Shafran, Yu Zhang, Chung-Cheng Chiu, Hagen Soltau, James Qin, Yonghui Wu
However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasible as the size of the model and the number of downstream tasks scale.
no code implementations • 8 Jun 2023 • Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations.
no code implementations • 2 Mar 2023 • Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
Automatic Speech Recognition (ASR)
no code implementations • 20 Dec 2022 • Jeffrey Zhao, Yuan Cao, Raghav Gupta, Harrison Lee, Abhinav Rastogi, Mingqiu Wang, Hagen Soltau, Izhak Shafran, Yonghui Wu
We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training.
no code implementations • 16 Dec 2022 • Hagen Soltau, Izhak Shafran, Mingqiu Wang, Abhinav Rastogi, Jeffrey Zhao, Ye Jia, Wei Han, Yuan Cao, Aramys Miranda
The research on this topic is stymied by the lack of a public corpus.
Automatic Speech Recognition (ASR)
no code implementations • 13 Oct 2022 • Dian Yu, Mingqiu Wang, Yuan Cao, Izhak Shafran, Laurent El Shafey, Hagen Soltau
Knowledge (including structured knowledge such as schema and ontology, and unstructured knowledge such as web corpus) is a critical part of dialog understanding, especially for unseen tasks and domains.
no code implementations • NAACL 2022 • Dian Yu, Mingqiu Wang, Yuan Cao, Izhak Shafran, Laurent El Shafey, Hagen Soltau
Carefully designed schemas describing how to collect and annotate dialog corpora are a prerequisite for building task-oriented dialog systems.
no code implementations • 8 Feb 2022 • Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey
Through empirical experiments on a challenging real-world medical NER task with multiple nested ontologies, we demonstrate that our fixed alignment model outperforms the standard RNN-T model, improving F1-score from 0.70 to 0.74.
no code implementations • 28 Sep 2021 • Mingqiu Wang, Hagen Soltau, Laurent El Shafey, Izhak Shafran
Confidence estimates are an often-requested feature in applications such as medical transcription, where recognition errors can impact patient care; the estimates can be used to alert medical professionals to verify potentially misrecognized words.
Automatic Speech Recognition (ASR)
no code implementations • 6 Apr 2021 • Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey
Our transformer-based streaming model performs at about 20% WER on the ASR task, 6% WDER on the diarization task, 43% SER on periods, 52% SER on commas, 43% SER on question marks and 30% SER on capitalization.
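WER and SER figures like these are edit-distance-based: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, normalized by reference length. A minimal sketch of the standard WER computation (assuming whitespace-tokenized transcripts) might look like:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER.
print(word_error_rate("the cat sat down", "the hat sat down"))  # 0.25
```

SER on punctuation and capitalization is computed analogously, with punctuation marks or case decisions as the symbols being compared.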
no code implementations • LREC 2020 • Izhak Shafran, Nan Du, Linh Tran, Amanda Perry, Lauren Keyes, Mark Knichel, Ashley Domin, Lei Huang, Yu-Hui Chen, Gang Li, Mingqiu Wang, Laurent El Shafey, Hagen Soltau, Justin S. Paul
We used this annotation scheme to label a corpus of about 6k clinical encounters.
no code implementations • 9 Jul 2019 • Laurent El Shafey, Hagen Soltau, Izhak Shafran
The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems, namely, an automatic speech recognition (ASR) system and a speaker diarization (SD) system.
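The merging step alluded to here is often approximated by assigning each recognized word to the diarization segment with which it has the greatest time overlap. A hypothetical sketch of that heuristic (the function name and tuple layout are illustrative):

```python
def assign_words_to_speakers(words, segments):
    """words: [(word, start, end)]; segments: [(speaker, start, end)].
    Assigns each word to the speaker whose segment overlaps it most."""
    labeled = []
    for word, ws, we in words:
        best, best_overlap = None, 0.0
        for speaker, ss, se in segments:
            # Length of the intersection of [ws, we] and [ss, se].
            overlap = max(0.0, min(we, se) - max(ws, ss))
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((word, best))
    return labeled

words = [("hello", 0.0, 0.4), ("doctor", 0.5, 1.0), ("hi", 1.2, 1.5)]
segments = [("spk1", 0.0, 1.1), ("spk2", 1.1, 2.0)]
print(assign_words_to_speakers(words, segments))
# [('hello', 'spk1'), ('doctor', 'spk1'), ('hi', 'spk2')]
```

Errors from either system compound under this merging scheme, which is the motivation for the joint model proposed in the paper.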
Automatic Speech Recognition (ASR)
no code implementations • 31 Oct 2016 • Hagen Soltau, Hank Liao, Hasim Sak
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units.
no code implementations • 5 Sep 2013 • Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran
We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.
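Note that the 2-3% figure is a relative improvement, measured as a fraction of the baseline WER rather than in absolute points. With purely hypothetical numbers:

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in WER, as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a 12.0% -> 11.7% absolute WER change
# is only 0.3 points, but a 2.5% relative improvement.
print(round(relative_wer_improvement(12.0, 11.7), 3))  # 0.025
```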