no code implementations • 16 Sep 2023 • Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara Sainath, Françoise Beaufays, Pedro Moreno Mengibar
By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale.
Overall, they present a modular, powerful and cheap alternative to the standard encoder output, as well as the N-best hypotheses.
no code implementations • 2 Mar 2023 • Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
Automatic Speech Recognition models require large amounts of speech data for training, and the collection of such data often leads to privacy concerns.
This paper addresses the challenges of training large neural network models under federated learning settings: high on-device memory usage and communication cost.
We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models.
no code implementations • 11 Apr 2022 • Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays
We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones.
This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.
Prior works have demonstrated that labels can be revealed analytically from the last layer of certain models (e.g., ResNet), or they can be reconstructed jointly with model inputs by using Gradients Matching [Zhu et al. '19] with additional knowledge about the current state of the model.
This paper aims to address the major challenges of Federated Learning (FL) on edge devices: limited memory and expensive communication.
Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers.
Fast contextual adaptation has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words, and when combined with on-device personalized training, it can yield even better recognition results.
Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance.
These models are typically trained on the server using transcribed speech data.
no code implementations • 27 Sep 2021 • Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio.
While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns.
We show that a dropout rate of 0.2 can reduce the speaker identity accuracy to 0% top-1 (0.5% top-5).
This paper presents the first consumer-scale next-word prediction (NWP) model trained with Federated Learning (FL) while leveraging the Differentially Private Federated Averaging (DP-FedAvg) technique.
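As a rough illustration of the general DP-FedAvg recipe named above (clip each client update, average, add Gaussian noise), here is a minimal sketch. It is not the paper's production implementation: the function names `clip_update` and `dp_average`, the uniform client weighting, and the noise scale `noise_multiplier * clip_norm / n` are assumptions made for this example, and no privacy accounting is performed.

```python
import math
import random
from typing import List, Sequence


def clip_update(update: Sequence[float], clip_norm: float) -> List[float]:
    """Scale an update down so its L2 norm is at most clip_norm (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(
    updates: Sequence[Sequence[float]],
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Average clipped client updates and add Gaussian noise to each coordinate.

    Assumption: clients are weighted uniformly, so the noise standard
    deviation on the average is noise_multiplier * clip_norm / n.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm / n
    return [
        sum(u[i] for u in clipped) / n + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    client_updates = [[3.0, 4.0], [0.1, -0.2], [-1.0, 0.5]]
    print(dp_average(client_updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping bounds how much any single client can move the average, which is what lets the added noise mask an individual contribution.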
In this paper, we initiate a formal study to understand the effect of different components of canonical FL on unintended memorization in trained models, comparing with the central learning setting.
Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models.
no code implementations • 14 Dec 2019 • Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, Giovanni Motta, Lillian Zhou
With speech input, if the user corrects only the names, the name recall rate improves to 64.4%.
Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe.
Federated learning is a distributed, on-device computation framework that enables training global models without exporting sensitive user data to servers.
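To make the idea concrete, the sketch below shows one common server-side aggregation rule, federated averaging, in which only model weights and example counts leave the device. The excerpt above does not name an aggregation rule, so the choice of federated averaging, the function name `federated_average`, and the flat list-of-floats weight format are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def federated_average(
    client_results: Sequence[Tuple[List[float], int]],
) -> List[float]:
    """Combine locally trained weights into a global model.

    Each entry is (weights, num_local_examples). The raw training data
    never appears here; clients report only weights and a count.
    """
    total = sum(n for _, n in client_results)
    if total == 0:
        raise ValueError("no training examples reported")
    dim = len(client_results[0][0])
    averaged = [0.0] * dim
    for weights, n in client_results:
        # Weight each client by its share of the total examples.
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


if __name__ == "__main__":
    # Two hypothetical clients with 1 and 3 local examples.
    print(federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)]))
```

Weighting by example count means clients with more local data have proportionally more influence on the global model.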
The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard.
Speaker-independent speech recognition systems trained with data from many users are generally robust against speaker variability and work well for a large population of speakers.
We show that a word-level recurrent neural network can predict emoji from text typed on a mobile keyboard.
We demonstrate that a character-level recurrent neural network is able to learn out-of-vocabulary (OOV) words under federated learning settings, for the purpose of expanding the vocabulary of a virtual keyboard for smartphones without exporting sensitive text to servers.
Federated learning is a distributed form of machine learning where both the training data and model training are decentralized.
We train a recurrent neural network language model using a distributed, on-device learning framework called federated learning for the purpose of next-word prediction in a virtual keyboard for smartphones.
We describe the general framework of what we call, for short, the keyboard "FST decoder", as well as the implementation details that are new compared to a speech FST decoder.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition.
However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks.