Search Results for author: Yanping Huang

Found 20 papers, 8 papers with code

Mixture-of-Experts with Expert Choice Routing

no code implementations • 18 Feb 2022 • Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon

Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens.
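
The paper flips this around and lets each expert select the tokens it will process. A minimal NumPy sketch of that expert-choice idea (the shapes, router scores, and `capacity` parameter below are illustrative assumptions, not the paper's reference code):

```python
import numpy as np

def expert_choice_routing(token_logits, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    token_logits: [num_tokens, num_experts] router scores.
    Returns a boolean dispatch mask of the same shape.
    """
    num_tokens, num_experts = token_logits.shape
    # Softmax over experts gives per-token routing probabilities.
    probs = np.exp(token_logits - token_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    dispatch = np.zeros((num_tokens, num_experts), dtype=bool)
    for e in range(num_experts):
        # Expert e takes the `capacity` tokens that score highest for it, so
        # per-expert load is fixed while tokens get a variable number of experts.
        top_tokens = np.argsort(-probs[:, e])[:capacity]
        dispatch[top_tokens, e] = True
    return dispatch

rng = np.random.default_rng(0)
mask = expert_choice_routing(rng.normal(size=(16, 4)), capacity=4)
print(mask.sum(axis=0))  # each expert processes exactly 4 tokens
print(mask.sum(axis=1))  # tokens receive a variable number of experts
```

Perfect load balance falls out by construction: each expert's token count is fixed, and token importance is expressed through how many experts choose a token.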

ST-MoE: Designing Stable and Transferable Sparse Expert Models

1 code implementation • 17 Feb 2022 • Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

However, advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and by uncertain quality during fine-tuning.

Natural Language Processing • Question Answering +1

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

1 code implementation • 28 Jan 2022 • Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Eric P. Xing, Yuanzhong Xu, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica

Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations.

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

no code implementations • Findings (EMNLP) 2021 • Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, Orhan Firat

On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best-performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs.
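
A rough sketch of the task-level idea (the router table, task ids, and shapes below are illustrative assumptions): routing is decided once per task rather than per token, so every token of a sentence shares the same experts and, at serving time, only that task's experts need to be loaded.

```python
import numpy as np

def task_level_route(task_id, task_embeddings, num_selected=2):
    """Pick experts once per task, not per token (illustrative shapes).

    task_embeddings: [num_tasks, num_experts] learned router scores per task.
    All tokens of `task_id` go to the same `num_selected` experts, so the
    remaining experts can be dropped when serving that task.
    """
    scores = task_embeddings[task_id]
    return np.argsort(-scores)[:num_selected]

rng = np.random.default_rng(0)
router = rng.normal(size=(30, 32))  # e.g. 30 language pairs, 32 experts
print(task_level_route(task_id=7, task_embeddings=router))
```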

Exploring Routing Strategies for Multilingual Mixture-of-Experts Models

no code implementations • 1 Jan 2021 • Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Thang Luong, Orhan Firat

Sparsely-Gated Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
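
For contrast with the routing variants above, here is a minimal NumPy sketch of the standard per-token top-k gating that sparsely-gated MoE layers use (the shapes and weight matrices are illustrative assumptions, and the math is done densely for clarity):

```python
import numpy as np

def top_k_gating(x, gate_w, expert_ws, k=2):
    """Sparsely-gated MoE layer for one token (illustrative sketch).

    x: [d_model] token activation; gate_w: [d_model, num_experts];
    expert_ws: list of [d_model, d_model] expert weight matrices.
    """
    logits = x @ gate_w
    top = np.argsort(-logits)[:k]            # token picks its top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # softmax renormalized over the k winners
    # Output is the gate-weighted sum over the selected experts only,
    # so compute grows with k, not with the total number of experts.
    return sum(g * (x @ expert_ws[e]) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, n_exp = 8, 4
y = top_k_gating(rng.normal(size=d), rng.normal(size=(d, n_exp)),
                 [rng.normal(size=(d, d)) for _ in range(n_exp)])
print(y.shape)  # (8,)
```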

Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

1 code implementation • NeurIPS 2020 • Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov

The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights.

Transfer Learning
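
Gradient sign dropout resolves sign conflicts among those gradient signals. A simplified NumPy sketch of the idea (the paper's exact masking differs in detail; this is an assumption-laden approximation): per parameter, compute how much of the total gradient mass is positive, then stochastically keep only one sign.

```python
import numpy as np

def grad_drop(task_grads, rng):
    """Gradient sign dropout over per-loss gradients (simplified sketch).

    task_grads: [num_losses, num_params] gradients of each loss w.r.t.
    shared weights. Where signs conflict, keep one sign per parameter,
    chosen with probability proportional to its share of total magnitude.
    """
    g_sum = task_grads.sum(axis=0)
    g_abs = np.abs(task_grads).sum(axis=0) + 1e-12
    purity = 0.5 * (1.0 + g_sum / g_abs)       # fraction of "positive" mass
    keep_positive = rng.uniform(size=purity.shape) < purity
    mask = np.where(keep_positive, task_grads > 0, task_grads < 0)
    return (task_grads * mask).sum(axis=0)     # combined, sign-consistent update

rng = np.random.default_rng(0)
g = rng.normal(size=(3, 5))                    # 3 losses, 5 shared parameters
print(grad_drop(g, rng))
```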

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

2 code implementations • ICLR 2021 • Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute.

Machine Translation • Translation

Do CNNs Encode Data Augmentations?

no code implementations • 29 Feb 2020 • Eddie Yan, Yanping Huang

To answer this question, we introduce a systematic approach to investigate which layers of neural networks are the most predictive of augmentation transformations.

Data Augmentation
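
One plausible form of such a probe, sketched below with stand-in data (the least-squares probe and random features are illustrative assumptions, not the paper's exact protocol): per layer, fit a simple classifier to predict which augmentation was applied, and compare held-out accuracy across layers.

```python
import numpy as np

def probe_layer(features, aug_labels, rng, holdout=0.2):
    """Fit a linear probe to predict the augmentation from layer features.

    features: [n, d] activations from one frozen layer; aug_labels: [n] ints.
    Returns held-out accuracy; higher means the layer encodes the augmentation.
    """
    n = len(aug_labels)
    idx = rng.permutation(n)
    split = int(n * (1 - holdout))
    tr, te = idx[:split], idx[split:]
    onehot = np.eye(aug_labels.max() + 1)[aug_labels[tr]]
    # Least-squares linear probe (a stand-in for logistic regression).
    w, *_ = np.linalg.lstsq(features[tr], onehot, rcond=None)
    preds = (features[te] @ w).argmax(axis=1)
    return (preds == aug_labels[te]).mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))       # stand-in for one layer's activations
labels = rng.integers(0, 3, size=200)    # which of 3 augmentations was applied
print(probe_layer(feats, labels, rng))
```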

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

3 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models.

Sequence-To-Sequence • Speech Recognition

Regularized Evolution for Image Classifier Architecture Search

4 code implementations • 5 Feb 2018 • Esteban Real, Alok Aggarwal, Yanping Huang, Quoc V. Le

The effort devoted to hand-crafting neural network image classifiers has motivated the use of architecture search to discover them automatically.

Image Classification • Neural Architecture Search
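
The paper's regularized (aging) evolution keeps a fixed-size population, mutates the winner of a small tournament, and always removes the oldest individual rather than the weakest. A minimal sketch over a toy search space (the `fitness`, `mutate`, and `random_arch` stand-ins are illustrative assumptions; real architectures would replace the vectors):

```python
import random
from collections import deque

def regularized_evolution(fitness, mutate, random_arch,
                          pop_size=20, cycles=200, sample_size=5):
    """Aging evolution: tournament selection plus removal of the oldest."""
    population = deque(random_arch() for _ in range(pop_size))
    history = list(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=fitness)
        child = mutate(parent)
        population.append(child)   # newest joins...
        population.popleft()       # ...oldest dies, regardless of fitness
        history.append(child)
    return max(history, key=fitness)

# Toy stand-in: an "architecture" is a vector; fitness rewards large entries.
random.seed(0)
best = regularized_evolution(
    fitness=sum,
    mutate=lambda a: [x + random.gauss(0, 0.1) for x in a],
    random_arch=lambda: [random.random() for _ in range(4)])
print(best)
```

Removing the oldest rather than the worst is the "regularization": no architecture survives indefinitely on one lucky evaluation, so lineages must re-earn their place.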

Partitioning Large Scale Deep Belief Networks Using Dropout

no code implementations • 28 Aug 2015 • Yanping Huang, Sai Zhang

Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing.

Object Recognition • Speech Recognition

Learning Efficient Representations for Reinforcement Learning

no code implementations • 28 Aug 2015 • Yanping Huang

Markov decision processes (MDPs) are a well-studied framework for solving sequential decision-making problems under uncertainty.

Decision Making • Reinforcement Learning
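
For background on the MDP framework mentioned above, a textbook value-iteration solver for a tabular MDP (standard material, not this paper's method; the toy transition and reward arrays are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve a tabular MDP by value iteration.

    P: [A, S, S] transition probabilities; R: [A, S] expected rewards.
    Returns the optimal state values and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))  # 2 actions, 3 states
R = rng.normal(size=(2, 3))
V, pi = value_iteration(P, R)
print(V, pi)
```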
