Search Results for author: Rohan Anil

Found 22 papers, 11 papers with code

Learning from Randomly Initialized Neural Network Features

no code implementations 13 Feb 2022 Ehsan Amid, Rohan Anil, Wojciech Kotłowski, Manfred K. Warmuth

We present the surprising result that randomly initialized neural networks are good feature extractors in expectation.

Step-size Adaptation Using Exponentiated Gradient Updates

no code implementations 31 Jan 2022 Ehsan Amid, Rohan Anil, Christopher Fifty, Manfred K. Warmuth

In this paper, we update the step-size scale and the gain variables with exponentiated gradient updates instead.

Learning Rate Grafting: Transferability of Optimizer Tuning

no code implementations 29 Sep 2021 Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang

In the empirical science of training large neural networks, the learning rate schedule is a notoriously challenging-to-tune hyperparameter, which can depend on all other properties (architecture, optimizer, batch size, dataset, regularization, ...) of the problem.

Large-Scale Differentially Private BERT

no code implementations 3 Aug 2021 Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, Pasin Manurangsi

In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD).

Language Modelling

LocoProp: Enhancing BackProp via Local Loss Optimization

1 code implementation 11 Jun 2021 Ehsan Amid, Rohan Anil, Manfred K. Warmuth

Second-order methods have shown state-of-the-art performance for optimizing deep neural networks.

Second-order methods

Knowledge distillation: A good teacher is patient and consistent

3 code implementations CVPR 2022 Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov

In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation.

Knowledge Distillation

Information Transfer in Multi-Task Learning

no code implementations 1 Jan 2021 Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, Chelsea Finn

Multi-task learning can leverage information learned by one task to benefit the training of other tasks.

Multi-Task Learning

Towards Practical Second Order Optimization for Deep Learning

no code implementations 1 Jan 2021 Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, Yoram Singer

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent.

Click-Through Rate Prediction Image Classification +3

Measuring and Harnessing Transference in Multi-Task Learning

no code implementations 29 Oct 2020 Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, Chelsea Finn

Multi-task learning can leverage information learned by one task to benefit the training of other tasks.

Multi-Task Learning

Stochastic Optimization with Laggard Data Pipelines

no code implementations NeurIPS 2020 Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar, Cyril Zhang

State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes.

Stochastic Optimization

Disentangling Adaptive Gradient Methods from Learning Rates

1 code implementation 26 Feb 2020 Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang

We investigate several confounding factors in the evaluation of optimization algorithms for deep learning.

Scalable Second Order Optimization for Deep Learning

1 code implementation 20 Feb 2020 Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, Yoram Singer

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent.

Image Classification Language Modelling +2

Revisiting the Generalization of Adaptive Gradient Methods

no code implementations ICLR 2020 Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang

A commonplace belief in the machine learning community is that using adaptive gradient methods hurts generalization.

Memory Efficient Adaptive Optimization

1 code implementation NeurIPS 2019 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling.

Language Modelling Machine Translation +1

Robust Bi-Tempered Logistic Loss Based on Bregman Divergences

6 code implementations NeurIPS 2019 Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren

We introduce a temperature into the exponential function and replace the softmax output layer of neural nets by a high temperature generalization.

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

3 code implementations 21 Feb 2019 Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

Memory-Efficient Adaptive Optimization

3 code implementations 30 Jan 2019 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling.

Language Modelling Machine Translation +1

Large scale distributed neural network training through online distillation

no code implementations ICLR 2018 Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton

Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made.

Language Modelling

Wide & Deep Learning for Recommender Systems

32 code implementations 24 Jun 2016 Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah

Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort.

Click-Through Rate Prediction Feature Engineering +2
