You need to log in to edit.

You can create a new account if you don't have one.

Or, discuss a change on Slack.

You can create a new account if you don't have one.

Or, discuss a change on Slack.

1 code implementation • 12 Feb 2022 • Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith

Adaptive optimization methods have become the default solvers for many machine learning tasks.

no code implementations • 2 Feb 2022 • Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models.

no code implementations • 29 Sep 2021 • Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Transformer-based models such as BERT have proven successful in information retrieval problem, which seek to identify relevant documents for a given query.

2 code implementations • 14 Jul 2021 • Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, Wennan Zhu

Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection.

no code implementations • AISTATS 2021 • Sashank J. Reddi, Rama Kumar Pasumarthi, Aditya Krishna Menon, Ankit Singh Rawat Felix Yu, Seungyeon Kim, Andreas Veit, Sanjiv Kumar

Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings.

no code implementations • 13 Feb 2021 • Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset.

1 code implementation • 8 Aug 2020 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon.

no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.

no code implementations • 21 May 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques.

1 code implementation • ICLR 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum.

no code implementations • 23 Apr 2020 • Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar

Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e. g., users of a retrieval system may be from different countries), each of which poses a challenge.

no code implementations • 20 Feb 2020 • Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabas Poczos

It often leads to the dependence of convergence rate on maximum Lipschitz constant of gradients across the devices.

no code implementations • ICML 2020 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Attention based Transformer architecture has enabled significant advances in the field of natural language processing.

no code implementations • ICLR 2020 • Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models.

no code implementations • NeurIPS 2020 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra

While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models.

4 code implementations • ICML 2020 • Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

We obtain tight convergence rates for FedAvg and prove that it suffers from `client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.

1 code implementation • 20 Aug 2019 • Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar

Motivated by this, differentially private stochastic gradient descent (SGD) algorithms for training machine learning models have been proposed.

2 code implementations • ICLR 2018 • Sashank J. Reddi, Satyen Kale, Sanjiv Kumar

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients.

no code implementations • 26 Jan 2019 • Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood.

no code implementations • 16 Oct 2018 • Sashank J. Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, Sanjiv Kumar

Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i. e. computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining.

no code implementations • 5 Sep 2017 • Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J. Smola

A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points.

no code implementations • NeurIPS 2016 • Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabas Poczos, Alexander J. Smola, Eric P. Xing

In this paper, we present techniques for reducing variance in stochastic gradient Langevin dynamics, yielding novel stochastic Monte Carlo methods that improve performance by reducing the variance in the stochastic gradient.

no code implementations • NeurIPS 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alexander J. Smola

We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonsmooth part is convex.

no code implementations • 24 Aug 2016 • Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, Alex Smola

It is well known that DANE algorithm does not match the communication complexity lower bounds.

no code implementations • 27 Jul 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.

no code implementations • NeurIPS 2016 • Hongyi Zhang, Sashank J. Reddi, Suvrit Sra

We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds.

no code implementations • 23 May 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.

no code implementations • 19 Mar 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$.

no code implementations • 19 Mar 2016 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them.

no code implementations • 4 Aug 2015 • Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime.

no code implementations • NeurIPS 2015 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola

We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.

no code implementations • 23 Nov 2014 • Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textit{general} alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textit{mean-shift} alternatives).

no code implementations • 9 Jun 2014 • Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, Larry Wasserman

This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing.

Cannot find the paper you are looking for? You can
Submit a new open access paper.

Contact us on:
hello@paperswithcode.com
.
Papers With Code is a free resource with all data licensed under CC-BY-SA.