no code implementations • 25 Jun 2024 • Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches to modeling query-document relevance in information retrieval.
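As a rough illustration of the two architectures (a toy sketch only; the hashed bag-of-words `embed` below is a stand-in for a learned encoder and is not a model from the paper), a dual-encoder embeds the query and document independently and compares the embeddings, while a cross-encoder scores the concatenated pair jointly:

```python
import numpy as np

def embed(text, dim=64):
    # Toy stand-in for a learned encoder: hashed bag-of-words into a dense unit vector.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def dual_encoder_score(query, doc):
    # DE: encode query and document separately, then take an inner product.
    return float(embed(query) @ embed(doc))

def cross_encoder_score(query, doc, dim=64):
    # CE: encode the concatenated pair jointly so query and document tokens can interact,
    # then apply a scoring head (here an untrained placeholder).
    joint = embed(query + " [SEP] " + doc, dim)
    head = np.ones(dim) / np.sqrt(dim)
    return float(joint @ head)

print(dual_encoder_score("federated optimization", "optimization for federated learning"))
print(cross_encoder_score("federated optimization", "optimization for federated learning"))
```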
no code implementations • 4 Jun 2024 • Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)".
no code implementations • 13 May 2023 • Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar
In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization.
1 code implementation • 1 Dec 2022 • Tian Li, Manzil Zaheer, Ken Ziyu Liu, Sashank J. Reddi, H. Brendan McMahan, Virginia Smith
Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training.
no code implementations • 8 Nov 2022 • Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabás Póczos
Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood.
no code implementations • 12 Oct 2022 • Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar
This paper studies a curious phenomenon in machine learning models with Transformer architectures: their activation maps are sparse.
1 code implementation • 12 Feb 2022 • Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith
Adaptive optimization methods have become the default solvers for many machine learning tasks.
no code implementations • 2 Feb 2022 • Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models.
no code implementations • 29 Sep 2021 • Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
Transformer-based models such as BERT have proven successful in information retrieval problems, which seek to identify relevant documents for a given query.
2 code implementations • 14 Jul 2021 • Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, Wennan Zhu
Federated learning and analytics are distributed approaches for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection.
no code implementations • AISTATS 2021 • Sashank J. Reddi, Rama Kumar Pasumarthi, Aditya Krishna Menon, Ankit Singh Rawat, Felix Yu, Seungyeon Kim, Andreas Veit, Sanjiv Kumar
Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings.
no code implementations • 13 Feb 2021 • Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou
Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset.
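A minimal sketch of the basic setup described here (the temperature, loss weighting, and toy logits are illustrative choices, not the paper's): the student is fit both to the ground-truth labels and to the teacher's softened class probabilities.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Weighted sum of cross-entropy on hard labels and KL to the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * hard + (1 - alpha) * (temperature ** 2) * kl))

student_logits = np.array([[1.0, 0.2, -0.5]])
teacher_logits = np.array([[2.0, 0.1, -1.0]])
print(distillation_loss(student_logits, teacher_logits, labels=np.array([0])))
```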
1 code implementation • 8 Aug 2020 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon.
no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
We establish sufficient conditions under which a sparse attention model can universally approximate any sequence-to-sequence function.
no code implementations • 21 May 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar
In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques.
1 code implementation • ICLR 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum.
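For reference, a minimal sketch of the standard clip-by-global-norm step that this optimisation view refers to (a generic implementation, not code from the paper):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, clipped)  # original global norm 13.0, rescaled to norm 5.0
```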
no code implementations • 23 Apr 2020 • Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar
Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge.
no code implementations • 20 Feb 2020 • Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabas Poczos
It often makes the convergence rate depend on the maximum Lipschitz constant of the gradients across the devices.
no code implementations • ICML 2020 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.
no code implementations • ICLR 2020 • Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
In this paper, we establish that Transformer models are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, which is quite surprising given the extent of parameter sharing in these models.
no code implementations • NeurIPS 2020 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models.
7 code implementations • ICML 2020 • Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
We obtain tight convergence rates for FedAvg and prove that it suffers from 'client drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.
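As context for the client-drift effect, here is a minimal FedAvg sketch on a toy one-dimensional problem (client objectives, step size, and number of local steps are made up for illustration; this is not the paper's code). With heterogeneous clients, the local updates pull each client toward its own optimum, so the averaged model settles away from the true global optimum:

```python
import numpy as np

# Toy heterogeneous clients: client i minimizes 0.5 * c_i * (x - a_i)^2.
curvatures = np.array([1.0, 1.0, 5.0])
optima = np.array([-2.0, 0.5, 4.0])
global_opt = np.sum(curvatures * optima) / np.sum(curvatures)  # minimizer of the averaged objective

def fedavg_round(x_global, local_steps=10, lr=0.1):
    client_models = []
    for c, a in zip(curvatures, optima):
        x = x_global
        for _ in range(local_steps):
            x -= lr * c * (x - a)          # local gradient step on the client's own objective
        client_models.append(x)
    return float(np.mean(client_models))   # server averages the locally updated models

x = 10.0
for _ in range(50):
    x = fedavg_round(x)
print(f"FedAvg fixed point ~ {x:.3f}, true global optimum = {global_opt:.3f}")
```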
1 code implementation • 20 Aug 2019 • Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar
Motivated by this, differentially private stochastic gradient descent (SGD) algorithms for training machine learning models have been proposed.
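A minimal sketch of the DP-SGD step such algorithms build on (clip each per-example gradient, sum, and add Gaussian noise; the clipping norm and noise multiplier below are placeholder values, and this is the generic baseline mechanism rather than the paper's specific algorithm):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One differentially private SGD step: clip each example's gradient, sum, add Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(per_example_grads)

params = np.zeros(3)
grads = [np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 4.0])]
print(dp_sgd_step(params, grads))
```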
3 code implementations • ICLR 2018 • Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on gradient updates scaled by square roots of exponential moving averages of squared past gradients.
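For concreteness, a sketch of this kind of scaling, together with the max-based second-moment correction popularized as AMSGrad (bias correction omitted for brevity; hyperparameters are common defaults, not values from the paper):

```python
import numpy as np

def adam_like_update(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, amsgrad=False):
    """One step of an Adam-style update; with amsgrad=True the second moment never decreases."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad                 # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2            # exponential moving average of squared gradients
    v_hat = np.maximum(v_hat, v) if amsgrad else v     # AMSGrad: keep the running maximum
    param = param - lr * m / (np.sqrt(v_hat) + eps)    # update scaled by sqrt of the second moment
    return param, (m, v, v_hat)

param = np.array([1.0])
state = (np.zeros(1), np.zeros(1), np.zeros(1))
for grad in [np.array([10.0]), np.array([0.1]), np.array([0.1])]:
    param, state = adam_like_update(param, grad, state, amsgrad=True)
print(param)
```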
no code implementations • 26 Jan 2019 • Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra
Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood.
no code implementations • 16 Oct 2018 • Sashank J. Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, Sanjiv Kumar
Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i.e., computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining.
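A rough sketch of the general idea named here (label-space size, sampling scheme, and hinge-style loss are illustrative assumptions, not the paper's exact procedure): sample a small set of negative labels, score only those plus the positive, and build the loss from the highest-scoring sampled negatives.

```python
import numpy as np

def negative_mining_loss(scores_fn, positive_label, num_labels, num_sampled=100,
                         num_hard=5, rng=None):
    """Sample negatives, keep the hardest (highest-scoring) ones, and form a hinge-style loss."""
    rng = rng or np.random.default_rng(0)
    negatives = rng.choice(num_labels, size=num_sampled, replace=False)
    negatives = negatives[negatives != positive_label]
    neg_scores = scores_fn(negatives)                  # only the sampled labels are ever scored
    hard = np.sort(neg_scores)[-num_hard:]             # hardest sampled negatives
    pos_score = scores_fn(np.array([positive_label]))[0]
    return float(np.mean(np.maximum(0.0, 1.0 + hard - pos_score)))

# Toy scorer over a large label space: dot product of fixed random label embeddings with a query.
rng = np.random.default_rng(1)
label_emb = rng.normal(size=(100_000, 16))
query_emb = rng.normal(size=16)
scores_fn = lambda labels: label_emb[labels] @ query_emb
print(negative_mining_loss(scores_fn, positive_label=42, num_labels=100_000))
```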
no code implementations • 5 Sep 2017 • Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J. Smola
A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points.
no code implementations • NeurIPS 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alexander J. Smola
We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonsmooth part is convex.
no code implementations • NeurIPS 2016 • Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabas Poczos, Alexander J. Smola, Eric P. Xing
In this paper, we present variance reduction techniques for stochastic gradient Langevin dynamics, yielding novel stochastic Monte Carlo methods that improve performance by reducing the variance of the stochastic gradient.
no code implementations • 24 Aug 2016 • Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, Alex Smola
It is well known that the DANE algorithm does not match the communication complexity lower bounds.
no code implementations • 27 Jul 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.
no code implementations • NeurIPS 2016 • Hongyi Zhang, Sashank J. Reddi, Suvrit Sra
We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds.
no code implementations • 23 May 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.
no code implementations • 19 Mar 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$.
no code implementations • 19 Mar 2016 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them.
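A minimal sketch of the SVRG gradient estimator in the finite-sum setting (a toy least-squares instance with illustrative step size and epoch counts; not the paper's code): a full gradient computed at a snapshot point is combined with a per-example correction, so the estimator's variance shrinks as the iterate approaches the snapshot.

```python
import numpy as np

# Finite-sum toy problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

def svrg(x, epochs=20, inner_steps=200, lr=0.02):
    for _ in range(epochs):
        snapshot, mu = x.copy(), full_grad(x)            # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snapshot, i) + mu  # variance-reduced gradient estimate
            x = x - lr * v
    return x

x = svrg(np.zeros(d))
print(np.linalg.norm(full_grad(x)))  # gradient norm after running SVRG
```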
no code implementations • 4 Aug 2015 • Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman
We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime.
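For reference, a small sketch of the Gaussian-kernel MMD statistic mentioned here (the standard unbiased estimator with a median-heuristic bandwidth; sample sizes, dimension, and the mean shift are illustrative choices, not the paper's experimental setup):

```python
import numpy as np

def gaussian_mmd2(X, Y, bandwidth=None):
    """Unbiased estimate of squared MMD between samples X and Y under a Gaussian kernel."""
    Z = np.vstack([X, Y])
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:  # median heuristic
        bandwidth = np.sqrt(np.median(sq_dists[sq_dists > 0]) / 2)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 50))
Y = rng.normal(0.2, 1.0, size=(200, 50))   # mean-shift alternative in 50 dimensions
print(gaussian_mmd2(X, Y))
```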
no code implementations • NeurIPS 2015 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola
We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
no code implementations • 23 Nov 2014 • Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman
The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textit{general} alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textit{mean-shift} alternatives).
no code implementations • 9 Jun 2014 • Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, Larry Wasserman
This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing.