no code implementations • ICML 2020 • Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.
no code implementations • ICML 2020 • Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, Ali Jadbabaie
Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions.
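For orientation, one informal way to write the notion just described (the paper's formal definition for nonsmooth $f$ is stated via generalized gradients, so treat this as a paraphrase rather than the exact statement):
\[
x \text{ is } (\delta,\epsilon)\text{-stationary} \quad\Longleftrightarrow\quad \exists\, y:\ \|x-y\|\le\delta \ \text{ and } \ \operatorname{dist}\bigl(0,\partial f(y)\bigr)\le\epsilon,
\]
which, for smooth $f$, recovers the usual requirement $\|\nabla f(y)\|\le\epsilon$ at a nearby point.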
no code implementations • 22 Oct 2024 • Xiang Cheng, Lawrence Carin, Suvrit Sra
We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition.
no code implementations • 8 Oct 2024 • Sanchayan Dutta, Suvrit Sra
We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combine past gradients.
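As a reminder of the method class named here (this sketch illustrates linear first-order methods only, not the Transformer construction in the paper), heavy-ball momentum can be unrolled into a linear combination of past gradients; a minimal numpy check on a toy quadratic:

```python
import numpy as np

# Heavy-ball momentum as a linear first-order method (LFOM):
# x_T = x_0 - sum_{s < T} gamma_s * g_s, a linear combination of all past
# gradients. Verified numerically on a toy quadratic.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                 # SPD quadratic f(x) = 0.5 x'Ax - b'x
b = rng.standard_normal(5)
grad = lambda x: A @ x - b

alpha, beta, T = 0.05, 0.9, 40
x0 = np.zeros(5)
x, x_prev, grads = x0.copy(), x0.copy(), []
for t in range(T):
    g = grad(x)
    grads.append(g)
    x, x_prev = x - alpha * g + beta * (x - x_prev), x   # momentum recursion

# Same final iterate, rebuilt purely from the stored gradients:
x_lfom = x0 - sum(alpha * (1 - beta ** (T - s)) / (1 - beta) * grads[s]
                  for s in range(T))
print(np.allclose(x, x_lfom))           # True
```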
no code implementations • 18 Jun 2024 • Guy Kornowski, Swati Padmanabhan, Kai Wang, Zhe Zhang, Suvrit Sra
For linear equality constraints, we attain $\epsilon$-stationarity in $\widetilde{O}(\epsilon^{-2})$ gradient oracle calls, which is nearly-optimal.
no code implementations • 22 May 2024 • Sanchayan Dutta, Xiang Cheng, Suvrit Sra
We develop new algorithms for Riemannian bilevel optimization.
no code implementations • 15 Feb 2024 • Xiang Cheng, Jingzhao Zhang, Suvrit Sra
We study the task of efficiently sampling from a Gibbs distribution $d\pi^* = e^{-h} \, d\mathrm{vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice.
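A minimal, assumption-heavy sketch of one such geometric Langevin step on the unit sphere (a toy manifold and toy potential chosen for illustration; step size, burn-in, and the paper's guarantees are all ignored):

```python
import numpy as np

# Geometric Langevin MCMC sketch targeting pi(x) ∝ exp(-h(x)) on the sphere
# S^{d-1}: take an exponential-map step along -grad h plus Gaussian tangent
# noise. Toy potential h(x) = <a, x>.
rng = np.random.default_rng(0)
d, eta, n_steps = 3, 0.01, 20000
a = np.array([3.0, 0.0, 0.0])

def exp_map(x, v):
    """Sphere exponential map: move from x along tangent vector v."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def riem_grad_h(x):
    """Riemannian gradient: project the Euclidean gradient onto the tangent space at x."""
    return a - (a @ x) * x

x, samples = np.array([0.0, 0.0, 1.0]), []
for _ in range(n_steps):
    z = rng.standard_normal(d)
    xi = z - (z @ x) * x                       # Gaussian noise in the tangent space
    x = exp_map(x, -eta * riem_grad_h(x) + np.sqrt(2 * eta) * xi)
    samples.append(x)

print(np.mean(samples, axis=0))                # mass concentrates near -a/||a||
```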
no code implementations • 11 Dec 2023 • Xiang Cheng, Yuxin Chen, Suvrit Sra
Many neural network architectures are known to be Turing complete, and can thus, in principle, implement arbitrary algorithms.
1 code implementation • 2 Oct 2023 • Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics.
no code implementations • 10 Jul 2023 • Adarsh Barik, Suvrit Sra, Jean Honorio
Invex programs are a special class of non-convex problems that attain global minima at every stationary point.
no code implementations • 25 May 2023 • Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra
Under this notion, we then analyze algorithms that find approximate flat minima efficiently.
no code implementations • 24 Feb 2023 • David X. Wu, Chulhee Yun, Suvrit Sra
We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence.
no code implementations • 30 Dec 2022 • Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra
We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system.
no code implementations • 22 Jun 2022 • Melanie Weber, Suvrit Sra
We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions.
no code implementations • 3 Apr 2022 • Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra
Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$.
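For context, the $2/L$ threshold comes from the standard descent lemma: for $L$-smooth $f$ and a gradient step $x^+ = x - \eta \nabla f(x)$,
\[
f(x^+) \;\le\; f(x) + \langle \nabla f(x), x^+ - x\rangle + \tfrac{L}{2}\|x^+ - x\|^2
\;=\; f(x) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\|\nabla f(x)\|^2,
\]
so the bound guarantees decrease exactly when $\eta < 2/L$.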
2 code implementations • 25 Feb 2022 • Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, Stefanie Jegelka
We introduce SignNet and BasisNet -- new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if $v$ is an eigenvector then so is $-v$; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors.
Ranked #12 on Graph Regression on ZINC-500k
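A minimal numpy sketch of the sign-invariance mechanism described above: apply a network of the form $\rho(\phi(v) + \phi(-v))$ to an eigenvector, which by construction cannot distinguish $v$ from $-v$. The tiny MLPs $\phi$ and $\rho$ below are toy stand-ins, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(din, dh, dout):
    """Random parameters for a tiny two-layer MLP."""
    return (rng.standard_normal((dh, din)), rng.standard_normal(dh),
            rng.standard_normal((dout, dh)), rng.standard_normal(dout))

def mlp(params, x):
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ x + b1) + b2

phi = init(8, 16, 16)    # applied to an eigenvector and to its negation
rho = init(16, 16, 4)    # applied to the symmetrized embedding

def sign_invariant_embed(v):
    # phi(v) + phi(-v) is unchanged under v -> -v, hence so is rho(...)
    return mlp(rho, mlp(phi, v) + mlp(phi, -v))

v = rng.standard_normal(8)
print(np.allclose(sign_invariant_embed(v), sign_invariant_embed(-v)))   # True
```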
no code implementations • 13 Feb 2022 • Peiyuan Zhang, Jingzhao Zhang, Suvrit Sra
Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable.
no code implementations • 29 Dec 2021 • Ali Jadbabaie, Horia Mania, Devavrat Shah, Suvrit Sra
We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system.
1 code implementation • 21 Dec 2021 • Anshul Shah, Suvrit Sra, Rama Chellappa, Anoop Cherian
Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence.
Ranked #116 on Self-Supervised Image Classification on ImageNet
no code implementations • 4 Nov 2021 • Jikai Jin, Suvrit Sra
We contribute to advancing the understanding of Riemannian accelerated gradient methods.
no code implementations • ICLR 2022 • Chulhee Yun, Shashank Rajput, Suvrit Sra
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods.
no code implementations • 12 Oct 2021 • Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks.
1 code implementation • NeurIPS 2021 • Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, Suvrit Sra
However, we observe that the contrastive loss does not always sufficiently guide which features are extracted, a behavior that can negatively impact the performance on downstream tasks via "shortcuts", i.e., by inadvertently suppressing important predictive features.
no code implementations • 12 Mar 2021 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality, supplementing it with another inequality that accounts for single-shuffle, a widely used without-replacement sampling scheme that shuffles only once at the beginning and is overlooked by the Recht-Ré conjecture.
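As a reminder, in the form the conjecture is usually quoted (see Recht and Ré (2012) for the precise statement): for positive semidefinite matrices $A_1,\dots,A_n$ and $1 \le k \le n$, averages of without-replacement products are dominated in operator norm by their with-replacement counterparts,
\[
\Bigl\|\tfrac{(n-k)!}{n!}\sum_{\substack{j_1,\dots,j_k\\ \text{distinct}}} A_{j_1}\cdots A_{j_k}\Bigr\|
\;\le\;
\Bigl\|\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} A_i\Bigr)^{k}\Bigr\|.
\]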
no code implementations • 5 Feb 2021 • Tiancheng Yu, Yi Tian, Jingzhao Zhang, Suvrit Sra
To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games and our theoretical guarantees are near-optimal.
no code implementations • 1 Jan 2021 • Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie
In particular, standard results on optimal convergence rates for stochastic optimization assume either that there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses.
no code implementations • 31 Dec 2020 • Horia Mania, Suvrit Sra
Recent studies of generalization in deep learning have observed a puzzling trend: accuracies of models on one data distribution are approximately linear functions of the accuracies on another distribution.
no code implementations • 28 Oct 2020 • Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra
We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable.
1 code implementation • ICLR 2021 • Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra
The label shift problem refers to the supervised learning setting where the train and test label distributions do not match.
1 code implementation • ICLR 2021 • Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, Stefanie Jegelka
How can you sample good negative examples for contrastive learning?
no code implementations • NeurIPS 2020 • Yi Tian, Jian Qian, Suvrit Sra
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components.
no code implementations • NeurIPS 2020 • Kwangjun Ahn, Chulhee Yun, Suvrit Sra
We study without-replacement SGD for solving finite-sum optimization problems.
no code implementations • 8 Jun 2020 • Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie
We study oracle complexity of gradient based methods for stochastic approximation problems.
no code implementations • 17 May 2020 • Kwangjun Ahn, Suvrit Sra
The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms.
no code implementations • 18 Apr 2020 • Kwangjun Ahn, Suvrit Sra
For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD.
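A toy comparison of the two sampling schemes on a least-squares finite sum (illustrative only; the problem, step size, and epoch budget below are arbitrary choices, and the paper's guarantees are not reproduced here):

```python
import numpy as np

# Finite-sum objective f(x) = (1/n) sum_i 0.5 * (a_i' x - b_i)^2, optimized
# with one pass per epoch: either i.i.d. with-replacement sampling (vanilla
# SGD) or a fresh permutation per epoch (without-replacement sampling).
rng = np.random.default_rng(0)
n, d, lr, epochs = 200, 10, 0.05, 50
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

def run(without_replacement):
    x = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n) if without_replacement else rng.integers(0, n, n)
        for i in idx:
            x -= lr * (A[i] @ x - b[i]) * A[i]
    return np.linalg.norm(x - x_star)

print("with replacement   :", run(False))
print("without replacement:", run(True))
```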
no code implementations • ICML 2020 • Joshua Robinson, Stefanie Jegelka, Suvrit Sra
Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.
no code implementations • 10 Feb 2020 • Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, Suvrit Sra
In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds.
no code implementations • 24 Jan 2020 • Kwangjun Ahn, Suvrit Sra
We control this distortion by developing a novel geometric inequality, which permits us to propose and analyze a Riemannian counterpart to Nesterov's accelerated gradient method.
no code implementations • NeurIPS 2020 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models.
no code implementations • 3 Dec 2019 • Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.
no code implementations • 9 Oct 2019 • Melanie Weber, Suvrit Sra
We present algorithms for both purely stochastic optimization and finite-sum problems.
no code implementations • 25 Sep 2019 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models.
no code implementations • 22 Jul 2019 • Tiancheng Yu, Suvrit Sra
A Markov Decision Process (MDP) is a popular model for reinforcement learning.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor.
no code implementations • 26 Jun 2019 • Tiancheng Yu, Xiyu Zhai, Suvrit Sra
The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels.
1 code implementation • NeurIPS 2019 • Joshua Robinson, Suvrit Sra, Stefanie Jegelka
We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance.
1 code implementation • ICLR 2020 • Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie
We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks.
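For reference, the clipped update analyzed in this line of work, in its common global-norm form (the one-dimensional objective below is purely illustrative):

```python
import numpy as np

def clipped_gd_step(x, grad, lr=0.1, gamma=1.0):
    """One gradient step whose effective gradient norm is capped at gamma."""
    g = grad(x)
    scale = min(1.0, gamma / (np.linalg.norm(g) + 1e-12))
    return x - lr * scale * g

# Steep scalar objective f(x) = x**4: plain GD with lr=0.1 from x=3 diverges,
# while the clipped update makes steady progress toward 0.
grad = lambda x: 4 * x ** 3
x = np.array([3.0])
for _ in range(100):
    x = clipped_gd_step(x, grad)
print(x)
```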
no code implementations • 26 Jan 2019 • Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra
Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood.
no code implementations • 7 Dec 2018 • Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra
On the other hand, deep-RBF networks assign high confidence only to the regions containing enough feature points, but they have been discounted due to the widely held belief that they suffer from the vanishing gradient problem.
no code implementations • NeurIPS 2018 • Zelda E. Mariet, Suvrit Sra, Stefanie Jegelka
Strongly Rayleigh (SR) measures are discrete probability distributions over the subsets of a ground set.
no code implementations • 10 Nov 2018 • Jingzhao Zhang, Hongyi Zhang, Suvrit Sra
We study smooth stochastic optimization problems on Riemannian manifolds.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity.
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast.
no code implementations • 26 Jun 2018 • Jeff Z. HaoChen, Suvrit Sra
We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs, RandomShuffle indeed converges faster than SGD.
no code implementations • 7 Jun 2018 • Hongyi Zhang, Suvrit Sra
We propose a Riemannian version of Nesterov's Accelerated Gradient algorithm (RAGD), and show that, for geodesically smooth and strongly convex problems, RAGD converges to the minimizer with acceleration within a neighborhood of the minimizer whose radius depends on the condition number as well as the sectional curvature of the manifold.
no code implementations • NeurIPS 2018 • Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie
We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method.
no code implementations • CVPR 2018 • Anoop Cherian, Suvrit Sra, Stephen Gould, Richard Hartley
As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of the data onto which capture their temporal order.
no code implementations • 15 Feb 2018 • Zelda Mariet, Mike Gartrell, Suvrit Sra
To address this issue, which reduces the quality of the learned model, we introduce a novel optimization problem, Contrastive Estimation (CE), which encodes information about "negative" samples into the basic learning model.
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust.
1 code implementation • 30 Oct 2017 • Melanie Weber, Suvrit Sra
Both tasks involve geodesically convex interval constraints, for which we show that the Riemannian "linear" oracle required by RFW admits a closed-form solution; this result may be of independent interest.
no code implementations • 5 Sep 2017 • Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J. Smola
A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points.
no code implementations • ICLR 2018 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We study the error landscape of deep linear and nonlinear neural networks with the squared error loss.
1 code implementation • ICLR 2018 • Chengtao Li, David Alvarez-Melis, Keyulu Xu, Stefanie Jegelka, Suvrit Sra
We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination.
1 code implementation • 10 Jun 2017 • Reshad Hosseini, Suvrit Sra
This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization.
no code implementations • 24 May 2017 • Anoop Cherian, Suvrit Sra, Richard Hartley
As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in an RKHS, projections of the data onto which capture their temporal order.
no code implementations • NeurIPS 2017 • Chengtao Li, Stefanie Jegelka, Suvrit Sra
We study dual volume sampling, a method for selecting $k$ columns from a short, wide $n \times m$ matrix ($n \le k \le m$) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix.
no code implementations • NeurIPS 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alexander J. Smola
We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonsmooth part is convex.
no code implementations • NeurIPS 2016 • Chengtao Li, Stefanie Jegelka, Suvrit Sra
We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them.
no code implementations • 27 Jul 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.
1 code implementation • 18 Jul 2016 • Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra
We revisit the task of learning a Euclidean metric from data.
no code implementations • 13 Jul 2016 • Chengtao Li, Stefanie Jegelka, Suvrit Sra
In this note we consider sampling from (non-homogeneous) strongly Rayleigh probability measures.
4 code implementations • NeurIPS 2016 • Zelda Mariet, Suvrit Sra
Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of $N$ items.
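For readers unfamiliar with DPPs, the standard L-ensemble form (a textbook fact, not specific to this paper): given a positive semidefinite kernel $L \in \mathbb{R}^{N \times N}$, a subset $S$ is drawn with probability
\[
P(Y = S) \;=\; \frac{\det(L_S)}{\det(L + I)},
\]
where $L_S$ is the principal submatrix indexed by $S$; the determinant rewards diverse (near-orthogonal) items.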
no code implementations • 23 May 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.
no code implementations • NeurIPS 2016 • Hongyi Zhang, Sashank J. Reddi, Suvrit Sra
We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds.
2 code implementations • 1 May 2016 • Suvrit Sra
The modern data analyst must cope with data encoded in various forms: vectors, matrices, strings, graphs, and more.
no code implementations • 7 Apr 2016 • Ke Jiang, Suvrit Sra, Brian Kulis
Topic models have emerged as fundamental tools in unsupervised machine learning.
no code implementations • 19 Mar 2016 • Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola
We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$.
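A generic incremental aggregated gradient (SAGA-style) template for this problem class, shown on a least-squares toy instance; this sketches the method family, not the paper's exact algorithm, and the step size below is an ad hoc choice.

```python
import numpy as np

# min_x sum_i f_i(x) with f_i(x) = 0.5 * (a_i' x - b_i)^2. Keep a table of the
# most recent gradient for each f_i and step along a variance-reduced estimate.
rng = np.random.default_rng(0)
n, d, lr, iters = 100, 5, 0.01, 10000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
table = np.array([grad_i(x, i) for i in range(n)])   # stored gradients
table_mean = table.mean(axis=0)

for _ in range(iters):
    i = rng.integers(n)
    g_new = grad_i(x, i)
    x -= lr * (g_new - table[i] + table_mean)        # unbiased for the averaged gradient
    table_mean += (g_new - table[i]) / n             # maintain the running mean
    table[i] = g_new

print(np.linalg.norm(A.T @ (A @ x - b)))             # full gradient norm, small near the minimizer
```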
no code implementations • 19 Mar 2016 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them.
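For reference, the SVRG estimator itself, shown on a convex least-squares toy problem (the paper's contribution is the nonconvex analysis, which this sketch does not reproduce):

```python
import numpy as np

# SVRG: recompute the full gradient at a snapshot, then take cheap inner steps
# with the variance-reduced direction grad_i(x) - grad_i(snapshot) + full_grad.
rng = np.random.default_rng(0)
n, d, lr = 100, 5, 0.02
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

x_tilde = np.zeros(d)
for epoch in range(30):
    mu = full_grad(x_tilde)                       # full gradient at the snapshot
    x = x_tilde.copy()
    for _ in range(2 * n):                        # inner stochastic steps
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(x_tilde, i) + mu
        x -= lr * v
    x_tilde = x                                   # new snapshot

print(np.linalg.norm(full_grad(x_tilde)))         # small near the minimizer
```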
no code implementations • 19 Mar 2016 • Chengtao Li, Stefanie Jegelka, Suvrit Sra
Its theoretical guarantees and empirical performance rely critically on the quality of the landmarks selected.
no code implementations • 19 Feb 2016 • Hongyi Zhang, Suvrit Sra
Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces.
no code implementations • 7 Dec 2015 • Chengtao Li, Suvrit Sra, Stefanie Jegelka
We present a framework for accelerating a spectrum of machine learning algorithms that require computation of bilinear inverse forms $u^\top A^{-1}u$, where $A$ is a positive definite matrix and $u$ a given vector.
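The quantity in question, computed the slow direct way via a Cholesky solve (the paper's framework is about avoiding exactly this kind of solve, e.g., via Lanczos/Gauss-quadrature-style approximations, which are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # positive definite
u = rng.standard_normal(d)

L = np.linalg.cholesky(A)            # A = L L'
y = np.linalg.solve(L, u)            # forward solve: L y = u
value = y @ y                        # u' A^{-1} u = ||L^{-1} u||^2
print(value, u @ np.linalg.solve(A, u))   # both expressions agree
```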
no code implementations • NeurIPS 2015 • Reshad Hosseini, Suvrit Sra
We take a new look at parameter estimation for Gaussian Mixture Models (GMMs).
2 code implementations • 16 Nov 2015 • Zelda Mariet, Suvrit Sra
We introduce Divnet, a flexible technique for learning networks with diverse neurons.
no code implementations • 4 Sep 2015 • Chengtao Li, Stefanie Jegelka, Suvrit Sra
Our method takes advantage of the diversity property of subsets sampled from a DPP, and proceeds in two stages: first it constructs coresets for the ground set of items; thereafter, it efficiently samples subsets based on the constructed coresets.
no code implementations • 20 Aug 2015 • Suvrit Sra, Adams Wei Yu, Mu Li, Alexander J. Smola
We study distributed stochastic convex optimization under the delayed gradient model where the server nodes perform parameter updates, while the worker nodes compute stochastic gradients.
no code implementations • 4 Aug 2015 • Zelda Mariet, Suvrit Sra
Determinantal point processes (DPPs) offer an elegant tool for encoding probabilities over subsets of a ground set.
no code implementations • 10 Jul 2015 • Anoop Cherian, Suvrit Sra
Inspired by the great success of dictionary learning and sparse coding for vector-valued data, our goal in this paper is to represent data in the form of SPD matrices as sparse conic combinations of SPD atoms from a learned dictionary via a Riemannian geometric approach.
no code implementations • 25 Jun 2015 • Reshad Hosseini, Suvrit Sra
We take a new look at parameter estimation for Gaussian Mixture Models (GMMs).
no code implementations • NeurIPS 2015 • Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola
We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
no code implementations • 5 Mar 2015 • K. S. Sesh Kumar, Alvaro Barbero, Stefanie Jegelka, Suvrit Sra, Francis Bach
By exploiting results from convex and submodular theory, we reformulate the quadratic energy minimization problem as a total variation denoising problem, which, when viewed geometrically, enables the use of projection and reflection based convex methods.
no code implementations • NeurIPS 2014 • Adams Wei Yu, Wanli Ma, YaoLiang Yu, Jaime Carbonell, Suvrit Sra
We study the problem of finding structured low-rank matrices using nuclear norm regularization where the structure is encoded by a linear map.
3 code implementations • 3 Nov 2014 • Álvaro Barbero, Suvrit Sra
We study \emph{TV regularization}, a widely used technique for eliciting structured sparsity.
Ranked #1 on Microarray Classification on ArrayCGH
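A slow but self-contained baseline for the 1D TV proximity operator at the heart of TV regularization, solved here by projected gradient on its dual (the paper develops far faster, specialized solvers; this sketch only makes the problem concrete):

```python
import numpy as np

def tv1d_prox(y, lam, n_iter=5000):
    """min_x 0.5*||x - y||^2 + lam * sum_i |x[i+1] - x[i]| via the dual problem."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)           # (n-1) x n finite-difference operator
    u = np.zeros(n - 1)                      # dual variable, constrained to |u_i| <= lam
    step = 0.25                              # safe step: ||D D'|| <= 4 for 1D differences
    for _ in range(n_iter):
        u = np.clip(u - step * D @ (D.T @ u - y), -lam, lam)
    return y - D.T @ u                       # recover the primal solution

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(20), np.ones(20)]) + 0.1 * rng.standard_normal(40)
print(np.round(tv1d_prox(y, lam=0.5), 2))    # nearly piecewise-constant output
```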
no code implementations • 17 Oct 2014 • Reshad Hosseini, Suvrit Sra, Lucas Theis, Matthias Bethge
We study modeling and inference with the Elliptical Gamma Distribution (EGD).
no code implementations • 22 Sep 2014 • Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, Eric P. Xing
We develop parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework.
no code implementations • 9 Sep 2014 • Sashank Reddi, Ahmed Hefny, Carlton Downey, Avinava Dubey, Suvrit Sra
We develop randomized (block) coordinate descent (CD) methods for linearly constrained convex optimization.
no code implementations • 1 Feb 2014 • David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, Bernhard Schölkopf
Although nonlinear variants of PCA and CCA have been proposed, these are computationally prohibitive in the large scale.
no code implementations • NeurIPS 2013 • Suvrit Sra, Reshad Hosseini
We exploit the remarkable structure of the convex cone of positive definite matrices which allows one to uncover hidden geodesic convexity of objective functions that are nonconvex in the ordinary Euclidean sense.
no code implementations • 29 Nov 2013 • Mikhail Langovoy, Suvrit Sra
Large graphs abound in machine learning, data mining, and several related areas.
no code implementations • NeurIPS 2013 • Stefanie Jegelka, Francis Bach, Suvrit Sra
A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution.
no code implementations • NeurIPS 2012 • Suvrit Sra
To our knowledge, our framework is the first to develop and analyze incremental \emph{nonconvex} proximal-splitting algorithms, even if we disregard the ability to handle nonvanishing errors.
no code implementations • NeurIPS 2012 • Suvrit Sra
Symmetric positive definite (spd) matrices are remarkably pervasive in a multitude of scientific disciplines, including machine learning and optimization.
no code implementations • 8 Oct 2011 • Suvrit Sra
Positive definite matrices abound in a dazzling variety of applications.