no code implementations • 21 Feb 2016 • Qingqing Huang, Sham M. Kakade, Weihao Kong, Gregory Valiant
When can accurate reconstruction be accomplished in the sparse data regime?
no code implementations • ICML 2018 • Maryam Fazel, Rong Ge, Sham M. Kakade, Mehran Mesbahi
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; 3) they inherently allow for richly parameterized policies.
no code implementations • 25 Oct 2017 • Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, Aaron Sidford
This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares.
no code implementations • 26 Apr 2017 • Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford
There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g., Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014.
no code implementations • ICML 2017 • Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free").
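As a rough illustration of why a perturbation helps, the sketch below compares plain gradient descent with a simplified perturbed variant on a toy objective with a strict saddle. The objective, the function names, and all thresholds are assumptions chosen for the example; this is not the paper's algorithm, which uses a more careful perturbation and termination rule.

```python
import math
import random


def grad(x, y):
    # Gradient of the toy objective f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2,
    # which has a strict saddle at (0, 0) and minima at (0, 1) and (0, -1).
    return x, y ** 3 - y


def gradient_descent(x, y, lr=0.1, steps=1000, perturb=False,
                     g_thresh=1e-3, radius=0.01, wait=50, seed=0):
    rng = random.Random(seed)
    last_perturb = -wait  # allow a perturbation as soon as the gradient is small
    for t in range(steps):
        gx, gy = grad(x, y)
        if perturb and math.hypot(gx, gy) < g_thresh and t - last_perturb >= wait:
            # Small gradient: add a small random kick (uniform in a box here,
            # a simplification of sampling from a ball).
            x += rng.uniform(-radius, radius)
            y += rng.uniform(-radius, radius)
            last_perturb = t
            gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y


plain = gradient_descent(1.0, 0.0)                 # stays on y = 0, ends at the saddle
kicked = gradient_descent(1.0, 0.0, perturb=True)  # escapes toward y = +1 or y = -1
```

Started on the line y = 0, the unperturbed run never leaves it, because the y-component of the gradient is exactly zero there; the perturbed run picks up a small y-component that then grows until it settles near one of the two minima.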
no code implementations • 1 Dec 2016 • Corinne L. Jones, Sham M. Kakade, Lucas W. Thornblade, David R. Flum, Abraham D. Flaxman
We propose using canonical correlation analysis (CCA) to generate features from sequences of medical billing codes.
no code implementations • 29 Oct 2015 • Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford
Combining our algorithm with previous work to initialize $x_0$, we obtain a number of improved sample complexity and runtime results.
no code implementations • 13 Apr 2016 • Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford
Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor.
no code implementations • NeurIPS 2016 • Chi Jin, Sham M. Kakade, Praneeth Netrapalli
While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting.
no code implementations • 26 May 2016 • Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford
We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix $\Sigma$, i.e., computing a unit vector $x$ such that $x^T \Sigma x \ge (1-\epsilon)\lambda_1(\Sigma)$. Offline Eigenvector Estimation: Given an explicit $A \in \mathbb{R}^{n \times d}$ with $\Sigma = A^TA$, we show how to compute an $\epsilon$-approximate top eigenvector in time $\tilde O\left(\left[\mathrm{nnz}(A) + \frac{d \cdot \mathrm{sr}(A)}{\mathrm{gap}^2}\right] \cdot \log 1/\epsilon\right)$ and $\tilde O\left(\left[\frac{\mathrm{nnz}(A)^{3/4} (d \cdot \mathrm{sr}(A))^{1/4}}{\sqrt{\mathrm{gap}}}\right] \cdot \log 1/\epsilon\right)$.
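To make the approximation criterion $x^T \Sigma x \ge (1-\epsilon)\lambda_1(\Sigma)$ concrete, the sketch below checks it on a tiny example. It uses plain power iteration as a stand-in, not the faster method the abstract refers to; the matrix `A`, the helper names, and the value of `eps` are assumptions made for illustration.

```python
import math


def top_eigenvector(sigma, iters=50):
    # Plain power iteration: repeatedly apply sigma and renormalize.
    d = len(sigma)
    # Assumes this start vector is not orthogonal to the top eigenvector.
    x = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        y = [sum(sigma[i][j] * x[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


def rayleigh(sigma, x):
    # Rayleigh quotient x^T sigma x for a unit vector x.
    d = len(sigma)
    return sum(x[i] * sigma[i][j] * x[j] for i in range(d) for j in range(d))


A = [[1.0, 2.0], [3.0, 4.0]]
# Sigma = A^T A
sigma = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
x = top_eigenvector(sigma)
lam1 = 15.0 + math.sqrt(221.0)  # closed-form top eigenvalue of this 2x2 Sigma
eps = 1e-6
is_eps_approx = rayleigh(sigma, x) >= (1 - eps) * lam1
```

Power iteration converges at a rate governed by the ratio of the second to the first eigenvalue, which is why running times for this problem are usually stated in terms of a gap parameter.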
no code implementations • 22 Feb 2016 • Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford
This work provides improved guarantees for streaming principal component analysis (PCA).
no code implementations • NeurIPS 2015 • Qingqing Huang, Sham M. Kakade
The number of measurements taken by and the computational complexity of our algorithm are bounded by a polynomial in both the number of points $k$ and the dimension $d$, with no dependence on the separation $\Delta$.
no code implementations • 24 Jun 2015 • Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford
We develop a family of accelerated stochastic algorithms that minimize sums of convex functions.
no code implementations • 31 Oct 2009 • Sham M. Kakade, Ohad Shamir, Karthik Sridharan, Ambuj Tewari
The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model.
no code implementations • 2 Mar 2015 • Rong Ge, Qingqing Huang, Sham M. Kakade
Unfortunately, learning a mixture of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case.
no code implementations • 20 Dec 2014 • Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford
In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties.
no code implementations • 29 Oct 2012 • Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order).
no code implementations • 13 Jun 2011 • Daniel Hsu, Sham M. Kakade, Tong Zhang
The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting.
no code implementations • 12 Feb 2013 • Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade
We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method.
no code implementations • 7 Oct 2013 • Alekh Agarwal, Sham M. Kakade, Nikos Karampatziakis, Le Song, Gregory Valiant
This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large.
no code implementations • 4 May 2011 • Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a Principal Component Analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace.
no code implementations • 24 Sep 2012 • Animashree Anandkumar, Daniel Hsu, Adel Javanmard, Sham M. Kakade
The sufficient conditions for identifiability of these models are primarily based on weak expansion constraints on the topic-word matrix, for topic models, and on the directed acyclic graph, for Bayesian networks.
no code implementations • 20 Nov 2018 • John Thickstun, Zaid Harchaoui, Dean P. Foster, Sham M. Kakade
This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music.
no code implementations • NeurIPS 2018 • Sham M. Kakade, Jason D. Lee
The Cheap Gradient Principle (Griewank and Walther, 2008) --- the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures.
no code implementations • NeurIPS 2015 • Kamalika Chaudhuri, Sham M. Kakade, Praneeth Netrapalli, Sujay Sanghavi
Provided certain conditions hold on the model class, we provide a two-stage active learning algorithm for this problem.
no code implementations • NeurIPS 2012 • Anima Anandkumar, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Yi-Kai Liu
This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA).
no code implementations • NeurIPS 2012 • Anima Anandkumar, Daniel J. Hsu, Furong Huang, Sham M. Kakade
We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables.
no code implementations • NeurIPS 2012 • Daniel J. Hsu, Sham M. Kakade, Percy S. Liang
This paper explores unsupervised learning of parsing models along two directions.
no code implementations • NeurIPS 2011 • Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai
In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient.
no code implementations • NeurIPS 2011 • Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $X$ under a stochastic bandit feedback model.
no code implementations • NeurIPS 2011 • Animashree Anandkumar, Kamalika Chaudhuri, Daniel J. Hsu, Sham M. Kakade, Le Song, Tong Zhang
The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables).
no code implementations • NeurIPS 2008 • Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
We provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes.
no code implementations • NeurIPS 2008 • Sham M. Kakade, Ambuj Tewari
This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex.
no code implementations • NeurIPS 2008 • Shai Shalev-Shwartz, Sham M. Kakade
We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms.
no code implementations • ICLR 2019 • Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli
One plausible explanation is that non-convex neural network training procedures are better suited to the use of fundamentally different learning rate schedules, such as the "cut the learning rate every constant number of epochs" method (which more closely resembles an exponentially decaying learning rate schedule); note that this widely used schedule is in stark contrast to the polynomial decay schemes prescribed in the stochastic approximation literature, which are indeed shown to be (worst case) optimal for classes of convex optimization problems.
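The two families of schedules contrasted here can be sketched side by side. The function names, the drop interval, the decay factor, and the exact polynomial form are illustrative assumptions, not values taken from the paper.

```python
def step_decay(lr0, epoch, drop_every=10, factor=0.5):
    # "Cut the learning rate every constant number of epochs":
    # multiply by `factor` once per `drop_every` epochs (geometric decay).
    return lr0 * factor ** (epoch // drop_every)


def polynomial_decay(lr0, t, power=1.0):
    # Polynomially decaying step size of the form lr0 / (1 + t)^power.
    return lr0 / (1.0 + t) ** power


for epoch in (0, 10, 25, 50):
    print(epoch, step_decay(0.1, epoch), polynomial_decay(0.1, epoch))
```

The step schedule holds the rate constant within each block of epochs and shrinks it geometrically across blocks, whereas the polynomial schedule shrinks it a little at every step.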
no code implementations • ICLR 2018 • Maryam Fazel, Rong Ge, Sham M. Kakade, Mehran Mesbahi
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; 3) they inherently allow for richly parameterized policies.
no code implementations • 11 Feb 2019 • Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan
In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.
no code implementations • 12 Feb 2019 • Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, Sham M. Kakade
Precisely, for sufficiently large $N$, the MLE achieves the information theoretic optimal error bound of $\mathcal{O}(\frac{1}{t})$ for $t < c\log{N}$, with respect to the earth mover's distance (between the estimated and true distributions).
no code implementations • 13 Feb 2019 • Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan
More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial.
no code implementations • 23 Feb 2019 • Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, Karan Singh
We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise).
no code implementations • ICML 2020 • Mark Braverman, Xinyi Chen, Sham M. Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang
Building accurate language models that capture meaningful long-term dependencies is a core challenge in natural language processing.
no code implementations • 1 Aug 2019 • Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces.
no code implementations • ICLR 2020 • Simon S. Du, Sham M. Kakade, Ruosong Wang, Lin F. Yang
With regard to the statistical viewpoint, this question is largely unexplored, and the extant body of literature mainly focuses on conditions which permit sample efficient reinforcement learning, with little understanding of what conditions are necessary for efficient reinforcement learning.
no code implementations • 27 Nov 2019 • Elad Hazan, Sham M. Kakade, Karan Singh
We consider the problem of controlling an unknown linear dynamical system in the presence of (nonstochastic) adversarial perturbations and adversarial convex loss functions.
no code implementations • 28 Nov 2019 • Ramya Korlakai Vinayak, Weihao Kong, Sham M. Kakade
Provided these paired observations, $\{(X_i, Y_i) \}_{i=1}^N$, our goal is to accurately estimate the distribution of the change in parameters, $\delta_i := q_i - p_i$, over the population and properties of interest like the $\ell_1$-magnitude of the change with sparse observations ($t\ll N$).
no code implementations • ICLR 2021 • Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei
First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation.
no code implementations • 1 May 2020 • Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade
Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class.
no code implementations • NeurIPS 2020 • Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu
Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration.
no code implementations • NeurIPS 2020 • Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang
This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence.
no code implementations • ICLR 2021 • Ruosong Wang, Dean Foster, Sham M. Kakade
Function approximation methods coupled with batch reinforcement learning (or off-policy reinforcement learning) are providing an increasingly important framework to help alleviate the excessive sample complexity burden in modern reinforcement learning problems.
no code implementations • 22 Oct 2020 • Ruosong Wang, Dean P. Foster, Sham M. Kakade
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies.
no code implementations • 26 Nov 2008 • Daniel Hsu, Sham M. Kakade, Tong Zhang
Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series.
no code implementations • 8 Mar 2021 • Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham M. Kakade
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated.
no code implementations • 19 Mar 2021 • Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang
The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the optimal $Q$-function and the optimal $V$-function are linear in some known feature space.
no code implementations • 23 Mar 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors).
no code implementations • NeurIPS 2021 • Yuanhao Wang, Ruosong Wang, Sham M. Kakade
This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal $Q$-function.
1 code implementation • 3 Mar 2012 • Animashree Anandkumar, Daniel Hsu, Sham M. Kakade
Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations.
no code implementations • 6 Jul 2021 • Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei
Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning.
no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problems and two-layer neural network with polynomial activation bandit problem.
no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions.
no code implementations • NeurIPS 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches.
no code implementations • 12 Oct 2021 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems.
no code implementations • NeurIPS 2021 • Yuanhao Wang, Ruosong Wang, Sham M. Kakade
The recent and remarkable result of Weisz et al. (2020) resolves this question in the negative, providing an exponential (in $d$) sample size lower bound, which holds even if the agent has access to a generative model of the environment.
no code implementations • 27 Dec 2021 • Dylan J. Foster, Sham M. Kakade, Jian Qian, Alexander Rakhlin
The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning.
no code implementations • 7 Mar 2022 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
no code implementations • 3 Aug 2022 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data.
no code implementations • 6 Oct 2022 • Dhruv Madeka, Kari Torkkola, Carson Eisenach, Anna Luo, Dean P. Foster, Sham M. Kakade
This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching.
no code implementations • 9 Oct 2022 • Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade
Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning.
no code implementations • 18 Oct 2022 • Abhishek Gupta, Aldo Pacchiano, Yuexiang Zhai, Sham M. Kakade, Sergey Levine
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results -- while in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that provide the agent with some hints about how the task should be completed.
no code implementations • 28 Feb 2023 • Sham M. Kakade, Akshay Krishnamurthy, Gaurav Mahajan, Cyril Zhang
In this paper, we depart from this setup and consider an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs.
no code implementations • 3 Mar 2023 • Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
On the other hand, we provide some negative results for stochastic gradient descent (SGD) for ReLU regression with symmetric Bernoulli data: if the model is well-specified, the excess risk of SGD is provably no better than that of GLM-tron ignoring constant factors, for each problem instance; and in the noiseless case, GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant risk in expectation.
no code implementations • 22 Mar 2023 • Dylan J. Foster, Noah Golowich, Sham M. Kakade
They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute a coarse correlated equilibrium that is sparse in the sense that it can be represented as a mixture of a small number of product policies.
no code implementations • 18 Apr 2024 • Yiwen Kou, Zixiang Chen, Quanquan Gu, Sham M. Kakade
We then demonstrate how a neural network trained with SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors.
1 code implementation • NeurIPS 2018 • Krishna Pillutla, Vincent Roulet, Sham M. Kakade, Zaid Harchaoui
We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon.
2 code implementations • 21 Dec 2009 • Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias Seeger
Many applications require optimizing an unknown, noisy function that is expensive to evaluate.
2 code implementations • 6 Dec 2018 • Elad Hazan, Sham M. Kakade, Karan Singh, Abby Van Soest
Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal: what might we hope that an agent can efficiently learn to do?
1 code implementation • 13 Nov 2017 • John Thickstun, Zaid Harchaoui, Dean Foster, Sham M. Kakade
This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings.
2 code implementations • arXiv preprint 2019 • Krishna Pillutla, Sham M. Kakade, Zaid Harchaoui
We present a robust aggregation approach to make federated learning robust to settings in which a fraction of the devices may be sending corrupted updates to the server.
1 code implementation • 1 Feb 2024 • Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context.
1 code implementation • 12 Oct 2016 • Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford
In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient both to reduce the variance of the stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method that averages the final few iterates of SGD to decrease the variance in SGD's final iterate.
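The two ideas named above can be sketched on a one-dimensional least squares problem. The data model, step size, batch size, and tail length below are assumptions chosen for illustration and do not reflect the parameter choices analyzed in the paper.

```python
import random


def minibatch_tail_averaged_sgd(w_true=2.0, lr=0.5, batch=4, steps=500,
                                tail=250, noise=0.1, seed=0):
    rng = random.Random(seed)
    w, tail_sum = 0.0, 0.0
    for t in range(steps):
        # Mini-batching: average `batch` stochastic gradients of 0.5*(w*x - y)^2.
        g = 0.0
        for _ in range(batch):
            x = rng.uniform(-1.0, 1.0)
            y = w_true * x + rng.gauss(0.0, noise)
            g += (w * x - y) * x
        w -= lr * g / batch
        # Tail-averaging: average only the final `tail` iterates.
        if t >= steps - tail:
            tail_sum += w
    return w, tail_sum / tail


w_last, w_tail = minibatch_tail_averaged_sgd()
```

Here `w_last` is the final SGD iterate and `w_tail` is the average of the last `tail` iterates; the early iterates are left out of the average because they still carry the error from the initial point.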
2 code implementations • ICLR 2018 • Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, Sham M. Kakade
Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.
1 code implementation • NeurIPS 2019 • Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli
First, this work shows that even if the time horizon T (i.e., the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case).