no code implementations • 25 Sep 2023 • Zeyuan Allen-Zhu, Yuanzhi Li
We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X?"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?"), and inverse search (e.g., "Which person's attribute X equals T?").
no code implementations • 23 May 2023 • Zeyuan Allen-Zhu, Yuanzhi Li
We design controlled experiments to study HOW generative language models, like GPT, learn context-free grammars (CFGs) -- diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and logics.
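A minimal sketch of the kind of controlled setup such experiments rely on, assuming a toy hand-written grammar (not one used in the paper): sample grammar-conforming strings from a synthetic CFG to build a training corpus for a generative model.

```python
import random

# Toy context-free grammar: each non-terminal maps to a list of productions,
# and each production is a sequence of non-terminals or terminal symbols.
GRAMMAR = {
    "S": [["A", "B"], ["B", "A", "S"]],
    "A": [["a"], ["a", "A"]],
    "B": [["b"], ["b", "B"]],
}

def sample(symbol="S", max_depth=10):
    """Recursively expand `symbol`, returning a list of terminal tokens."""
    if symbol not in GRAMMAR:            # terminal symbol
        return [symbol]
    if max_depth <= 0:
        rule = GRAMMAR[symbol][0]        # cut off runaway recursion with the shortest production
    else:
        rule = random.choice(GRAMMAR[symbol])
    out = []
    for s in rule:
        out.extend(sample(s, max_depth - 1))
    return out

# Generate a small corpus of grammar-conforming strings for language-model training.
corpus = [" ".join(sample()) for _ in range(5)]
print(corpus)
```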
33 code implementations • ICLR 2022 • Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
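A minimal PyTorch sketch of the idea (not the reference implementation released with the paper): keep the pre-trained weight matrix frozen and learn only a rank-decomposition update $BA$.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)      # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # trainable
        self.B = nn.Parameter(torch.zeros(out_features, rank))         # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # only the rank-decomposition matrices
```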
no code implementations • 4 Jun 2021 • Zeyuan Allen-Zhu, Yuanzhi Li
Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions.
no code implementations • ICLR 2021 • Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh
We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.
no code implementations • 17 Dec 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
Our result sheds light on how ensembles work in deep learning in a way that is completely different from traditional theorems, and on how the "dark knowledge" is hidden in the outputs of the ensemble and can be used in distillation.
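As a generic illustration of distilling an ensemble's dark knowledge into a single model (a sketch, not the paper's experimental setup): average the ensemble's temperature-softened outputs and train the student to match them with a KL loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, ensemble_logits_list, T=4.0):
    """KL divergence between the student and the averaged, temperature-softened ensemble outputs."""
    # Average the ensemble members' softened probabilities (the "dark knowledge").
    teacher_probs = torch.stack(
        [F.softmax(l / T, dim=-1) for l in ensemble_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T

# Example with random logits for a batch of 8 examples and 10 classes.
student = torch.randn(8, 10, requires_grad=True)
ensemble = [torch.randn(8, 10) for _ in range(3)]
print(distillation_loss(student, ensemble).item())
```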
no code implementations • 20 May 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
no code implementations • 13 Jan 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
On the technical side, we show that for every input dimension $d > 0$, there is a concept class of degree-$\omega(1)$ multivariate polynomials such that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, by learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction."
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li
Recently, an influential line of work has related neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels with similar test error.
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li
Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis.
no code implementations • NeurIPS 2018 • Zeyuan Allen-Zhu, David Simchi-Levi, Xinshang Wang
Classically, the time complexity of a first-order method is estimated by its number of gradient computations.
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang
In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations.
no code implementations • 9 Nov 2018 • Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.
no code implementations • NeurIPS 2018 • Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan
We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps.
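A schematic tabular sketch of Q-learning with a UCB-style exploration bonus on a toy episodic MDP; the bonus and constants are simplified relative to the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                             # states, actions, horizon (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
R = rng.random((S, A))                        # rewards in [0, 1]

Q = np.full((H, S, A), float(H))              # optimistic initialization at the maximum return
N = np.zeros((H, S, A))                       # visit counts

for episode in range(2000):
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                          # act greedily w.r.t. optimistic Q
        N[h, s, a] += 1
        t = N[h, s, a]
        bonus = np.sqrt(H / t)                               # simplified UCB exploration bonus
        lr = (H + 1) / (H + t)                               # step size used in the paper's analysis
        s_next = rng.choice(S, p=P[s, a])
        v_next = min(float(H), Q[h + 1, s_next].max()) if h + 1 < H else 0.0
        Q[h, s, a] = (1 - lr) * Q[h, s, a] + lr * (R[s, a] + v_next + bonus)
        s = s_next

print(Q[0, 0])   # learned optimistic Q-values at the initial state
```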
no code implementations • ICML 2018 • Zeyuan Allen-Zhu
The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are the average of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more.
no code implementations • NeurIPS 2018 • Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.
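A rough sketch of the setting, using a simple coordinate-wise-median aggregator as a stand-in (the paper develops a different, filtering-based procedure): the server aggregates the $m$ reported gradients robustly instead of averaging them.

```python
import numpy as np

def robust_aggregate(gradients):
    """Coordinate-wise median of the workers' reported gradients.
    Illustrative only: the paper's aggregator is a concentration-based filtering procedure."""
    return np.median(np.stack(gradients), axis=0)

def byzantine_sgd_step(x, honest_grad_fn, m, alpha=0.2, lr=0.1, rng=None):
    rng = rng or np.random.default_rng()
    n_bad = int(alpha * m)
    # Honest workers report noisy stochastic gradients; Byzantine workers report garbage.
    grads = [honest_grad_fn(x) + rng.normal(scale=0.1, size=x.shape) for _ in range(m - n_bad)]
    grads += [rng.normal(scale=10.0, size=x.shape) for _ in range(n_bad)]
    return x - lr * robust_aggregate(grads)

# Example: minimize f(x) = ||x||^2 / 2 with 20% Byzantine workers.
x = np.ones(5)
for _ in range(100):
    x = byzantine_sgd_step(x, lambda x: x, m=10)
print(np.linalg.norm(x))   # stays small despite the corrupted gradient reports
```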
no code implementations • 12 Feb 2018 • Zeyuan Allen-Zhu
The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are the average of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more.
no code implementations • ICML 2018 • Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li
Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy.
no code implementations • NeurIPS 2018 • Zeyuan Allen-Zhu
Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$.
no code implementations • NeurIPS 2018 • Zeyuan Allen-Zhu, Yuanzhi Li
We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with only gradient computations.
no code implementations • 14 Nov 2017 • Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang
The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of regression on the selected k design points.
no code implementations • NeurIPS 2018 • Zeyuan Allen-Zhu
We design a stochastic algorithm to train any smooth neural network to $\varepsilon$-approximate local minima, using $O(\varepsilon^{-3.25})$ backpropagations.
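As a much simpler stand-in for the paper's algorithm, the sketch below uses noise-perturbed gradient descent to illustrate the goal: injected noise lets the iterate escape saddle points and settle near approximate local minima.

```python
import numpy as np

def perturbed_gd(grad, x0, lr=0.01, noise=1e-2, steps=5000, rng=None):
    """Gradient descent with noise injection near stationary points.
    A generic illustration, not the paper's method."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < noise:                    # near a stationary point: perturb
            x = x + rng.normal(scale=noise, size=x.shape)
        else:
            x = x - lr * g
    return x

# f(x, y) = (x^2 - 1)^2 + y^2 has a saddle at the origin and local minima at (+-1, 0).
grad = lambda v: np.array([4 * v[0] * (v[0] ** 2 - 1), 2 * v[1]])
print(perturbed_gd(grad, [0.0, 0.0]))   # escapes the saddle and lands near (+-1, 0)
```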
no code implementations • NeurIPS 2017 • Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, Yuanzhi Li
We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball.
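A sketch of the classical rank-1 Frank-Wolfe step over the trace-norm ball that the paper generalizes: each iteration only needs the top singular pair of the gradient (the paper's variant reuses $k$ singular pairs per iteration).

```python
import numpy as np

def frank_wolfe_tracenorm(grad, X0, theta, steps=100):
    """Classical (rank-1) Frank-Wolfe over the trace-norm ball {X : ||X||_* <= theta}."""
    X = X0.copy()
    for t in range(steps):
        G = grad(X)
        # Linear minimization oracle: only the top singular pair of the gradient is needed
        # (a full SVD here for simplicity; power iteration suffices in practice).
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        S = -theta * np.outer(U[:, 0], Vt[0, :])       # extreme point of the trace-norm ball
        eta = 2.0 / (t + 2.0)                          # standard Frank-Wolfe step size
        X = (1 - eta) * X + eta * S
    return X

# Toy objective f(X) = 0.5 * ||X - M||_F^2 with a rank-2 target M on the ball boundary.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 30))
theta = np.linalg.svd(M, compute_uv=False).sum()
X = frank_wolfe_tracenorm(lambda X: X - M, np.zeros((20, 30)), theta)
print(np.linalg.norm(X - M) / np.linalg.norm(M))   # relative error shrinks with more steps
```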
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang
We consider computationally tractable methods for the experimental design problem, where k out of n design points of dimension p are selected so that certain optimality criteria are approximately satisfied.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu
Given a nonconvex function that is an average of $n$ smooth functions, we design stochastic first-order methods to find its approximate stationary points.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
The online problem of computing the top eigenvector is fundamental to machine learning.
1 code implementation • 3 Nov 2016 • Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, Tengyu Ma
We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
We solve principal component regression (PCR), up to a multiplicative accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{-1})$ black-box calls of ridge regression.
no code implementations • 26 Jul 2016 • Zeyuan Allen-Zhu, Yuanzhi Li
We provide a $\textit{global}$ convergence guarantee for Oja's algorithm, which is widely used in practice but lacks theoretical understanding for $k>1$.
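A minimal sketch of block Oja's algorithm for the top-$k$ eigenvectors from streaming samples, with step size and initialization simplified relative to the paper's analysis.

```python
import numpy as np

def oja_top_k(sample_stream, d, k, eta=0.05):
    """Streaming Oja's algorithm: rank-k stochastic power iteration with QR re-orthonormalization."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for x in sample_stream:
        Q = Q + eta * np.outer(x, x @ Q)               # stochastic update with x x^T Q
        Q, _ = np.linalg.qr(Q)                         # keep the columns orthonormal
    return Q

# Samples with covariance diag(3, 2, 1, ..., 1): the top-2 eigenvectors are e_1 and e_2.
d, k = 10, 2
rng = np.random.default_rng(1)
scales = np.sqrt(np.array([3.0, 2.0] + [1.0] * (d - 2)))
stream = (scales * rng.standard_normal(d) for _ in range(20000))
Q = oja_top_k(stream, d, k)
print(np.round(np.abs(Q[:4]), 2))   # mass concentrates on the first two coordinates
```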
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
We study $k$-GenEV, the problem of finding the top $k$ generalized eigenvectors, and $k$-CCA, the problem of finding the top $k$ vectors in canonical-correlation analysis.
no code implementations • NeurIPS 2016 • Zeyuan Allen-Zhu, Yuanzhi Li
In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$ running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.
no code implementations • 18 Mar 2016 • Zeyuan Allen-Zhu
However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex and finite-sum.
no code implementations • NeurIPS 2016 • Zeyuan Allen-Zhu, Elad Hazan
The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand.
no code implementations • 17 Mar 2016 • Zeyuan Allen-Zhu, Elad Hazan
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point.
no code implementations • NeurIPS 2016 • Zeyuan Allen-Zhu, Yang Yuan, Karthik Sridharan
The amount of data available in the world is growing faster than our ability to deal with it.
no code implementations • 30 Dec 2015 • Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, Yang Yuan
Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems.
no code implementations • 16 Jun 2015 • Zeyuan Allen-Zhu, Zhenyu Liao, Lorenzo Orecchia
In this paper, we provide a novel construction of the linear-sized spectral sparsifiers of Batson, Spielman and Srivastava [BSS14].
3 code implementations • 5 Jun 2015 • Zeyuan Allen-Zhu, Yang Yuan
Many classical algorithms are found, sometimes only years later, to outlive the confines in which they were conceived and to remain relevant in unforeseen settings.
no code implementations • 6 Jul 2014 • Zeyuan Allen-Zhu, Lorenzo Orecchia
First-order methods play a central role in large-scale machine learning.