Search Results for author: Zeyuan Allen-Zhu

Found 43 papers, 4 papers with code

Reverse Training to Nurse the Reversal Curse

no code implementations 20 Mar 2024 Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse.
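
As a rough illustration of the reverse-training idea described above: train on both the original text and a copy whose units have been reversed. Whole-word reversal below is one simple segmentation choice, and make_training_pair is a hypothetical helper, not code from the paper.

def make_training_pair(text: str):
    # Sketch of reverse training: keep the usual left-to-right example and add a
    # right-to-left copy, here reversed at the word level.
    forward = text
    backward = " ".join(reversed(text.split()))
    return forward, backward

print(make_training_pair("A has a feature B"))
# ('A has a feature B', 'B feature a has A')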

Physics of Language Models: Part 3.2, Knowledge Manipulation

no code implementations 25 Sep 2023 Zeyuan Allen-Zhu, Yuanzhi Li

We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X"), classification (e.g., "Is A's attribute X even or odd?"), ...

Attribute, Language Modelling +2

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

no code implementations 25 Sep 2023 Zeyuan Allen-Zhu, Yuanzhi Li

This paper provides $\textbf{several key recommendations for LLM pretraining in the industry}$: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.

Question Answering, Sentence +1

Physics of Language Models: Part 1, Context-Free Grammar

no code implementations 23 May 2023 Zeyuan Allen-Zhu, Yuanzhi Li

We design controlled experiments to study HOW generative language models, like GPT, learn context-free grammars (CFGs) -- diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and logics.
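
For intuition, here is a toy CFG sampler of the kind such controlled experiments rely on; this particular grammar and its vocabulary are illustrative stand-ins, not the grammars used in the paper.

import random

# A tiny context-free grammar: keys are non-terminals, values are lists of productions.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "Adj": [["small"], ["green"]],
    "N":   [["model"], ["grammar"]],
    "V":   [["learns"], ["generates"]],
}

def sample(symbol="S"):
    if symbol not in GRAMMAR:                      # terminal symbol
        return [symbol]
    production = random.choice(GRAMMAR[symbol])    # pick a production uniformly
    return [tok for sym in production for tok in sample(sym)]

print(" ".join(sample()))                          # e.g. "the small model generates a grammar"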

LoRA: Low-Rank Adaptation of Large Language Models

44 code implementations ICLR 2022 Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

Language Modelling
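
A minimal sketch of the low-rank update described above for a single linear layer, assuming a PyTorch-style module; the rank r, scaling alpha, and initialization are illustrative defaults, and this is not the official loralib implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (alpha/r) * x A^T B^T, with W frozen and only A, B trained."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (stands in for the weight loaded from a checkpoint).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable rank-decomposition factors; B is zero-initialized so training
        # starts exactly from the pretrained model.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

Only lora_A and lora_B receive gradients, so the trainable parameter count drops from in_features * out_features to r * (in_features + out_features).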

Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions

no code implementations 4 Jun 2021 Zeyuan Allen-Zhu, Yuanzhi Li

Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions.

Super-Resolution

Byzantine-Resilient Non-Convex Stochastic Gradient Descent

no code implementations ICLR 2021 Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh

We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions.

Distributed Optimization

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

no code implementations 17 Dec 2020 Zeyuan Allen-Zhu, Yuanzhi Li

Our result sheds light on how ensembles work in deep learning in a way that is completely different from traditional theorems, and on how the "dark knowledge" hidden in the outputs of the ensemble can be used in distillation.

Knowledge Distillation, Learning Theory
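
A rough sketch of how the ensemble's soft outputs (the "dark knowledge") can be used as distillation targets; the random logits, three-model ensemble, and temperature T are illustrative placeholders rather than the paper's setup.

import numpy as np

def softmax(z, T=1.0):
    z = z / T                                     # temperature-scaled logits
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 4.0
ensemble_logits = [rng.standard_normal((4, 10)) for _ in range(3)]   # 3 models, 4 examples, 10 classes
teacher_probs = np.mean([softmax(l, T) for l in ensemble_logits], axis=0)

student_logits = rng.standard_normal((4, 10))                        # placeholder student outputs
student_probs = softmax(student_logits, T)

# Distillation objective: cross-entropy of the student against the ensemble's soft labels.
distill_loss = -np.mean(np.sum(teacher_probs * np.log(student_probs + 1e-12), axis=-1))
print(distill_loss)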

Feature Purification: How Adversarial Training Performs Robust Deep Learning

no code implementations 20 May 2020 Zeyuan Allen-Zhu, Yuanzhi Li

Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.

Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

no code implementations 13 Jan 2020 Zeyuan Allen-Zhu, Yuanzhi Li

On the technical side, we show for every input dimension $d > 0$, there is a concept class of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, through learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction."

Binary Classification
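
To picture the concept class in the statement above, here is an illustrative target built as a composition of a few quadratic layers; the depth, width, and weights are arbitrary placeholders, not the paper's construction.

import numpy as np

rng = np.random.default_rng(0)
d, depth = 20, 4
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

def target(x):
    h = x
    for W in weights:
        h = (W @ h) ** 2          # each layer applies a quadratic function of its input
    return h.sum()

print(target(rng.standard_normal(d)))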

What Can ResNet Learn Efficiently, Going Beyond Kernels?

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li

Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels with similar test error.

One-Shot Learning

The Lingering of Gradients: Theory and Applications

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu, David Simchi-Levi, Xinshang Wang

Classically, the time complexity of a first-order method is estimated by its number of gradient computations.

Management

The Lingering of Gradients: How to Reuse Gradients Over Time

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu, David Simchi-Levi, Xinshang Wang

Classically, the time complexity of a first-order method is estimated by its number of gradient computations.

Management

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang

In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations.

Learning Theory, Vocal Bursts Valence Prediction

A Convergence Theory for Deep Learning via Over-Parameterization

no code implementations 9 Nov 2018 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

On the Convergence Rate of Training Recurrent Neural Networks

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.

Is Q-learning Provably Efficient?

1 code implementation NeurIPS 2018 Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, Michael I. Jordan

We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps.

Q-Learning, Reinforcement Learning (RL)
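
A simplified sketch of episodic Q-learning with a UCB-style exploration bonus, in the spirit of the result above; the toy environment, the constant c, and the bonus scaling are illustrative placeholders, not a faithful reproduction of the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                      # states, actions, horizon (toy sizes)
c, delta, T_total = 1.0, 0.1, 10_000

Q = np.full((H, S, A), float(H))       # optimistic initialization
N = np.zeros((H, S, A), dtype=int)

def env_step(s, a, h):
    """Placeholder dynamics: replace with a real episodic MDP."""
    return rng.random(), rng.integers(S)               # (reward, next state)

for episode in range(T_total // H):
    s = int(rng.integers(S))
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                    # act greedily w.r.t. optimistic Q
        r, s_next = env_step(s, a, h)
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha_t = (H + 1) / (H + t)                    # learning rate from the paper's analysis
        bonus = c * np.sqrt(H ** 3 * np.log(S * A * T_total / delta) / t)
        V_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
        Q[h, s, a] += alpha_t * (r + V_next + bonus - Q[h, s, a])
        s = int(s_next)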

Katyusha X: Simple Momentum Method for Stochastic Sum-of-Nonconvex Optimization

no code implementations ICML 2018 Zeyuan Allen-Zhu

The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are averages of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more.
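
For reference, the problem class in question written out (the standard sum-of-nonconvex formulation, not the Katyusha X update itself):

$$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad f \ \text{convex}, \qquad \text{each } f_i \ \text{possibly non-convex.}$$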

Byzantine Stochastic Gradient Descent

no code implementations NeurIPS 2018 Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.

Stochastic Optimization
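
Purely to illustrate the setting, not the paper's algorithm: m workers report stochastic gradients, an alpha-fraction of which may be arbitrary, and a naive average is easily corrupted while a simple robust aggregator such as the coordinate-wise median is not. All numbers below are made up.

import numpy as np

rng = np.random.default_rng(0)
m, d, alpha = 10, 5, 0.2
n_byz = int(alpha * m)
honest = [1.0 + 0.1 * rng.standard_normal(d) for _ in range(m - n_byz)]   # true gradient is ~1 per coordinate
byzantine = [np.full(d, -100.0) for _ in range(n_byz)]                    # arbitrary adversarial reports
reports = np.stack(honest + byzantine)

print("naive average :", reports.mean(axis=0))       # badly corrupted by the Byzantine workers
print("coord. median :", np.median(reports, axis=0)) # stays close to the honest mean (~1)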

Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization

no code implementations 12 Feb 2018 Zeyuan Allen-Zhu

The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are averages of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more.

Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

no code implementations ICML 2018 Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li

Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy.

Multi-Armed Bandits
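
For reference, the standard definitions behind these statements (textbook notation, not text from the paper): the player is compared with the best fixed policy in hindsight, and a first-order bound scales with $L^*$ rather than with the horizon $T$.

$$\mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(a_t) - L^*, \qquad L^* = \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi(x_t)),$$

so a first-order bound is roughly of the form $\tilde{O}(\sqrt{L^*})$ (up to dependence on the number of actions and the policy class) instead of $\tilde{O}(\sqrt{T})$.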

How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$.

Neon2: Finding Local Minima via First-Order Oracles

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu, Yuanzhi Li

We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with only gradient computations.

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach

no code implementations 14 Nov 2017 Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang

The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency regressed on the selected k design points.

Experimental Design
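
As a concrete point of reference, one common optimality criterion from this family (the papers handle a broader class of criteria): choose $S \subseteq \{1, \dots, n\}$ with $|S| = k$ minimizing the A-optimality objective

$$\mathrm{tr}\Big( \big( \sum_{i \in S} x_i x_i^\top \big)^{-1} \Big),$$

i.e., the total variance of the least-squares estimator computed on the selected design points.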

Natasha 2: Faster Non-Convex Optimization Than SGD

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu

We design a stochastic algorithm to train any smooth neural network to $\varepsilon$-approximate local minima, using $O(\varepsilon^{-3.25})$ backpropagations.

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

no code implementations NeurIPS 2017 Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, Yuanzhi Li

We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball.
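
For context, a sketch of the classical rank-1 Frank-Wolfe step over the trace-norm ball that the paper's rank-$k$ variant generalizes; the power-iteration oracle, step size, and example objective below are illustrative choices, not the paper's algorithm.

import numpy as np

def frank_wolfe_trace_norm(grad_f, X0, radius, steps=200, power_iters=50):
    """Classical rank-1 Frank-Wolfe over {X : ||X||_* <= radius}; the linear
    minimization oracle only needs the top singular vector pair of the gradient."""
    X = X0.copy()
    for t in range(steps):
        G = grad_f(X)
        v = np.random.randn(G.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(power_iters):          # power iteration for the top right singular vector of G
            v = G.T @ (G @ v)
            v /= np.linalg.norm(v)
        u = G @ v
        u /= np.linalg.norm(u)                # corresponding left singular vector
        S = -radius * np.outer(u, v)          # vertex of the ball minimizing <G, S>
        eta = 2.0 / (t + 2)                   # standard Frank-Wolfe step size
        X = (1 - eta) * X + eta * S
    return X

# Example use: f(X) = 0.5 * ||X - M||_F^2, whose gradient is X - M.
M = np.random.randn(30, 20)
X_hat = frank_wolfe_trace_norm(lambda X: X - M, np.zeros_like(M), radius=10.0)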

Near-Optimal Design of Experiments via Regret Minimization

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang

We consider computationally tractable methods for the experimental design problem, where k out of n design points of dimension p are selected so that certain optimality criteria are approximately satisfied.

Experimental Design

Natasha: Faster Non-Convex Stochastic Optimization Via Strongly Non-Convex Parameter

no code implementations ICML 2017 Zeyuan Allen-Zhu

Given a nonconvex function that is an average of $n$ smooth functions, we design stochastic first-order methods to find its approximate stationary points.

Stochastic Optimization

Finding Approximate Local Minima Faster than Gradient Descent

1 code implementation 3 Nov 2016 Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, Tengyu Ma

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples.

BIG-bench Machine Learning

Faster Principal Component Regression and Stable Matrix Chebyshev Approximation

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li

We solve principal component regression (PCR), up to a multiplicative accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{-1})$ black-box calls of ridge regression.

regression

First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate

no code implementations 26 Jul 2016 Zeyuan Allen-Zhu, Yuanzhi Li

We provide $\textit{global}$ convergence for Oja's algorithm which is popularly used in practice but lacks theoretical understanding for $k>1$.
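
For reference, Oja's update for streaming k-PCA, i.e., the algorithm the paper analyzes; the step size, dimensions, and synthetic data stream below are placeholders.

import numpy as np

rng = np.random.default_rng(0)
d, k, eta, steps = 50, 3, 0.01, 10_000
W = np.linalg.qr(rng.standard_normal((d, k)))[0]        # random orthonormal start

cov_factor = rng.standard_normal((d, d)) / np.sqrt(d)   # fixes a covariance to stream samples from

for t in range(steps):
    x = cov_factor @ rng.standard_normal(d)             # one fresh sample from the stream
    W = W + eta * np.outer(x, x @ W)                     # Oja update: W <- W + eta * x x^T W
    W, _ = np.linalg.qr(W)                               # keep the k columns orthonormal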

Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li

We study $k$-GenEV, the problem of finding the top $k$ generalized eigenvectors, and $k$-CCA, the problem of finding the top $k$ vectors in canonical-correlation analysis.
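
The two problems written out in their standard form (textbook definitions, not text from the paper):

$$k\text{-GenEV:} \quad A v_i = \lambda_i B v_i, \quad i = 1, \dots, k, \qquad B \succ 0,$$
$$k\text{-CCA:} \quad \max_{u, v} \ u^\top \Sigma_{xy} v \quad \text{s.t.} \quad u^\top \Sigma_{xx} u = v^\top \Sigma_{yy} v = 1,$$

with the later canonical pairs constrained to be $\Sigma_{xx}$- and $\Sigma_{yy}$-orthogonal to the earlier ones; CCA can itself be cast as a generalized eigenvalue problem.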

LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain

no code implementations NeurIPS 2016 Zeyuan Allen-Zhu, Yuanzhi Li

In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$ running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods

no code implementations 18 Mar 2016 Zeyuan Allen-Zhu

However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex and finite-sum.

Stochastic Optimization

Optimal Black-Box Reductions Between Optimization Objectives

no code implementations NeurIPS 2016 Zeyuan Allen-Zhu, Elad Hazan

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand.

BIG-bench Machine Learning, General Classification +1

Variance Reduction for Faster Non-Convex Optimization

no code implementations 17 Mar 2016 Zeyuan Allen-Zhu, Elad Hazan

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point.
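
For reference, the SVRG-type variance-reduced gradient estimator that this line of work builds on, with $\tilde{x}$ a snapshot point (standard form, not text from the paper):

$$\widetilde{\nabla} f(x_t) = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x}), \qquad i_t \sim \mathrm{Uniform}\{1, \dots, n\},$$

an unbiased estimate of $\nabla f(x_t)$ whose variance shrinks as $x_t$ and $\tilde{x}$ approach each other.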

Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

no code implementations 30 Dec 2015 Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, Yang Yuan

Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems.
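
A plain, non-accelerated illustration of non-uniform coordinate sampling on a least-squares objective; sampling proportional to the coordinate-wise smoothness $L_i$ is a common choice shown here for intuition only, not the paper's exact accelerated scheme or distribution.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)

L = (A ** 2).sum(axis=0)            # coordinate-wise smoothness of f(x) = 0.5 * ||Ax - b||^2
p = L / L.sum()                     # non-uniform sampling probabilities

x = np.zeros(A.shape[1])
for t in range(5000):
    i = rng.choice(len(x), p=p)     # sample a coordinate with probability proportional to L_i
    g_i = A[:, i] @ (A @ x - b)     # partial derivative along coordinate i
    x[i] -= g_i / L[i]              # coordinate step with step size 1/L_i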

Spectral Sparsification and Regret Minimization Beyond Matrix Multiplicative Updates

no code implementations 16 Jun 2015 Zeyuan Allen-Zhu, Zhenyu Liao, Lorenzo Orecchia

In this paper, we provide a novel construction of the linear-sized spectral sparsifiers of Batson, Spielman and Srivastava [BSS14].

Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives

3 code implementations 5 Jun 2015 Zeyuan Allen-Zhu, Yang Yuan

Many classical algorithms are found, often only several years later, to outlive the confines in which they were conceived, and continue to be relevant in unforeseen settings.

regression

Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent

no code implementations 6 Jul 2014 Zeyuan Allen-Zhu, Lorenzo Orecchia

First-order methods play a central role in large-scale machine learning.
