no code implementations • 28 Nov 2023 • Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
no code implementations • 22 Nov 2023 • Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang
For (i), we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude at (direct, no-scratchpad) 15-digit multiplication, essentially perfect up to 12 digits, while standard training in this setting would give a model that fails at 4-digit multiplication.
no code implementations • 18 Oct 2023 • Yuanzhi Li, Raghu Meka, Rina Panigrahy, Kulin Shah
Deep networks typically learn concepts via classifiers, which involves setting up a model and training it via gradient descent to fit the concept-labeled data.
2 code implementations • 2 Oct 2023 • Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li
We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents.
no code implementations • 2 Oct 2023 • Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve model performance.
no code implementations • 25 Sep 2023 • Zeyuan Allen-Zhu, Yuanzhi Li
We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X?"), classification (e.g., "Is A's attribute X even or odd?"), and others.
no code implementations • 25 Sep 2023 • Zeyuan Allen-Zhu, Yuanzhi Li
Large language models can store extensive world knowledge, often extractable through question-answering (e.g., "What is Abraham Lincoln's birthday?").
no code implementations • 11 Sep 2023 • Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories} -- a 10 million parameter model that can produce coherent English -- and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art.
Ranked #1 on Question Answering on SIQA
no code implementations • 1 Sep 2023 • Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO.
no code implementations • 27 Jun 2023 • Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton
We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums.
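A minimal sketch of the kind of length-generalization protocol this finding refers to: train on short additions and evaluate on much longer ones. The string formatting and digit ranges here are illustrative assumptions, not the paper's exact setup.

```python
import random

def addition_example(n_digits: int) -> tuple[str, str]:
    # Sample two n-digit operands and format the prompt/answer as plain strings.
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

train = [addition_example(random.randint(1, 5)) for _ in range(10_000)]  # train on up to 5 digits
test = [addition_example(15) for _ in range(1_000)]                      # probe length generalization
```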
no code implementations • 20 Jun 2023 • Yuan Cao, Difan Zou, Yuanzhi Li, Quanquan Gu
We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
no code implementations • 20 Jun 2023 • Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li
Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
Ranked #17 on Code Generation on HumanEval
no code implementations • 9 Jun 2023 • Eric Luxenberg, Dhruv Malik, Yuanzhi Li, Aarti Singh, Stephen Boyd
We consider robust empirical risk minimization (ERM), where model parameters are chosen to minimize the worst-case empirical loss when each data point varies over a given convex uncertainty set.
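As a concrete special case of this setup, here is a small cvxpy sketch of robust regression with an absolute loss and row-wise $\ell_\infty$ uncertainty of radius delta on the features, for which the worst-case loss has a closed form; the loss, uncertainty set, and data are illustrative assumptions rather than the paper's general formulation.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, delta = 50, 10, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = cp.Variable(n)
# For each row, the worst case of |(a_i + u)^T x - b_i| over ||u||_inf <= delta
# equals |a_i^T x - b_i| + delta * ||x||_1.
worst_case = cp.abs(A @ x - b) + delta * cp.norm1(x)
cp.Problem(cp.Minimize(cp.sum(worst_case))).solve()
print(x.value)
```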
no code implementations • 2 Jun 2023 • Binghui Li, Yuanzhi Li
However, in contrast to clean generalization, while adversarial training is able to achieve a low $\textit{robust training error}$, there still exists a significant $\textit{robust generalization gap}$, which prompts us to explore what mechanism leads to both $\textit{clean generalization and robust overfitting (CGRO)}$ during the learning process.
no code implementations • 31 May 2023 • Yan Pan, Yuanzhi Li
We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose to use coordinate-wise clipping as a solution to SGD and other optimization algorithms.
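A minimal sketch of one simple instantiation of coordinate-wise clipping on top of plain SGD; the clipping threshold and learning rate are hypothetical values, not the paper's settings.

```python
import torch

def coordinate_clipped_sgd_step(params, lr: float = 0.1, clip: float = 1e-3):
    # Clamp each gradient coordinate to [-clip, clip] before the SGD update.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad.clamp(-clip, clip), alpha=-lr)
```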
no code implementations • 24 May 2023 • Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, Yuanzhi Li
Finally, we show the potential of games as a test bed for LLMs.
no code implementations • 23 May 2023 • Zeyuan Allen-Zhu, Yuanzhi Li
We design controlled experiments to study HOW generative language models, like GPT, learn context-free grammars (CFGs) -- diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and logics.
no code implementations • 12 May 2023 • Ronen Eldan, Yuanzhi Li
In this work, we introduce TinyStories, a synthetic dataset of short stories that contain only words a typical 3- to 4-year-old usually understands, generated by GPT-3.5 and GPT-4.
no code implementations • 4 May 2023 • Dhruv Malik, Conor Igoe, Yuanzhi Li, Aarti Singh
Motivated by this, a significant line of work has formalized settings where an action's loss is a function of the number of times that action was recently played in the prior $m$ timesteps, where $m$ corresponds to a bound on human memory capacity.
no code implementations • 3 May 2023 • Yue Wu, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Yuanzhi Li, Tom Mitchell, Shrimai Prabhumoye
We propose the Plan, Eliminate, and Track (PET) framework.
no code implementations • 7 Apr 2023 • Yunwei Ren, Yuanzhi Li
Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved huge success in multimodal learning, where the model tries to minimize the distance between the representations of different views (e.g., an image and its caption) of the same data point while keeping the representations of different data points away from each other.
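For reference, a minimal sketch of the symmetric contrastive objective used by CLIP-style models: matching image/text pairs are pulled together and mismatched pairs pushed apart; the temperature value here is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Normalize embeddings, score all image-text pairs, and apply a symmetric
    # cross-entropy whose targets are the matching (diagonal) pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```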
1 code implementation • 22 Mar 2023 • Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models.
Ranked #13 on Arithmetic Reasoning on GSM8K
no code implementations • 15 Mar 2023 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu
We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data).
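A minimal sketch of a standard Mixup training step (mixing both inputs and one-hot labels with a Beta-distributed coefficient); the mixing parameter alpha is an assumption.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, num_classes: int, alpha: float = 0.2):
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]           # convex combination of two inputs
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    logits = model(x_mix)
    return -(y_mix * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```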
1 code implementation • 7 Mar 2023 • Yuchen Li, Yuanzhi Li, Andrej Risteski
While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking.
1 code implementation • 7 Feb 2023 • Michael Santacroce, Zixin Wen, Yelong Shen, Yuanzhi Li
Auto-regressive large language models such as GPT-3 require enormous computational resources to use.
no code implementations • 13 Oct 2022 • Samy Jelassi, Michael E. Sander, Yuanzhi Li
On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association.
no code implementations • 9 Oct 2022 • Samy Jelassi, David Dobre, Arthur Mensch, Yuanzhi Li, Gauthier Gidel
By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
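A rough sketch of one plausible reading of that update rule, combining the per-tensor magnitude of an Adam step with the normalized gradient (SGD) direction; the state handling and hyperparameters here are assumptions, not the authors' implementation.

```python
import torch

def adam_magnitude_sgd_direction_step(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    with torch.no_grad():
        state["t"] = state.get("t", 0) + 1
        t = state["t"]
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad
            m = state.setdefault(("m", i), torch.zeros_like(p))
            v = state.setdefault(("v", i), torch.zeros_like(p))
            m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
            v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
            adam_step = lr * (m / (1 - betas[0] ** t)) / ((v / (1 - betas[1] ** t)).sqrt() + eps)
            direction = g / (g.norm() + 1e-12)                 # normalized SGD direction
            p.add_(direction, alpha=-adam_step.norm().item())  # scaled by Adam's step magnitude
```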
no code implementations • 22 Sep 2022 • Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, Anru R. Zhang
We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2.
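For context, the standard DDPM forward (noising) and learned reverse (denoising) steps that such guarantees are typically stated for, in the usual notation with $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s\le t}\alpha_s$:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big),
\quad \text{with} \quad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big).
```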
2 code implementations • 4 Aug 2022 • Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li
To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.
no code implementations • 13 Jul 2022 • Samy Jelassi, Yuanzhi Li
Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures.
no code implementations • 31 May 2022 • Sitan Chen, Jerry Li, Yuanzhi Li
Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem.
no code implementations • 12 May 2022 • Zixin Wen, Yuanzhi Li
The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head.
no code implementations • 24 Apr 2022 • Dhruv Malik, Yuanzhi Li, Aarti Singh
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
no code implementations • 8 Apr 2022 • Sitan Chen, Jerry Li, Yuanzhi Li, Anru R. Zhang
Our first main result is a polynomial-time algorithm for learning quadratic transformations of Gaussians in a smoothed setting.
no code implementations • ICLR 2022 • Sitan Chen, Jerry Li, Yuanzhi Li, Raghu Meka
Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution.
1 code implementation • NeurIPS 2021 • Stefani Karp, Ezra Winston, Yuanzhi Li, Aarti Singh
We therefore propose the "local signal adaptivity" (LSA) phenomenon as one explanation for the superiority of neural networks over kernel methods.
no code implementations • 1 Nov 2021 • Yuanzhi Li, Ruosong Wang, Lin F. Yang
Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed.
no code implementations • 29 Sep 2021 • Zehao Dou, Yuanzhi Li
To the best of our knowledge, this is the very first result which provides an empirical observation and a strict theoretical guarantee on the one-sided convergence of Adam-type algorithms in min-max optimization.
no code implementations • 29 Sep 2021 • Samy Jelassi, Arthur Mensch, Gauthier Gidel, Yuanzhi Li
We empirically show that SGDA with the same vector norm as Adam reaches similar or even better performance than the latter.
no code implementations • 25 Aug 2021 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu
In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.
33 code implementations • ICLR 2022 • Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
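A minimal sketch of the idea for a single linear layer: the pretrained weight is frozen and only a rank-$r$ update $BA$ (scaled by $\alpha/r$) is trained; the initialization and hyperparameter values follow the common convention but are simplified.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```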
no code implementations • 15 Jun 2021 • Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, Yuanzhi Li
Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces.
no code implementations • 4 Jun 2021 • Zeyuan Allen-Zhu, Yuanzhi Li
Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions.
no code implementations • 31 May 2021 • Zixin Wen, Yuanzhi Li
We present an underlying principle called $\textbf{feature decoupling}$ to explain the effects of augmentations, where we theoretically characterize how augmentations can reduce the correlations of dense features between positive samples while keeping the correlations of sparse features intact, thereby forcing the neural networks to learn from the self-supervision of sparse features.
1 code implementation • ICLR 2021 • Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar
We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability.
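Edge-of-Stability experiments track the sharpness (the largest Hessian eigenvalue of the training loss) along the trajectory; below is a minimal sketch of estimating it with Hessian-vector products and power iteration, a standard recipe rather than the paper's exact code.

```python
import torch

def hessian_vector_product(loss, params, vec):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad((flat * vec).sum(), params)
    return torch.cat([h.reshape(-1) for h in hv])

def sharpness(loss_fn, params, iters: int = 20):
    # Power iteration on the Hessian of the full-batch loss.
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v = v / v.norm()
    for _ in range(iters):
        hv = hessian_vector_product(loss_fn(), params, v)
        v = hv / (hv.norm() + 1e-12)
    return float((v * hessian_vector_product(loss_fn(), params, v)).sum())
```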
no code implementations • NeurIPS 2021 • Dhruv Malik, Yuanzhi Li, Pradeep Ravikumar
Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment.
no code implementations • 17 Dec 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the ``dark knowledge'' is hidden in the outputs of the ensemble and can be used in distillation.
no code implementations • 30 Sep 2020 • Sébastien Bubeck, Yuanzhi Li, Dheeraj Nagaraj
We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layer neural network with $k$ neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$, where $n$ is the number of datapoints.
no code implementations • 9 Jul 2020 • Yuanzhi Li, Tengyu Ma, Hongyang R. Zhang
We consider the dynamic of gradient descent for learning a two-layer neural network.
no code implementations • 20 May 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
no code implementations • 9 Mar 2020 • Yuanzhi Li, Zehao Dou
In GANs, the training of the generator usually stops when the discriminator can no longer distinguish the generator's output from the set of training examples.
no code implementations • 13 Jan 2020 • Zeyuan Allen-Zhu, Yuanzhi Li
On the technical side, we show for every input dimension $d > 0$, there is a concept class of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, through learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction."
2 code implementations • NeurIPS 2019 • Yuanzhi Li, Colin Wei, Tengyu Ma
This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing.
no code implementations • NeurIPS 2019 • Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford
Namely, we consider optimization algorithms interacting with a highly parallel gradient oracle, that is, one that can answer $\mathrm{poly}(d)$ gradient queries in parallel.
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li
Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept classes that are also learnable by kernels with similar test error.
no code implementations • 28 Apr 2019 • Sébastien Bubeck, Yuanzhi Li, Yuval Peres, Mark Sellke
We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem.
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li
Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis.
no code implementations • 29 Jan 2019 • Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei
We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang
In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations.
no code implementations • 9 Nov 2018 • Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
no code implementations • NeurIPS 2019 • Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.
no code implementations • NeurIPS 2018 • Yuanzhi Li, Yingyu Liang
Neural networks have many successful applications, but much less theoretical understanding has been gained.
2 code implementations • ICLR 2019 • Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, Tengyu Ma
Model-based reinforcement learning (RL) is considered to be a promising approach to reduce the sample complexity that hinders model-free RL.
no code implementations • ICML 2018 • Yuanzhi Li, Yoram Singer
Every regression parameter in the Lasso changes linearly as a function of the regularization value.
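The regularization path referred to here is piecewise linear and can be traced numerically; a minimal sketch with scikit-learn's lasso_path on synthetic data (the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(100)

# Coefficients along a grid of regularization values; between "kinks" each
# coefficient is a linear function of the regularization strength.
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
print(coefs.shape)  # (n_features, n_alphas)
```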
no code implementations • NeurIPS 2018 • Elad Hazan, Wei Hu, Yuanzhi Li, Zhiyuan Li
We revisit the question of reducing online learning to approximate optimization of the offline problem.
no code implementations • 22 Feb 2018 • Yuanzhi Li, Yingyu Liang
Mixtures of Linear Regressions (MLR) is an important mixture model with many applications.
no code implementations • ICML 2018 • Robert Kleinberg, Yuanzhi Li, Yang Yuan
Stochastic gradient descent (SGD) is widely used in machine learning.
no code implementations • ICML 2018 • Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li
Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy.
no code implementations • 26 Dec 2017 • Yuanzhi Li, Tengyu Ma, Hongyang Zhang
We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations.
no code implementations • NeurIPS 2018 • Zeyuan Allen-Zhu, Yuanzhi Li
We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace the Hessian-vector product computations with only gradient computations.
no code implementations • 14 Nov 2017 • Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang
The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of the regression performed on the selected k design points.
no code implementations • 3 Nov 2017 • Sébastien Bubeck, Michael B. Cohen, Yuanzhi Li
In (online) learning theory the concepts of sparsity, variance and curvature are well-understood and are routinely used to obtain refined regret and generalization bounds.
no code implementations • NeurIPS 2017 • Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, Yuanzhi Li
We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball.
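For orientation, a sketch of the classical rank-1 Frank-Wolfe iteration over the trace-norm ball (the paper's contribution is a rank-$k$ variant of this step, which is not reproduced here):

```python
import numpy as np
from scipy.sparse.linalg import svds

def frank_wolfe_trace_norm(grad_fn, shape, tau: float, steps: int = 100):
    """Classical Frank-Wolfe over {X : ||X||_* <= tau} with a rank-1 LMO."""
    X = np.zeros(shape)
    for t in range(steps):
        G = grad_fn(X)
        u, _, vt = svds(G, k=1)                 # top singular pair of the gradient
        S = -tau * np.outer(u[:, 0], vt[0])     # minimizes <G, S> over the trace-norm ball
        gamma = 2.0 / (t + 2.0)
        X = (1 - gamma) * X + gamma * S
    return X
```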
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang
We consider computationally tractable methods for the experimental design problem, where k out of n design points of dimension p are selected so that certain optimality criteria are approximately satisfied.
no code implementations • 25 Jul 2017 • Xi Chen, Yuanzhi Li, Jieming Mao
We study the active learning problem of top-$k$ ranking from multi-wise comparisons under the popular multinomial logit model.
1 code implementation • ICML 2017 • Yuanzhi Li, Yingyu Liang
Non-negative matrix factorization is a basic tool for decomposing data into the feature and weight matrices under non-negativity constraints, and in practice is often solved in the alternating minimization framework.
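A minimal sketch of the alternating-minimization framework mentioned here, solving a nonnegative least-squares problem for each factor in turn (the standard heuristic, not the provable variant studied in the paper):

```python
import numpy as np
from scipy.optimize import nnls

def nmf_alternating(Y, r: int, iters: int = 50, seed: int = 0):
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    W = np.abs(rng.standard_normal((m, r)))
    H = np.abs(rng.standard_normal((r, n)))
    for _ in range(iters):
        # Fix W and solve a nonnegative least-squares problem per column of Y for H.
        H = np.column_stack([nnls(W, Y[:, j])[0] for j in range(n)])
        # Fix H and do the same for each row of W.
        W = np.column_stack([nnls(H.T, Y[i, :])[0] for i in range(m)]).T
    return W, H
```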
no code implementations • NeurIPS 2017 • Yuanzhi Li, Yang Yuan
We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
The online problem of computing the top eigenvector is fundamental to machine learning.
no code implementations • NeurIPS 2016 • Andrej Risteski, Yuanzhi Li
In recent years, a rapidly increasing number of applications in practice require solving non-convex objectives, such as training neural networks, learning graphical models, and maximum likelihood estimation.
no code implementations • NeurIPS 2016 • Yuanzhi Li, Yingyu Liang, Andrej Risteski
Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
We solve principal component regression (PCR), up to a multiplicative accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{-1})$ black-box calls of ridge regression.
no code implementations • 26 Jul 2016 • Zeyuan Allen-Zhu, Yuanzhi Li
We provide $\textit{global}$ convergence for Oja's algorithm which is popularly used in practice but lacks theoretical understanding for $k>1$.
no code implementations • ICML 2017 • Zeyuan Allen-Zhu, Yuanzhi Li
We study $k$-GenEV, the problem of finding the top $k$ generalized eigenvectors, and $k$-CCA, the problem of finding the top $k$ vectors in canonical-correlation analysis.
no code implementations • NeurIPS 2016 • Yuanzhi Li, Andrej Risteski
The well-known maximum-entropy principle due to Jaynes, which states that, given mean parameters, the maximum-entropy distribution matching them is in an exponential family, has been very popular in machine learning due to its "Occam's razor" interpretation.
no code implementations • NeurIPS 2016 • Zeyuan Allen-Zhu, Yuanzhi Li
In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$ running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.
no code implementations • 14 Mar 2016 • Elad Hazan, Yuanzhi Li
We consider the problem of online convex optimization against an arbitrary adversary with bandit feedback, known as bandit convex optimization.
no code implementations • 6 Feb 2016 • Yuanzhi Li, Yingyu Liang, Andrej Risteski
We show that the properties only need to hold in an average sense and can be achieved by the clipping step.
1 code implementation • TACL 2018 • Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense.
4 code implementations • TACL 2016 • Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods.
no code implementations • 24 Apr 2013 • Yining Wang, Li-Wei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, Wei Chen
We show that NDCG with logarithmic discount has consistent distinguishability although it converges to the same limit for all ranking functions.
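For reference, a minimal sketch of NDCG with the logarithmic discount discussed here (the relevance scores and linear gain are illustrative assumptions):

```python
import numpy as np

def dcg(relevances):
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # logarithmic discount 1/log2(rank+1)
    return float((rel * discounts).sum())

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))
```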