Search Results for author: Yuanzhi Li

Found 108 papers, 23 papers with code

Mixture of Parrots: Experts improve memorization more than reasoning

no code implementations 24 Oct 2024 Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach

On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data.

Math Memorization
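
For readers unfamiliar with the architecture, a minimal sketch of the sparse routing that lets an MoE keep only a few experts active per token while the total expert count (and hence memorization capacity) grows; all layer sizes and the top-k value below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Many small experts, only top_k of them active per token."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=32, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```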

LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

1 code implementation 16 Oct 2024 Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi

We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA.

Math parameter-efficient fine-tuning
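
As a rough illustration of merging for skill composition (not the paper's actual method), a naive weighted average of per-skill LoRA weight deltas; it assumes every adapter stores (A, B) factors of matching shapes per layer, with delta_W = B @ A.

```python
import torch

def merge_lora_deltas(loras, weights=None):
    """Naively merge several LoRA adapters by averaging their weight deltas.

    loras: list of dicts mapping layer name -> (A, B), where delta_W = B @ A.
    Returns a dict mapping layer name -> merged dense delta_W.
    """
    if weights is None:
        weights = [1.0 / len(loras)] * len(loras)
    merged = {}
    for lora, w in zip(loras, weights):
        for name, (A, B) in lora.items():
            merged[name] = merged.get(name, 0) + w * (B @ A)
    return merged

# usage: add merged[name] onto the corresponding frozen base weight matrix
```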

Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

no code implementations 11 Oct 2024 Binghui Li, Yuanzhi Li

Specifically, we focus on a multiple classification setting, where the structured data can be composed of two types of features: the robust features, which are resistant to perturbation but sparse, and the non-robust features, which are susceptible to perturbation but dense.

Learning Theory

O1 Replication Journey: A Strategic Progress Report -- Part 1

1 code implementation 8 Oct 2024 Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, PengFei Liu

This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey.

Math scientific discovery

Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

no code implementations 2 Sep 2024 Youngseog Chung, Dhruv Malik, Jeff Schneider, Yuanzhi Li, Aarti Singh

The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts.

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

no code implementations 29 Aug 2024 Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes.

Math

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

no code implementations 29 Jul 2024 Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates?

GSM8K Math +1

How Does Overparameterization Affect Features?

no code implementations 1 Jul 2024 Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li

We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

no code implementations 22 Apr 2024 Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, ZiYi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Ranked #5 on MMR total on MRR-Benchmark (using extra training data)

Language Modelling Math +2

AgentKit: Structured LLM Reasoning with Dynamic Graphs

1 code implementation 17 Apr 2024 Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen Mcaleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom Mitchell

The chains of nodes can be designed to explicitly enforce a naturally structured "thought process".

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

1 code implementation 9 Apr 2024 Junpeng Liu, YiFan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.

Optical Character Recognition (OCR)

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

no code implementations 8 Apr 2024 Zeyuan Allen-Zhu, Yuanzhi Li

More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity.

Quantization

Provably learning a multi-head attention layer

no code implementations 6 Feb 2024 Sitan Chen, Yuanzhi Li

In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem: - Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
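
For concreteness, a small numerical sketch of a multi-head attention layer of the general form studied here, F(X) = sum_i softmax(X Theta_i X^T) X W_i, evaluated on a random {±1}^{k×d} input; the dimensions and scalings are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(X, Thetas, Ws):
    """F(X) = sum_i softmax(X Theta_i X^T) X W_i, with row-wise softmax."""
    return sum(softmax(X @ Th @ X.T) @ X @ W for Th, W in zip(Thetas, Ws))

# random ±1 example, matching the uniform {±1}^{k x d} input model
k, d, m = 8, 16, 3                       # sequence length, width, number of heads
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(k, d))
Thetas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
print(attention_layer(X, Thetas, Ws).shape)   # (k, d)
```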

TinyGSM: achieving >80% on GSM8k with small language models

no code implementations 14 Dec 2023 Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

Specifically, for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.

Arithmetic Reasoning GSM8K +2

Positional Description Matters for Transformers Arithmetic

no code implementations 22 Nov 2023 Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang

For (i), we train a small model (100M parameters) on a small dataset (300k samples) that shows remarkable aptitude at (direct, no scratchpad) 15-digit multiplication and is essentially perfect up to 12 digits, whereas usual training in this context yields a model that fails at 4-digit multiplication.

Memorization

Simple Mechanisms for Representing, Indexing and Manipulating Concepts

no code implementations 18 Oct 2023 Yuanzhi Li, Raghu Meka, Rina Panigrahy, Kulin Shah

Deep networks typically learn concepts via classifiers, which involves setting up a model and training it via gradient descent to fit the concept-labeled data.

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

no code implementations 2 Oct 2023 Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance.

Image Generation Representation Learning +1

SmartPlay: A Benchmark for LLMs as Intelligent Agents

1 code implementation 2 Oct 2023 Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li

We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents.

Minecraft Spatial Reasoning

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

no code implementations 25 Sep 2023 Zeyuan Allen-Zhu, Yuanzhi Li

This paper provides $\textbf{several key recommendations for LLM pretraining in the industry}$: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.

Question Answering Sentence +1

Textbooks Are All You Need II: phi-1.5 technical report

1 code implementation 11 Sep 2023 Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories} -- a 10 million parameter model that can produce coherent English -- and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art.

Code Generation Common Sense Reasoning +3

Efficient RLHF: Reducing the Memory Usage of PPO

no code implementations 1 Sep 2023 Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO.

Language Modelling

Length Generalization in Arithmetic Transformers

no code implementations 27 Jun 2023 Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton

We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums.

Position
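
A sketch of the relative position embedding idea credited with length generalization: the attention logits receive a learned bias that depends only on the offset j - i, so patterns learned on 5-digit sums can transfer to longer sequences. The simple per-offset bias table below is an illustrative stand-in for the specific embeddings compared in the paper.

```python
import torch

def attention_with_relative_bias(q, k, v, rel_bias):
    """q, k, v: (seq, d); rel_bias: (2*max_len - 1,) learnable per-offset biases."""
    seq, d = q.shape
    logits = q @ k.T / d ** 0.5                                   # content term
    offsets = torch.arange(seq)[None, :] - torch.arange(seq)[:, None]
    logits = logits + rel_bias[offsets + rel_bias.shape[0] // 2]  # distance-only term
    return torch.softmax(logits, dim=-1) @ v

seq, d, max_len = 15, 32, 64
q, k, v = (torch.randn(seq, d) for _ in range(3))
rel_bias = torch.zeros(2 * max_len - 1, requires_grad=True)
out = attention_with_relative_bias(q, k, v, rel_bias)            # works for any seq <= max_len
```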

The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks

no code implementations 20 Jun 2023 Yuan Cao, Difan Zou, Yuanzhi Li, Quanquan Gu

We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.

Binary Classification

Specifying and Solving Robust Empirical Risk Minimization Problems Using CVXPY

1 code implementation 9 Jun 2023 Eric Luxenberg, Dhruv Malik, Yuanzhi Li, Aarti Singh, Stephen Boyd

We consider robust empirical risk minimization (ERM), where model parameters are chosen to minimize the worst-case empirical loss when each data point varies over a given convex uncertainty set.
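
A small CVXPY sketch in the spirit of the abstract (not the paper's actual interface): robust least-absolute-deviation regression where each feature vector may move within an l-infinity ball of radius rho, which reformulates exactly as the nominal loss plus rho times the l1 norm of the weights.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = cp.Variable(d)
# worst case over ||delta||_inf <= rho of |(x + delta)^T w - y| equals |x^T w - y| + rho * ||w||_1
worst_case_loss = cp.abs(X @ w - y) + rho * cp.norm1(w)
problem = cp.Problem(cp.Minimize(cp.sum(worst_case_loss)))
problem.solve()
print(w.value)   # robust ERM solution
```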

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

no code implementations 31 May 2023 Yan Pan, Yuanzhi Li

We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose to use coordinate-wise clipping as a solution to SGD and other optimization algorithms.

Deep Learning
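
A sketch of the proposed remedy, coordinate-wise clipping, applied to a plain SGD step; the clipping threshold tau is an illustrative assumption.

```python
import torch

def sgd_step_with_coordinate_clipping(params, lr=0.1, tau=1e-2):
    """One SGD step where each gradient coordinate is clipped to [-tau, tau]."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad.clamp(min=-tau, max=tau)

# usage: after loss.backward(), call sgd_step_with_coordinate_clipping(model.parameters())
```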

Physics of Language Models: Part 1, Learning Hierarchical Language Structures

no code implementations 23 May 2023 Zeyuan Allen-Zhu, Yuanzhi Li

Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

1 code implementation 12 May 2023 Ronen Eldan, Yuanzhi Li

In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality

no code implementations 4 May 2023 Dhruv Malik, Conor Igoe, Yuanzhi Li, Aarti Singh

Motivated by this, a significant line of work has formalized settings where an action's loss is a function of the number of times that action was recently played in the prior $m$ timesteps, where $m$ corresponds to a bound on human memory capacity.

Recommendation Systems

On the Importance of Contrastive Loss in Multimodal Learning

no code implementations 7 Apr 2023 Yunwei Ren, Yuanzhi Li

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved huge success in multimodal learning, where the model tries to minimize the distance between the representations of different views (e.g., an image and its caption) of the same data point while keeping the representations of different data points away from each other.

Contrastive Learning
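
A compact sketch of the CLIP-style symmetric contrastive loss discussed above: matched image/text pairs form the diagonal of a similarity matrix and are treated as the correct class in both directions; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim); row i of each comes from the same data point."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))                # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```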

Sparks of Artificial General Intelligence: Early experiments with GPT-4

2 code implementations 22 Mar 2023 Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models.

Arithmetic Reasoning Math Word Problem Solving

The Benefits of Mixup for Feature Learning

no code implementations 15 Mar 2023 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data).

Data Augmentation
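
A minimal sketch of the Mixup augmentation analyzed in the paper: each input and its label are replaced by a convex combination with a randomly chosen partner, with mixing weight drawn from a Beta(alpha, alpha) distribution.

```python
import torch

def mixup(x, y_onehot, alpha=1.0):
    """x: (batch, ...) inputs; y_onehot: (batch, classes) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient
    perm = torch.randperm(x.size(0))                        # random partner for each example
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```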

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

1 code implementation 7 Mar 2023 Yuchen Li, Yuanzhi Li, Andrej Risteski

While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking.

Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

no code implementations NeurIPS 2023 Yue Wu, Yewen Fan, Paul Pu Liang, Amos Azaria, Yuanzhi Li, Tom M. Mitchell

Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent.

Atari Games

What Matters In The Structured Pruning of Generative Language Models?

1 code implementation 7 Feb 2023 Michael Santacroce, Zixin Wen, Yelong Shen, Yuanzhi Li

Auto-regressive large language models such as GPT-3 require enormous computational resources to use.

Text Generation

Vision Transformers provably learn spatial structure

no code implementations 13 Oct 2022 Samy Jelassi, Michael E. Sander, Yuanzhi Li

On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association.

Binary Classification Inductive Bias

Dissecting adaptive methods in GANs

no code implementations 9 Oct 2022 Samy Jelassi, David Dobre, Arthur Mensch, Yuanzhi Li, Gauthier Gidel

By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

no code implementations 22 Sep 2022 Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, Anru R. Zhang

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2.

Denoising
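
For orientation, one DDPM-style reverse (denoising) step written in terms of an estimated score, matching the score-based view the analysis covers; score_fn, the beta schedule, and the choice of noise scale are illustrative assumptions rather than the paper's construction.

```python
import torch

def ddpm_reverse_step(x_t, t, score_fn, betas):
    """One ancestral-sampling step x_t -> x_{t-1} given an estimate of grad_x log p_t(x_t)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    score = score_fn(x_t, t)                                  # estimated score at noise level t
    mean = (x_t + beta_t * score) / alpha_t ** 0.5            # posterior mean in score form
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta_t ** 0.5 * noise                       # sigma_t^2 = beta_t (a common choice)
```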

Towards Understanding Mixture of Experts in Deep Learning

2 code implementations 4 Aug 2022 Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li

To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.

Deep Learning

Towards understanding how momentum improves generalization in deep learning

no code implementations 13 Jul 2022 Samy Jelassi, Yuanzhi Li

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures.

Binary Classification Deep Learning

Learning (Very) Simple Generative Models Is Hard

no code implementations 31 May 2022 Sitan Chen, Jerry Li, Yuanzhi Li

Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem.

The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

no code implementations 12 May 2022 Zixin Wen, Yuanzhi Li

The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head.

Self-Supervised Learning

Complete Policy Regret Bounds for Tallying Bandits

no code implementations 24 Apr 2022 Dhruv Malik, Yuanzhi Li, Aarti Singh

Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.

Learning Polynomial Transformations

no code implementations 8 Apr 2022 Sitan Chen, Jerry Li, Yuanzhi Li, Anru R. Zhang

Our first main result is a polynomial-time algorithm for learning quadratic transformations of Gaussians in a smoothed setting.

Tensor Decomposition

Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs

no code implementations ICLR 2022 Sitan Chen, Jerry Li, Yuanzhi Li, Raghu Meka

Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution.

Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels

1 code implementation NeurIPS 2021 Stefani Karp, Ezra Winston, Yuanzhi Li, Aarti Singh

We therefore propose the "local signal adaptivity" (LSA) phenomenon as one explanation for the superiority of neural networks over kernel methods.

Image Classification

Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

no code implementations 1 Nov 2021 Yuanzhi Li, Ruosong Wang, Lin F. Yang

Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed.

reinforcement-learning Reinforcement Learning +1

Adam is no better than normalized SGD: Dissecting how adaptivity improves GAN performance

no code implementations 29 Sep 2021 Samy Jelassi, Arthur Mensch, Gauthier Gidel, Yuanzhi Li

We empirically show that SGDA with the same vector norm as Adam reaches performance similar to, or even better than, that of Adam.

On the One-sided Convergence of Adam-type Algorithms in Non-convex Non-concave Min-max Optimization

no code implementations 29 Sep 2021 Zehao Dou, Yuanzhi Li

To the best of our knowledge, this is the very first result which provides an empirical observation and a strict theoretical guarantee on the one-sided convergence of Adam-type algorithms in min-max optimization.

Vocal Bursts Type Prediction

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

no code implementations 25 Aug 2021 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.

Deep Learning Image Classification

LoRA: Low-Rank Adaptation of Large Language Models

62 code implementations ICLR 2022 Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

Ranked #2 on parameter-efficient fine-tuning on HellaSwag (using extra training data)

Language Modelling parameter-efficient fine-tuning
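
A minimal sketch of the LoRA idea described above: the pretrained weight stays frozen and a trainable low-rank product B @ A (with B initialized to zero) is added on top, so training starts exactly at the pretrained model; the rank and scaling here are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: delta starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```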

Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity

no code implementations 15 Jun 2021 Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, Yuanzhi Li

Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces.

Atari Games reinforcement-learning +1

Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions

no code implementations 4 Jun 2021 Zeyuan Allen-Zhu, Yuanzhi Li

Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions.

Super-Resolution

Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning

no code implementations 31 May 2021 Zixin Wen, Yuanzhi Li

We present an underlying principle called $\textbf{feature decoupling}$ to explain the effects of augmentations, where we theoretically characterize how augmentations can reduce the correlations of dense features between positive samples while keeping the correlations of sparse features intact, thereby forcing the neural networks to learn from the self-supervision of sparse features.

Contrastive Learning Self-Supervised Learning

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

1 code implementation ICLR 2021 Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar

We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability.
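
A sketch of how the sharpness that defines the Edge of Stability (the largest eigenvalue of the training-loss Hessian) can be estimated with Hessian-vector products and power iteration; the paper's observation is that under full-batch gradient descent with step size eta this quantity hovers just above 2/eta.

```python
import torch

def sharpness(loss, params, iters=50):
    """Estimate the top Hessian eigenvalue of `loss` w.r.t. `params` by power iteration."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v /= v.norm()
    for _ in range(iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)  # Hessian-vector product
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = (v @ hv).item()     # Rayleigh quotient estimate of the top eigenvalue
        v = hv / hv.norm()
    return eig

# Edge of Stability: with full-batch GD and step size eta, sharpness(loss, params) ~ 2 / eta.
```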

When Is Generalizable Reinforcement Learning Tractable?

no code implementations NeurIPS 2021 Dhruv Malik, Yuanzhi Li, Pradeep Ravikumar

Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment.

reinforcement-learning Reinforcement Learning +2

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

no code implementations 17 Dec 2020 Zeyuan Allen-Zhu, Yuanzhi Li

Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the "dark knowledge" is hidden in the outputs of the ensemble and can be used in distillation.

Deep Learning Knowledge Distillation +1

A law of robustness for two-layers neural networks

no code implementations 30 Sep 2020 Sébastien Bubeck, Yuanzhi Li, Dheeraj Nagaraj

We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$ where $n$ is the number of datapoints.

Vocal Bursts Valence Prediction

Feature Purification: How Adversarial Training Performs Robust Deep Learning

no code implementations 20 May 2020 Zeyuan Allen-Zhu, Yuanzhi Li

Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.

Deep Learning

Making Method of Moments Great Again? -- How can GANs learn distributions

no code implementations 9 Mar 2020 Yuanzhi Li, Zehao Dou

In GANs, the training of the generator usually stops when the discriminator can no longer distinguish the generator's output from the set of training examples.

Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

no code implementations 13 Jan 2020 Zeyuan Allen-Zhu, Yuanzhi Li

On the technical side, we show for every input dimension $d > 0$, there is a concept class of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, through learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction."

Binary Classification Deep Learning

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

2 code implementations NeurIPS 2019 Yuanzhi Li, Colin Wei, Tengyu Ma

This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing.

Complexity of Highly Parallel Non-Smooth Convex Optimization

no code implementations NeurIPS 2019 Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford

Namely, we consider optimization algorithms interacting with a highly parallel gradient oracle, that is, one that can answer $\mathrm{poly}(d)$ gradient queries in parallel.

What Can ResNet Learn Efficiently, Going Beyond Kernels?

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li

Recently, there has been an influential line of work relating neural networks to kernels in the over-parameterized regime, proving they can learn a certain concept class that is also learnable by kernels with similar test error.

One-Shot Learning

Improved Path-length Regret Bounds for Bandits

no code implementations 29 Jan 2019 Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei

We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang

In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations.

Learning Theory Vocal Bursts Valence Prediction

A Convergence Theory for Deep Learning via Over-Parameterization

no code implementations 9 Nov 2018 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

Deep Learning

On the Convergence Rate of Training Recurrent Neural Networks

no code implementations NeurIPS 2019 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.

The Well-Tempered Lasso

no code implementations ICML 2018 Yuanzhi Li, Yoram Singer

Every regression parameter in the Lasso changes linearly as a function of the regularization value.

regression

The Well Tempered Lasso

no code implementations 8 Jun 2018 Yuanzhi Li, Yoram Singer

Every regression parameter in the Lasso changes linearly as a function of the regularization value.

regression

Online Improper Learning with an Approximation Oracle

no code implementations NeurIPS 2018 Elad Hazan, Wei Hu, Yuanzhi Li, Zhiyuan Li

We revisit the question of reducing online learning to approximate optimization of the offline problem.

Learning Mixtures of Linear Regressions with Nearly Optimal Complexity

no code implementations 22 Feb 2018 Yuanzhi Li, Yingyu Liang

Mixtures of Linear Regressions (MLR) is an important mixture model with many applications.

Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

no code implementations ICML 2018 Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li

Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy.

Multi-Armed Bandits

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

no code implementations 26 Dec 2017 Yuanzhi Li, Tengyu Ma, Hongyang Zhang

We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations.

Neon2: Finding Local Minima via First-Order Oracles

no code implementations NeurIPS 2018 Zeyuan Allen-Zhu, Yuanzhi Li

We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations.

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach

no code implementations 14 Nov 2017 Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang

The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of the regression on the selected k design points.

Experimental Design

Sparsity, variance and curvature in multi-armed bandits

no code implementations 3 Nov 2017 Sébastien Bubeck, Michael B. Cohen, Yuanzhi Li

In (online) learning theory the concepts of sparsity, variance and curvature are well-understood and are routinely used to obtain refined regret and generalization bounds.

Generalization Bounds Learning Theory +1

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

no code implementations NeurIPS 2017 Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, Yuanzhi Li

We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball.

Near-Optimal Design of Experiments via Regret Minimization

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang

We consider computationally tractable methods for the experimental design problem, where k out of n design points of dimension p are selected so that certain optimality criteria are approximately satisfied.

Experimental Design

A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model

no code implementations 25 Jul 2017 Xi Chen, Yuanzhi Li, Jieming Mao

We study the active learning problem of top-$k$ ranking from multi-wise comparisons under the popular multinomial logit model.

Active Learning

Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations

1 code implementation ICML 2017 Yuanzhi Li, Yingyu Liang

Non-negative matrix factorization is a basic tool for decomposing data into the feature and weight matrices under non-negativity constraints, and in practice is often solved in the alternating minimization framework.

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

no code implementations NeurIPS 2017 Yuanzhi Li, Yang Yuan

We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization.

Vocal Bursts Valence Prediction

Algorithms and matching lower bounds for approximately-convex optimization

no code implementations NeurIPS 2016 Andrej Risteski, Yuanzhi Li

In recent years, a rapidly increasing number of applications in practice require solving non-convex objectives, such as training neural networks, learning graphical models, and maximum likelihood estimation.

Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

no code implementations NeurIPS 2016 Yuanzhi Li, Yingyu Liang, Andrej Risteski

Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints.

Faster Principal Component Regression and Stable Matrix Chebyshev Approximation

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li

We solve principal component regression (PCR), up to a multiplicative accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{-1})$ black-box calls of ridge regression.

regression

First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate

no code implementations 26 Jul 2016 Zeyuan Allen-Zhu, Yuanzhi Li

We provide $\textit{global}$ convergence for Oja's algorithm which is popularly used in practice but lacks theoretical understanding for $k>1$.
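
A sketch of the streaming k-PCA setting: Oja's algorithm maintains an orthonormal d-by-k basis and applies one rank-one update per sample; the step size and re-orthonormalization via QR are standard illustrative choices rather than the exact variant analyzed.

```python
import numpy as np

def oja_streaming_kpca(samples, d, k, eta=0.01, seed=0):
    """Estimate the top-k principal subspace from a stream of d-dimensional samples."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for x in samples:
        x = np.asarray(x)
        Q = Q + eta * np.outer(x, x @ Q)               # Oja's rank-one update
        Q, _ = np.linalg.qr(Q)                         # re-orthonormalize the basis
    return Q                                           # columns approximate the top-k subspace
```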

Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition

no code implementations ICML 2017 Zeyuan Allen-Zhu, Yuanzhi Li

We study $k$-GenEV, the problem of finding the top $k$ generalized eigenvectors, and $k$-CCA, the problem of finding the top $k$ vectors in canonical-correlation analysis.

LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain

no code implementations NeurIPS 2016 Zeyuan Allen-Zhu, Yuanzhi Li

In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$ running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.

Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods

no code implementations NeurIPS 2016 Yuanzhi Li, Andrej Risteski

The well known maximum-entropy principle due to Jaynes, which states that given mean parameters, the maximum entropy distribution matching them is in an exponential family, has been very popular in machine learning due to its "Occam's razor" interpretation.

An optimal algorithm for bandit convex optimization

no code implementations 14 Mar 2016 Elad Hazan, Yuanzhi Li

We consider the problem of online convex optimization against an arbitrary adversary with bandit feedback, known as bandit convex optimization.

Recovery guarantee of weighted low-rank approximation via alternating minimization

no code implementations 6 Feb 2016 Yuanzhi Li, Yingyu Liang, Andrej Risteski

We show that the properties only need to hold in an average sense and can be achieved by the clipping step.

Matrix Completion

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

1 code implementation TACL 2018 Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense.

Information Retrieval Retrieval +1

A Latent Variable Model Approach to PMI-based Word Embeddings

4 code implementations TACL 2016 Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods.

Word Embeddings

A Theoretical Analysis of NDCG Type Ranking Measures

no code implementations 24 Apr 2013 Yining Wang, Li-Wei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, Wei Chen

We show that NDCG with logarithmic discount has consistent distinguishability although it converges to the same limit for all ranking functions.

Vocal Bursts Type Prediction
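
For reference, NDCG with the logarithmic discount studied in the paper; the exponential gain 2^rel - 1 is the common convention and an assumption here.

```python
import numpy as np

def dcg_log_discount(relevances):
    """Discounted cumulative gain with the logarithmic discount 1/log2(rank + 1)."""
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2.0 ** relevances - 1.0) / np.log2(ranks + 1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_log_discount(np.sort(relevances)[::-1])
    return dcg_log_discount(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))   # ranking quality in [0, 1]
```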
