no code implementations • 7 Feb 2025 • Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei
Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher.
1 code implementation • 1 Jan 2025 • Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing.
no code implementations • 9 Dec 2024 • Eshaan Nichani, Jason D. Lee, Alberto Bietti
We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count.
no code implementations • 26 Nov 2024 • Hengyu Fu, ZiHao Wang, Eshaan Nichani, Jason D. Lee
This can be viewed as a nonlinear generalization of the multi-index model \citep{damian2022neural}, and also an expansion upon previous work that focused only on a single nonlinear feature, i.e. $r = 1$ \citep{nichani2023provable, wang2023learning}.
no code implementations • 26 Nov 2024 • Zihan Zhang, Jason D. Lee, Simon S. Du, Yuxin Chen
This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees.
no code implementations • 31 Oct 2024 • Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason D. Lee
A key difficulty is that much of an optimizer's behavior is implicitly determined by complex oscillatory dynamics, referred to as the "edge of stability."
no code implementations • 30 Oct 2024 • Yunwei Ren, Zixuan Wang, Jason D. Lee
Transformers have excelled in natural language modeling and one reason behind this success is their exceptional ability to combine contextual information and global knowledge.
no code implementations • 13 Oct 2024 • Yunwei Ren, Jason D. Lee
Based on the theory of information exponent, when the lowest degree is $2L$, recovering the directions requires $d^{2L-1}\mathrm{poly}(P)$ samples, and when the lowest degree is $2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms.
1 code implementation • 7 Oct 2024 • Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, Ernest K. Ryu
Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.
1 code implementation • 6 Oct 2024 • Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun
Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop.
no code implementations • 1 Oct 2024 • Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni
We study the problem of learning an approximate equilibrium in the offline multi-agent reinforcement learning (MARL) setting.
no code implementations • 18 Jul 2024 • Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster
Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model plateaus or degrades over the course of the alignment process.
no code implementations • 28 Jun 2024 • Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee
Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning.
no code implementations • 12 Jun 2024 • Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee
Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow.
no code implementations • 11 Jun 2024 • Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee
The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information.
no code implementations • 3 Jun 2024 • Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \textstyle\sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data in $\mathbb{R}^d$, where the unknown link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ has information exponent $p$ (defined as the lowest degree in the Hermite expansion).
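As a quick illustration of the information exponent $p$ defined above (a standard textbook example, not taken from the paper): expanding the link in the probabilists' Hermite basis $\{\mathrm{He}_j\}$ and reading off the lowest nonconstant degree gives
$$
\sigma_*(z) = z^2 = \mathrm{He}_2(z) + \mathrm{He}_0(z) \;\Rightarrow\; p = 2,
\qquad
\sigma_*(z) = z^3 = \mathrm{He}_3(z) + 3\,\mathrm{He}_1(z) \;\Rightarrow\; p = 1.
$$
Larger $p$ corresponds to a weaker low-degree signal about $\boldsymbol{\theta}$ and hence a harder estimation problem.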
3 code implementations • 25 Apr 2024 • Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models.
1 code implementation • 12 Apr 2024 • Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution.
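A minimal sketch of the dataset-reset rule described above: with some probability the next rollout starts from a state stored in the offline preference data instead of the initial state distribution. The names `sample_start_state` and `reset_prob` are hypothetical illustrations, not DR-PO's actual interface.

```python
import random

def sample_start_state(offline_states, initial_state_sampler, reset_prob=0.5):
    """Toy dataset-reset rule: with probability `reset_prob`, start the next
    rollout from a state observed in the offline preference dataset;
    otherwise start from the usual initial state distribution."""
    if offline_states and random.random() < reset_prob:
        return random.choice(offline_states)  # reset to an informative offline state
    return initial_state_sampler()            # standard reset to the initial distribution
```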
no code implementations • 15 Mar 2024 • Zihan Zhang, Jason D. Lee, Yuxin Chen, Simon S. Du
A recent line of work showed that regret bounds in reinforcement learning (RL) can be (nearly) independent of the planning horizon, a.k.a. horizon-free bounds.
no code implementations • 8 Mar 2024 • Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna
Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation.
no code implementations • 5 Mar 2024 • Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee
Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second-order ones in the case of linear regression.
1 code implementation • 22 Feb 2024 • Eshaan Nichani, Alex Damian, Jason D. Lee
The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens.
1 code implementation • 19 Feb 2024 • Uijeong Jang, Jason D. Lee, Ernest K. Ryu
Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLMs), but our theoretical understanding of LoRA has been limited.
1 code implementation • 18 Feb 2024 • Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen
In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard.
1 code implementation • 15 Feb 2024 • James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
no code implementations • 28 Jan 2024 • Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy
Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted.
1 code implementation • 19 Jan 2024 • Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration.
no code implementations • 13 Dec 2023 • Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao
Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error.
no code implementations • 8 Dec 2023 • Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee
Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound.
1 code implementation • 30 Nov 2023 • Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently long, it suddenly transitions to perfect test accuracy.
no code implementations • 23 Nov 2023 • ZiHao Wang, Eshaan Nichani, Jason D. Lee
Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time.
no code implementations • 20 Nov 2023 • Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee
We study CVaR RL in low-rank MDPs with nonlinear function approximation.
1 code implementation • 14 Nov 2023 • Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation.
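A toy sketch of the retrieval-then-verify pattern that retrieval-based speculative decoding relies on: a datastore proposes a draft continuation for the current context, and the target model keeps only the prefix it agrees with. The datastore layout, the greedy acceptance rule, and all function names here are simplifying assumptions for illustration, not REST's implementation.

```python
from collections import defaultdict

def build_datastore(corpus_tokens, context_len=2):
    """Map each length-`context_len` context in the corpus to the tokens that followed it."""
    store = defaultdict(list)
    for i in range(len(corpus_tokens) - context_len):
        key = tuple(corpus_tokens[i:i + context_len])
        store[key].append(corpus_tokens[i + context_len])
    return store

def generate(target_next_token, prompt, store, context_len=2, max_new=20, draft_len=4):
    """Draft tokens by retrieval, then keep only the prefix the target model agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Retrieve a draft continuation for the current context (empty if unseen).
        ctx, draft = tuple(out[-context_len:]), []
        for _ in range(draft_len):
            cands = store.get(ctx)
            if not cands:
                break
            draft.append(cands[0])
            ctx = ctx[1:] + (draft[-1],)
        # Verify: accept drafted tokens while they match the target model's greedy choice.
        accepted = 0
        for tok in draft:
            if target_next_token(out) == tok:
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted == 0:
            out.append(target_next_token(out))  # fall back to one normal decoding step
    return out
```

With a fast datastore lookup, every accepted draft token saves one full forward pass of the target model, which is where the speedup comes from.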
no code implementations • 25 Jul 2023 • Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du
While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally.
1 code implementation • 7 Jul 2023 • Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.
no code implementations • 5 Jul 2023 • Tianle Cai, Kaixuan Huang, Jason D. Lee, Mengdi Wang
However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of the demonstrations.
no code implementations • NeurIPS 2023 • Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee
We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity.
no code implementations • 29 May 2023 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals.
1 code implementation • 28 May 2023 • Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su
This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime.
2 code implementations • NeurIPS 2023 • Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
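A minimal sketch of why zeroth-order approaches sidestep this memory cost: a two-point (SPSA-style) gradient estimate needs only forward passes with a shared random perturbation, so no activations are stored for backpropagation. This generic estimator is shown for illustration and is not the paper's exact procedure.

```python
import numpy as np

def zeroth_order_grad(loss_fn, params, eps=1e-3, seed=0):
    """Two-point (SPSA-style) gradient estimate: only needs two loss evaluations.

    loss_fn: callable mapping a parameter vector to a scalar loss.
    params : 1-D numpy array of parameters.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)            # shared random direction
    loss_plus = loss_fn(params + eps * z)            # forward pass 1
    loss_minus = loss_fn(params - eps * z)           # forward pass 2
    return (loss_plus - loss_minus) / (2 * eps) * z  # directional-derivative estimate

# Example: one SGD-style step on a toy quadratic loss.
theta = np.ones(5)
loss = lambda w: float(np.sum(w ** 2))
theta -= 0.1 * zeroth_order_grad(loss, theta)
```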
no code implementations • 24 May 2023 • Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE.
1 code implementation • 8 May 2023 • Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee
Motivated by this observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
no code implementations • 3 Mar 2023 • Zhuoqing Song, Jason D. Lee, Zhuoran Yang
Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game.
no code implementations • 22 Feb 2023 • Hanlin Zhu, Ruosong Wang, Jason D. Lee
Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large.
no code implementations • 9 Feb 2023 • Hadi Daneshmand, Jason D. Lee, Chi Jin
Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures.
1 code implementation • 30 Jan 2023 • Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos
We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.
no code implementations • 27 Jan 2023 • Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models.
no code implementations • 7 Dec 2022 • Zihan Wang, Jason D. Lee, Qi Lei
Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy.
no code implementations • 13 Oct 2022 • Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan
When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart).
1 code implementation • 30 Sep 2022 • Alex Damian, Eshaan Nichani, Jason D. Lee
Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.
no code implementations • 12 Jul 2022 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
We show that given a realizable model class, the sample complexity of learning the near optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces.
no code implementations • 30 Jun 2022 • Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi
Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun
We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space.
no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun
We study Reinforcement Learning for partially observable dynamical systems using function approximation.
1 code implementation • 8 Jun 2022 • Eshaan Nichani, Yu Bai, Jason D. Lee
Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own.
no code implementations • 3 Jun 2022 • Wenhao Zhan, Jason D. Lee, Zhuoran Yang
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
no code implementations • 18 May 2022 • Itay Safran, Gal Vardi, Jason D. Lee
We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting.
no code implementations • 29 Mar 2022 • Jiaqi Yang, Qi Lei, Jason D. Lee, Simon S. Du
We give novel algorithms for multi-task and lifelong linear bandits with shared representation.
no code implementations • 9 Feb 2022 • Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability).
no code implementations • 4 Dec 2021 • Itay Safran, Jason D. Lee
Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities.
no code implementations • 18 Oct 2021 • Kurtland Chua, Qi Lei, Jason D. Lee
To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task.
no code implementations • 15 Oct 2021 • Xinyi Chen, Edgar Minasyan, Jason D. Lee, Elad Hazan
The theory of deep learning focuses almost exclusively on supervised learning, non-convex optimization using stochastic gradient descent, and overparametrized neural networks.
no code implementations • 29 Sep 2021 • DiJia Su, Jason D. Lee, John Mulvey, H. Vincent Poor
In the high-support region (low uncertainty), we encourage the policy with an aggressive update.
no code implementations • ICLR 2022 • Baihe Huang, Jason D. Lee, Zhaoran Wang, Zhuoran Yang
In the {coordinated} setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm.
no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions.
no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problems and two-layer neural network with polynomial activation bandit problem.
no code implementations • 6 Jul 2021 • Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei
Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning.
no code implementations • 23 Jun 2021 • Qi Lei, Wei Hu, Jason D. Lee
Transfer learning is essential when sufficient data is available from the source domain but labeled data from the target domain is scarce.
no code implementations • NeurIPS 2021 • Alex Damian, Tengyu Ma, Jason D. Lee
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
no code implementations • 24 May 2021 • Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi
These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.
no code implementations • NeurIPS 2021 • Kurtland Chua, Qi Lei, Jason D. Lee
Representation learning has been widely studied in the context of meta-learning, enabling rapid learning of new tasks through shared representations.
no code implementations • 19 Mar 2021 • Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang
The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the optimal $Q$-function and the optimal $V$-function are linear in some known feature space.
no code implementations • 23 Feb 2021 • DiJia Su, Jason D. Lee, John M. Mulvey, H. Vincent Poor
We consider a setting that lies between pure offline reinforcement learning (RL) and pure online RL called deployment constrained RL in which the number of policy deployments for data sampling is limited.
no code implementations • 22 Feb 2021 • Tianle Cai, Ruiqi Gao, Jason D. Lee, Qi Lei
In this work, we propose a provably effective framework for domain adaptation based on label propagation.
no code implementations • 17 Feb 2021 • Yulai Zhao, Yuandong Tian, Jason D. Lee, Simon S. Du
Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces.
1 code implementation • NeurIPS 2020 • Yihong Gu, Weizhong Zhang, Cong Fang, Jason D. Lee, Tong Zhang
With the help of a new technique called {\it neural network grafting}, we demonstrate that even during the entire training process, feature distributions of differently initialized networks remain similar at each layer.
no code implementations • NeurIPS 2020 • Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang
The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$.
no code implementations • NeurIPS 2020 • Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge
We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor when $m = O^*(r^{2.5l}\log d)$.
no code implementations • ICLR 2021 • Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du
For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
no code implementations • 12 Oct 2020 • Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason D. Lee, Sham Kakade, Huan Wang, Caiming Xiong
A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
1 code implementation • NeurIPS 2020 • Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Li-Wei Wang, Jason D. Lee
In this paper, we conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets"), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.
no code implementations • NeurIPS 2020 • Jason D. Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, Zheng Yu
Leverage score sampling is a powerful technique that originates from theoretical computer science, which can be used to speed up a large number of fundamental problems, e.g. linear regression, linear programming, semi-definite programming, the cutting plane method, graph sparsification, maximum matching and max-flow.
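A minimal sketch of row leverage scores and the row sampling they induce, following the textbook definition (squared row norms of an orthonormal basis for the column space); this is background for the technique named above, not the paper's algorithm.

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores of A: squared row norms of an orthonormal basis for col(A)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD; columns of U span col(A)
    return np.sum(U ** 2, axis=1)

def sample_rows(A, m, seed=0):
    """Sample (and reweight) m rows of A with probabilities proportional to leverage scores."""
    rng = np.random.default_rng(seed)
    scores = leverage_scores(A)
    p = scores / scores.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=p)
    return A[idx] / np.sqrt(m * p[idx, None])        # standard rescaling keeps A^T A approximately
```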
no code implementations • NeurIPS 2021 • Jason D. Lee, Qi Lei, Nikunj Saunshi, Jiacheng Zhuo
Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data to learn useful semantic representations.
no code implementations • NeurIPS 2020 • Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry
We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks".
no code implementations • 3 Jul 2020 • Cong Fang, Jason D. Lee, Pengkun Yang, Tong Zhang
This new representation overcomes the degenerate situation where each middle layer essentially collapses to a single meaningful hidden unit, and further leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization.
no code implementations • NeurIPS 2020 • Minshuo Chen, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming Xiong, Richard Socher
When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$.
no code implementations • NeurIPS 2020 • Kaiyi Ji, Jason D. Lee, Yingbin Liang, H. Vincent Poor
Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational cost because it updates all model parameters over both the inner loop of task-specific adaptation and the outer loop of meta-initialization training.
1 code implementation • 15 Jun 2020 • Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma
We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms.
no code implementations • 5 Apr 2020 • Xi Chen, Jason D. Lee, He Li, Yun Yang
To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace.
no code implementations • 23 Mar 2020 • Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, Qiang Liu
Recently, Liu et al. [19] proposed a splitting steepest descent (S2D) method that jointly optimizes the neural parameters and architectures based on progressively growing network structures by splitting neurons into multiple copies in a steepest descent fashion.
no code implementations • ICLR 2021 • Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei
First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation.
1 code implementation • 20 Feb 2020 • Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro
We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
no code implementations • 17 Feb 2020 • Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang
2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\mathrm{dim}_E\right)$ is tight even in the agnostic setting.
no code implementations • NeurIPS 2019 • Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning.
no code implementations • 22 Nov 2019 • Maziar Sanjabi, Sina Baharlouei, Meisam Razaviyayn, Jason D. Lee
We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components.
no code implementations • ICML 2020 • Qi Lei, Jason D. Lee, Alexandros G. Dimakis, Constantinos Daskalakis
Generative adversarial networks (GANs) are a widely used framework for learning generative models.
no code implementations • ICLR 2020 • Yu Bai, Jason D. Lee
Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by the Neural Tangent Kernel (NTK).
2 code implementations • ICML 2020 • Ashok Vardhan Makkuva, Amirhossein Taghvaei, Sewoong Oh, Jason D. Lee
Building upon recent advances in the field of input convex neural networks, we propose a new framework where the gradient of one convex function represents the optimal transport mapping.
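A minimal sketch of the kind of input convex neural network this builds on: a convex potential is parameterized by keeping the hidden-path weights nonnegative, and its gradient with respect to the input is read off as a candidate transport map. The layer layout, sizes, and the softplus reparameterization of the nonnegativity constraints are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input convex network: f(x) is convex in x because the z-path
    weights are kept nonnegative and the activation is convex and nondecreasing."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.Wx0 = nn.Linear(dim, hidden)
        self.Wx1 = nn.Linear(dim, hidden)
        self.Wz1 = nn.Linear(hidden, hidden, bias=False)  # made nonnegative via softplus below
        self.Wx2 = nn.Linear(dim, 1)
        self.Wz2 = nn.Linear(hidden, 1, bias=False)       # made nonnegative via softplus below

    def forward(self, x):
        z = F.softplus(self.Wx0(x))
        z = F.softplus(F.linear(z, F.softplus(self.Wz1.weight)) + self.Wx1(x))
        return F.linear(z, F.softplus(self.Wz2.weight)) + self.Wx2(x)

# The candidate optimal transport map is the gradient of the learned convex potential.
x = torch.randn(8, 2, requires_grad=True)
f = ICNN(dim=2)
T_x = torch.autograd.grad(f(x).sum(), x)[0]   # transport map evaluated at x
```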
no code implementations • 1 Aug 2019 • Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces.
no code implementations • NeurIPS 2019 • Ruiqi Gao, Tianle Cai, Haochuan Li, Li-Wei Wang, Cho-Jui Hsieh, Jason D. Lee
Neural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network.
1 code implementation • NeurIPS 2019 • Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning.
no code implementations • 17 May 2019 • Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry
With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization lead to margin maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models.
1 code implementation • NeurIPS 2019 • Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn
In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$-first-order stationary point of the game can be computed when one of the players' objectives can be optimized to global optimality efficiently.
no code implementations • 7 Dec 2018 • Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee
In this short note, we consider the problem of solving a min-max zero-sum game.
no code implementations • NeurIPS 2018 • Sham M. Kakade, Jason D. Lee
The Cheap Gradient Principle (Griewank, 2008) --- the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures.
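A small illustration of the principle quoted above in the reverse-mode autodiff setting: one backward pass returns all $d$ partial derivatives at a cost comparable to the forward evaluation. The function and sizes below are arbitrary, and the timing comparison is only meant to be read qualitatively; this is not the paper's construction.

```python
import time
import torch

def f(x):
    # An arbitrary scalar-valued function of a high-dimensional input.
    return torch.sum(torch.sin(x) * torch.exp(-x ** 2))

x = torch.randn(1_000_000, requires_grad=True)

t0 = time.perf_counter(); y = f(x); t1 = time.perf_counter()
y.backward()                                   # one reverse sweep yields all 1e6 partials
t2 = time.perf_counter()

print(f"forward: {t1 - t0:.4f}s, backward: {t2 - t1:.4f}s")  # same order of magnitude
```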
no code implementations • 9 Nov 2018 • Simon S. Du, Jason D. Lee, Haochuan Li, Li-Wei Wang, Xiyu Zhai
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex.
no code implementations • NeurIPS 2019 • Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma
We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.
no code implementations • 23 Sep 2018 • Sham Kakade, Jason D. Lee
The Cheap Gradient Principle (Griewank 2008) --- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself --- is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures.
no code implementations • NeurIPS 2018 • Simon S. Du, Wei Hu, Jason D. Lee
Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization.
no code implementations • NeurIPS 2018 • Shiyu Liang, Ruoyu Sun, Jason D. Lee, R. Srikant
One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima.
1 code implementation • 20 Apr 2018 • Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee
This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity?
no code implementations • 5 Mar 2018 • Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry
We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails.
1 code implementation • ICML 2018 • Simon S. Du, Jason D. Lee
We provide new theoretical insights on why over-parametrization is effective in learning neural networks.
no code implementations • NeurIPS 2018 • Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, Jason D. Lee
A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions.
no code implementations • ICLR 2018 • Chenwei Wu, Jiajun Luo, Jason D. Lee
Deep learning models can be efficiently optimized via stochastic gradient descent, but there is little theoretical evidence to support this.
no code implementations • ICLR 2018 • Xuanqing Liu, Jason D. Lee, Cho-Jui Hsieh
Solving this subproblem is non-trivial---existing methods have only sub-linear convergence rate.
no code implementations • ICML 2018 • Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, Aarti Singh
We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned.
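A direct numpy rendering of the model written above, $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$ with $\sigma$ the ReLU and $\mathbf{Z}_j$ the non-overlapping patches; the patch shapes in the example are arbitrary.

```python
import numpy as np

def forward(Z, w, a):
    """f(Z, w, a) = sum_j a_j * relu(w^T Z_j).

    Z: (num_patches, patch_dim) array of non-overlapping input patches Z_j.
    w: (patch_dim,) shared convolutional filter.
    a: (num_patches,) output-layer weights.
    """
    pre = Z @ w                              # w^T Z_j for every patch j
    return float(a @ np.maximum(pre, 0.0))   # sum_j a_j * ReLU(w^T Z_j)

# Example with random patches.
Z = np.random.randn(4, 9)   # 4 non-overlapping 3x3 patches, flattened
w = np.random.randn(9)
a = np.random.randn(4)
print(forward(Z, w, a))
```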
no code implementations • ICLR 2018 • Rong Ge, Jason D. Lee, Tengyu Ma
All global minima of $G$ correspond to the ground truth parameters.
no code implementations • 20 Oct 2017 • Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, Benjamin Recht
We establish that first-order methods avoid saddle points for almost all initializations.
no code implementations • ICLR 2018 • Simon S. Du, Jason D. Lee, Yuandong Tian
We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches.
no code implementations • 28 Aug 2017 • Xuanqing Liu, Cho-Jui Hsieh, Jason D. Lee, Yuekai Sun
We propose a fast proximal Newton-type algorithm for minimizing regularized finite sums that returns an $\epsilon$-suboptimal point in $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS, where $n$ is number of samples, $d$ is feature dimension, and $\kappa$ is the condition number.
no code implementations • 16 Jul 2017 • Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee
In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set.
no code implementations • NeurIPS 2017 • Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape.
no code implementations • 26 Apr 2017 • Adel Javanmard, Jason D. Lee
By duality between hypotheses testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters.
no code implementations • 27 Oct 2016 • Xi Chen, Jason D. Lee, Xin T. Tong, Yichen Zhang
Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal.
no code implementations • 17 Oct 2016 • Qiang Liu, Jason D. Lee
Importance sampling is widely used in machine learning and statistics, but its power is limited by the restriction of using simple proposals for which the importance weights can be tractably calculated.
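For concreteness, a minimal self-normalized importance sampling estimator with a simple Gaussian proposal, of the kind whose limitations the abstract describes; the toy target and proposal are illustrative choices.

```python
import numpy as np

def importance_estimate(f, log_p, log_q, sample_q, n=10_000, seed=0):
    """Estimate E_p[f(x)] using samples from a simple proposal q via self-normalized weights."""
    rng = np.random.default_rng(seed)
    x = sample_q(rng, n)
    w = np.exp(log_p(x) - log_q(x))       # importance weights p(x)/q(x), up to constants
    w /= w.sum()                          # self-normalization (p may be unnormalized)
    return float(np.sum(w * f(x)))

# Toy example: target p = N(2, 1), proposal q = N(0, 2^2), estimate E_p[x] (should be near 2).
log_p = lambda x: -0.5 * (x - 2.0) ** 2
log_q = lambda x: -0.5 * (x / 2.0) ** 2
est = importance_estimate(lambda x: x, log_p, log_q,
                          lambda rng, n: 2.0 * rng.standard_normal(n))
print(est)
```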
no code implementations • 10 Oct 2016 • Jialei Wang, Jason D. Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro
Sketching techniques have become popular for scaling up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, while still maintaining the statistical power of big data.
no code implementations • 25 May 2016 • Michael I. Jordan, Jason D. Lee, Yun Yang
CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation and Bayesian inference.
no code implementations • NeurIPS 2016 • Rong Ge, Jason D. Lee, Tengyu Ma
Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems.
no code implementations • 16 Feb 2016 • Jason D. Lee, Max Simchowitz, Michael I. Jordan, Benjamin Recht
We show that gradient descent converges to a local minimizer, almost surely with random initialization.
no code implementations • 10 Feb 2016 • Qiang Liu, Jason D. Lee, Michael I. Jordan
We derive a new discrepancy statistic for measuring differences between two probability distributions based on combining Stein's identity with the reproducing kernel Hilbert space theory.
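For reference, the kernelized Stein discrepancy arising from this combination is commonly written as follows, with target score $s_p(x) = \nabla_x \log p(x)$ and a positive-definite kernel $k$; this is the standard textbook form, stated here as a reminder of the object the abstract refers to rather than a quote from the paper:
$$
\mathbb{S}(q, p) \;=\; \mathbb{E}_{x, x' \sim q}\big[u_p(x, x')\big], \quad
u_p(x, x') \;=\; s_p(x)^\top s_p(x')\,k(x, x') + s_p(x)^\top \nabla_{x'} k(x, x') + \nabla_{x} k(x, x')^\top s_p(x') + \mathrm{tr}\big(\nabla_{x}\nabla_{x'} k(x, x')\big).
$$
Because the expectation is only over samples from $q$ and $s_p$ is invariant to the normalizing constant of $p$, the statistic can be computed from a sample and an unnormalized density.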
no code implementations • NeurIPS 2015 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor
Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data.
no code implementations • 25 Nov 2015 • Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, Michael I. Jordan
For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$.
no code implementations • 13 Oct 2015 • Yuchen Zhang, Jason D. Lee, Michael I. Jordan
The sample complexity and the time complexity of the presented method are polynomial in the input dimension and in $(1/\epsilon,\log(1/\delta), F(k, L))$, where $F(k, L)$ is a function depending on $(k, L)$ and on the activation function, independent of the number of neurons.
no code implementations • 27 Jul 2015 • Jason D. Lee, Qihang Lin, Tengyu Ma, Tianbao Yang
We also prove a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper.
no code implementations • 30 Jun 2015 • Jason D. Lee
We present the Condition-on-Selection method that allows for valid selective inference, and study its application to the lasso, and several other selection algorithms.
no code implementations • 14 Mar 2015 • Jason D. Lee, Yuekai Sun, Qiang Liu, Jonathan E. Taylor
We devise a one-shot approach to distributed sparse regression in the high-dimensional setting.
1 code implementation • NeurIPS 2014 • Austin R. Benson, Jason D. Lee, Bartek Rajwa, David F. Gleich
We demonstrate the efficacy of these algorithms on terabyte-sized synthetic matrices and real-world matrices from scientific computing and bioinformatics.
no code implementations • NeurIPS 2014 • Jason D. Lee, Jonathan E. Taylor
We develop a framework for post model selection inference, via marginal screening, in linear regression.
no code implementations • NeurIPS 2013 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor
Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.
no code implementations • NeurIPS 2013 • Jason D. Lee, Ran Gilad-Bachrach, Rich Caruana
In the mixture models problem it is assumed that there are $K$ distributions $\theta_{1},\ldots,\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients.
no code implementations • 25 Nov 2013 • Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor
We develop a general approach to valid inference after model selection.
no code implementations • 31 May 2013 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor
Regularized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.
1 code implementation • 7 Jun 2012 • Jason D. Lee, Yuekai Sun, Michael A. Saunders
We generalize Newton-type methods for minimizing smooth functions to handle a sum of two convex functions: a smooth function and a nonsmooth function with a simple proximal mapping.
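A sketch of the generic proximal Newton step for the composite objective $g(x) + h(x)$ (smooth $g$, nonsmooth $h$ with a simple proximal mapping) that this line of work studies; the notation here is generic rather than the paper's:
$$
x_{k+1} \;=\; \arg\min_{y} \;\Big\{ \nabla g(x_k)^\top (y - x_k) \;+\; \tfrac{1}{2}\,(y - x_k)^\top H_k\,(y - x_k) \;+\; h(y) \Big\},
$$
where $H_k \approx \nabla^2 g(x_k)$. With $H_k = \tfrac{1}{t} I$ this reduces to the proximal gradient step $x_{k+1} = \mathrm{prox}_{t h}\!\big(x_k - t\,\nabla g(x_k)\big)$.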
no code implementations • 22 May 2012 • Jason D. Lee, Trevor J. Hastie
We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning.
no code implementations • NeurIPS 2010 • Jason D. Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan R. Salakhutdinov
The max-norm was proposed as a convex matrix regularizer by Srebro et al. (2004) and was shown to be empirically superior to the trace-norm for collaborative filtering problems.