no code implementations • 25 Jul 2023 • Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du

While a number of recent works have achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing an enormous burn-in cost for their algorithms to operate optimally.

1 code implementation • 7 Jul 2023 • Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.

no code implementations • 5 Jul 2023 • Tianle Cai, Kaixuan Huang, Jason D. Lee, Mengdi Wang

However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of the demonstrations.

no code implementations • 21 Jun 2023 • Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity.

no code implementations • 30 May 2023 • Etash Kumar Guha, Jason D. Lee

Reinforcement Learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics.

no code implementations • 29 May 2023 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

Reinforcement Learning with Human Feedback (RLHF) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals.

1 code implementation • 28 May 2023 • Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime.

2 code implementations • 27 May 2023 • Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
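
As a concrete illustration of the forward-pass-only direction this entry motivates, here is a minimal numpy sketch of a zeroth-order (SPSA-style) update; the toy least-squares loss stands in for an LM forward pass, and the seed-replay trick mimics how large-model implementations avoid storing a parameter-sized perturbation. This is a sketch under those assumptions, not the paper's exact algorithm.

```python
import numpy as np

def loss(theta, batch):
    # Hypothetical stand-in for a forward pass; any scalar loss works.
    X, y = batch
    return np.mean((X @ theta - y) ** 2)

def zo_step(theta, batch, lr=1e-3, eps=1e-3, seed=0):
    """One SPSA-style zeroth-order step using two forward passes.

    The perturbation z is regenerated from the seed rather than kept
    around; in the LM setting this is what keeps memory at inference
    level (here z is small, so the trick is purely illustrative)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)
    g = (loss(theta + eps * z, batch) - loss(theta - eps * z, batch)) / (2 * eps)
    rng = np.random.default_rng(seed)          # replay the same z
    z = rng.standard_normal(theta.shape)
    return theta - lr * g * z

X = np.random.randn(128, 10); y = np.random.randn(128)
theta = np.zeros(10)
for t in range(200):
    theta = zo_step(theta, (X, y), seed=t)
```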

no code implementations • 24 May 2023 • Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE.
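
A schematic numpy sketch of these two steps on a toy bandit-style problem with Bradley-Terry preferences; the linear reward parameterization, the random-search MLE, and the likelihood-slack confidence set are all illustrative stand-ins, not the paper's construction.

```python
import numpy as np

# Toy setup: K "trajectories" with linear features and a Bradley-Terry
# preference model, standing in for general function approximation.
rng = np.random.default_rng(0)
K, d, n = 5, 3, 500
feats = rng.standard_normal((K, d))
w_true = rng.standard_normal(d)

# Offline data: pairs (i, j), with y = 1 when i is preferred to j.
pairs = rng.integers(0, K, size=(n, 2))
margins = (feats[pairs[:, 0]] - feats[pairs[:, 1]]) @ w_true
y = rng.random(n) < 1 / (1 + np.exp(-margins))

def neg_loglik(w):
    logits = (feats[pairs[:, 0]] - feats[pairs[:, 1]]) @ w
    return np.mean(np.logaddexp(0.0, -np.where(y, logits, -logits)))

# Step 1: MLE of the implicit reward (crude random search for brevity).
cands = rng.standard_normal((2000, d))
losses = np.array([neg_loglik(w) for w in cands])
w_mle = cands[losses.argmin()]

# Step 2: distributionally robust (pessimistic) planning: pick the
# action maximizing the worst-case reward over a confidence set of
# models whose likelihood is close to the MLE's.
conf_set = cands[losses <= losses.min() + 0.01]
best = max(range(K), key=lambda a: min(feats[a] @ w for w in conf_set))
print(best)
```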

no code implementations • 19 May 2023 • Jingfeng Wu, Vladimir Braverman, Jason D. Lee

Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates.

no code implementations • 18 May 2023 • Alex Damian, Eshaan Nichani, Rong Ge, Jason D. Lee

Ben Arous et al. (2021) showed that $n \gtrsim d^{k^\star-1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD.

no code implementations • 17 May 2023 • Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment.

no code implementations • 11 May 2023 • Eshaan Nichani, Alex Damian, Jason D. Lee

We instantiate our framework in specific statistical learning settings -- single-index models and functions of quadratic features -- and show that in the latter setting three-layer networks obtain a sample complexity improvement over all existing guarantees for two-layer networks.

1 code implementation • 8 May 2023 • Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

Motivated by this observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.

no code implementations • 3 Mar 2023 • Zhuoqing Song, Jason D. Lee, Zhuoran Yang

Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game.

no code implementations • 22 Feb 2023 • Hanlin Zhu, Ruosong Wang, Jason D. Lee

Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large.

no code implementations • 9 Feb 2023 • Hadi Daneshmand, Jason D. Lee, Chi Jin

Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures.

no code implementations • 5 Feb 2023 • Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

In offline reinforcement learning (RL) there is no opportunity to explore, so we must assume that the data is sufficient to guide the choice of a good policy; such assumptions take the form of coverage, realizability, Bellman completeness, and/or a hard margin (gap).

1 code implementation • 30 Jan 2023 • Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.

no code implementations • 27 Jan 2023 • Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models.

no code implementations • 7 Dec 2022 • Zihan Wang, Jason D. Lee, Qi Lei

Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy.

no code implementations • 13 Oct 2022 • Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan

When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart).

1 code implementation • 30 Sep 2022 • Alex Damian, Eshaan Nichani, Jason D. Lee

Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.

no code implementations • 12 Jul 2022 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

We show that given a realizable model class, the sample complexity of learning the near-optimal policy scales only polynomially with the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces.

no code implementations • 30 Jun 2022 • Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.

no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

We study Reinforcement Learning for partially observable dynamical systems using function approximation.

no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space.

1 code implementation • 8 Jun 2022 • Eshaan Nichani, Yu Bai, Jason D. Lee

Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own.

no code implementations • 3 Jun 2022 • Wenhao Zhan, Jason D. Lee, Zhuoran Yang

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.

no code implementations • 18 May 2022 • Itay Safran, Gal Vardi, Jason D. Lee

We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting.

no code implementations • 29 Mar 2022 • Jiaqi Yang, Qi Lei, Jason D. Lee, Simon S. Du

We give novel algorithms for multi-task and lifelong linear bandits with shared representation.

no code implementations • 9 Feb 2022 • Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee

Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability).

no code implementations • 4 Dec 2021 • Itay Safran, Jason D. Lee

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities.

no code implementations • 18 Oct 2021 • Kurtland Chua, Qi Lei, Jason D. Lee

To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task.

no code implementations • 15 Oct 2021 • Xinyi Chen, Edgar Minasyan, Jason D. Lee, Elad Hazan

The theory of deep learning focuses almost exclusively on supervised learning, non-convex optimization using stochastic gradient descent, and overparametrized neural networks.

no code implementations • 29 Sep 2021 • DiJia Su, Jason D. Lee, John Mulvey, H. Vincent Poor

In the high-support region (low uncertainty), we encourage our policy to take aggressive updates.

no code implementations • ICLR 2022 • Baihe Huang, Jason D. Lee, Zhaoran Wang, Zhuoran Yang

In the coordinated setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm.

no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions.

no code implementations • NeurIPS 2021 • Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

This work considers a large family of bandit problems in which the unknown underlying reward function is non-concave, including low-rank generalized linear bandit problems and the two-layer neural network bandit problem with polynomial activations.

no code implementations • 6 Jul 2021 • Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning.

no code implementations • 23 Jun 2021 • Qi Lei, Wei Hu, Jason D. Lee

Transfer learning is essential when there is sufficient data from the source domain but scarce labeled data from the target domain.

no code implementations • NeurIPS 2021 • Alex Damian, Tengyu Ma, Jason D. Lee

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.

no code implementations • 24 May 2021 • Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.

no code implementations • NeurIPS 2021 • Kurtland Chua, Qi Lei, Jason D. Lee

Representation learning has been widely studied in the context of meta-learning, enabling rapid learning of new tasks through shared representations.

no code implementations • 19 Mar 2021 • Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the optimal $Q$-function and the optimal $V$-function are linear in some known feature space.

no code implementations • 23 Feb 2021 • DiJia Su, Jason D. Lee, John M. Mulvey, H. Vincent Poor

We consider a setting that lies between pure offline reinforcement learning (RL) and pure online RL, called deployment-constrained RL, in which the number of policy deployments for data sampling is limited.

no code implementations • 22 Feb 2021 • Tianle Cai, Ruiqi Gao, Jason D. Lee, Qi Lei

In this work, we propose a provably effective framework for domain adaptation based on label propagation.

no code implementations • 17 Feb 2021 • Yulai Zhao, Yuandong Tian, Jason D. Lee, Simon S. Du

Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces.

no code implementations • NeurIPS 2020 • Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$.

1 code implementation • NeurIPS 2020 • Yihong Gu, Weizhong Zhang, Cong Fang, Jason D. Lee, Tong Zhang

With the help of a new technique called neural network grafting, we demonstrate that throughout the entire training process, feature distributions of differently initialized networks remain similar at each layer.

no code implementations • NeurIPS 2020 • Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge

We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor when $m = O^*(r^{2.5l}\log d)$.

no code implementations • ICLR 2021 • Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du

For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.

no code implementations • 12 Oct 2020 • Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason D. Lee, Sham Kakade, Huan Wang, Caiming Xiong

A common practice in meta-learning is to perform a train-validation split (the "train-val" method) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.

1 code implementation • NeurIPS 2020 • Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Li-Wei Wang, Jason D. Lee

In this paper, we conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets"), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.

no code implementations • NeurIPS 2020 • Jason D. Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, Zheng Yu

Leverage score sampling is a powerful technique originating in theoretical computer science that can be used to speed up a large number of fundamental problems, e.g., linear regression, linear programming, semi-definite programming, the cutting plane method, graph sparsification, maximum matching, and max-flow.

no code implementations • NeurIPS 2021 • Jason D. Lee, Qi Lei, Nikunj Saunshi, Jiacheng Zhuo

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data to learn useful semantic representations.

no code implementations • NeurIPS 2020 • Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks".

no code implementations • 3 Jul 2020 • Cong Fang, Jason D. Lee, Pengkun Yang, Tong Zhang

This new representation overcomes the degenerate situation in which all the hidden units in each middle layer collapse to essentially a single meaningful unit, and further leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization.

no code implementations • NeurIPS 2020 • Minshuo Chen, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming Xiong, Richard Socher

When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$.

no code implementations • NeurIPS 2020 • Kaiyi Ji, Jason D. Lee, Yingbin Liang, H. Vincent Poor

Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational cost because it updates all model parameters over both the inner loop of task-specific adaptation and the outer loop of meta-initialization training.

1 code implementation • 15 Jun 2020 • Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma

We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms.

no code implementations • 5 Apr 2020 • Xi Chen, Jason D. Lee, He Li, Yun Yang

To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace.

no code implementations • 23 Mar 2020 • Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, Qiang Liu

Recently, Liu et al. [19] proposed a splitting steepest descent (S2D) method that jointly optimizes the neural parameters and architectures based on progressively growing network structures by splitting neurons into multiple copies in a steepest-descent fashion.

no code implementations • ICLR 2021 • Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei

First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation.

1 code implementation • 20 Feb 2020 • Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.

no code implementations • 17 Feb 2020 • Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\mathrm{dim}_E\right)$ is tight even in the agnostic setting.

no code implementations • 22 Nov 2019 • Maziar Sanjabi, Sina Baharlouei, Meisam Razaviyayn, Jason D. Lee

We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components.

no code implementations • ICML 2020 • Qi Lei, Jason D. Lee, Alexandros G. Dimakis, Constantinos Daskalakis

Generative adversarial networks (GANs) are a widely used framework for learning generative models.

no code implementations • ICLR 2020 • Yu Bai, Jason D. Lee

Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by the Neural Tangent Kernel (NTK).

2 code implementations • ICML 2020 • Ashok Vardhan Makkuva, Amirhossein Taghvaei, Sewoong Oh, Jason D. Lee

Building upon recent advances in the field of input convex neural networks, we propose a new framework where the gradient of one convex function represents the optimal transport mapping.
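
A minimal numpy sketch of the core idea: parameterize a convex potential and use its gradient as the transport map. This single-layer instance (nonnegative output weights over softplus units) is the simplest convex parameterization; the paper works with full input convex neural networks, so treat this as an illustrative special case.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(0)
d, m = 2, 64
W = rng.standard_normal((m, d))          # inner weights: unconstrained
b = rng.standard_normal(m)
a = np.abs(rng.standard_normal(m))       # output weights: kept >= 0

def potential(x):
    # f(x) = sum_i a_i * softplus(w_i . x + b_i); convex in x because it
    # is a nonnegative combination of convex functions of affine maps.
    return a @ np.logaddexp(0.0, W @ x + b)

def transport(x):
    # The gradient of a convex potential is a monotone map; in this
    # framework that gradient is the candidate optimal transport map.
    return (a * sigmoid(W @ x + b)) @ W

x = rng.standard_normal(d)
print(potential(x), transport(x))
```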

no code implementations • 1 Aug 2019 • Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces.

no code implementations • NeurIPS 2019 • Ruiqi Gao, Tianle Cai, Haochuan Li, Li-Wei Wang, Cho-Jui Hsieh, Jason D. Lee

Neural networks are vulnerable to adversarial examples, i.e., inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network.

1 code implementation • NeurIPS 2019 • Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning.

no code implementations • 17 May 2019 • Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry

With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization leads to margin-maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models.

1 code implementation • NeurIPS 2019 • Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$-first-order stationary point of the game can be computed when one of the players' objectives can be optimized to global optimality efficiently.

no code implementations • 7 Dec 2018 • Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee

In this short note, we consider the problem of solving a min-max zero-sum game.

no code implementations • NeurIPS 2018 • Sham M. Kakade, Jason D. Lee

The Cheap Gradient Principle (Griewank, 2008), which states that the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself, is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures.
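
A toy numpy illustration of the principle: for a scalar function built from elementwise operations, the hand-derived gradient (what reverse-mode autodiff produces mechanically) costs one extra $O(d)$ sweep, while treating $f$ as a black box costs $2d$ function evaluations. The function here is hypothetical.

```python
import numpy as np

def f(x):                       # scalar loss: one O(d) pass
    return np.sum(np.log1p(x * x))

def grad_f(x):                  # hand-derived "reverse sweep": also O(d)
    return 2 * x / (1 + x * x)

def grad_fd(x, eps=1e-6):       # black-box estimate: 2d passes, O(d^2)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.random.randn(1000)
assert np.allclose(grad_f(x), grad_fd(x), atol=1e-4)
```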

no code implementations • 9 Nov 2018 • Simon S. Du, Jason D. Lee, Haochuan Li, Li-Wei Wang, Xiyu Zhai

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex.

no code implementations • NeurIPS 2019 • Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.

no code implementations • 23 Sep 2018 • Sham Kakade, Jason D. Lee

The Cheap Gradient Principle (Griewank, 2008), which states that the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself, is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures.

no code implementations • NeurIPS 2018 • Simon S. Du, Wei Hu, Jason D. Lee

Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization.

no code implementations • NeurIPS 2018 • Shiyu Liang, Ruoyu Sun, Jason D. Lee, R. Srikant

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima.

1 code implementation • 20 Apr 2018 • Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity?

no code implementations • 5 Mar 2018 • Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails.

1 code implementation • ICML 2018 • Simon S. Du, Jason D. Lee

We provide new theoretical insights on why over-parametrization is effective in learning neural networks.

no code implementations • NeurIPS 2018 • Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, Jason D. Lee

A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions.

no code implementations • ICLR 2018 • Chenwei Wu, Jiajun Luo, Jason D. Lee

Deep learning models can be efficiently optimized via stochastic gradient descent, but there is little theoretical evidence to support this.

no code implementations • ICLR 2018 • Xuanqing Liu, Jason D. Lee, Cho-Jui Hsieh

Solving this subproblem is non-trivial: existing methods have only a sub-linear convergence rate.

no code implementations • ICML 2018 • Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, Aarti Singh

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned.
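
The displayed model transcribes directly to code; a minimal numpy version with illustrative sizes:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def f(Z, w, a):
    # f(Z, w, a) = sum_j a_j * sigma(w^T Z_j), where Z_j is the j-th
    # non-overlapping patch and sigma is the ReLU activation.
    return np.sum(a * relu(Z @ w))

# Illustrative sizes: k non-overlapping patches of dimension d each.
rng = np.random.default_rng(0)
k, d = 4, 8
Z = rng.standard_normal((k, d))   # rows are the patches Z_j
w = rng.standard_normal(d)        # shared convolutional filter
a = rng.standard_normal(k)        # output weights
print(f(Z, w, a))
```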

no code implementations • ICLR 2018 • Rong Ge, Jason D. Lee, Tengyu Ma

All global minima of $G$ correspond to the ground truth parameters.

no code implementations • 20 Oct 2017 • Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael. I. Jordan, Benjamin Recht

We establish that first-order methods avoid saddle points for almost all initializations.

no code implementations • ICLR 2018 • Simon S. Du, Jason D. Lee, Yuandong Tian

We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches.

no code implementations • 28 Aug 2017 • Xuanqing Liu, Cho-Jui Hsieh, Jason D. Lee, Yuekai Sun

We propose a fast proximal Newton-type algorithm for minimizing regularized finite sums that returns an $\epsilon$-suboptimal point in $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS, where $n$ is the number of samples, $d$ is the feature dimension, and $\kappa$ is the condition number.

no code implementations • 16 Jul 2017 • Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee

In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set.

no code implementations • NeurIPS 2017 • Simon S. Du, Chi Jin, Jason D. Lee, Michael. I. Jordan, Barnabas Poczos, Aarti Singh

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape.

no code implementations • 26 Apr 2017 • Adel Javanmard, Jason D. Lee

By duality between hypotheses testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters.

no code implementations • 27 Oct 2016 • Xi Chen, Jason D. Lee, Xin T. Tong, Yichen Zhang

Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal.

no code implementations • 17 Oct 2016 • Qiang Liu, Jason D. Lee

Importance sampling is widely used in machine learning and statistics, but its power is limited by the restriction of using simple proposals for which the importance weights can be tractably calculated.
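
For reference, the "simple proposal" regime reads as follows in code: a minimal numpy example of self-normalized importance sampling with a Gaussian proposal, where the density ratio (the importance weight) is tractable. The target and test function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p: standard Laplace; proposal q: N(0, 2^2) -- "simple" because
# its density, and hence the weights w = p/q, are tractable.
def p(x): return 0.5 * np.exp(-np.abs(x))
def q(x): return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

xs = rng.normal(0.0, 2.0, size=100_000)    # draw from the proposal
w = p(xs) / q(xs)                          # importance weights

# Self-normalized estimate of E_p[x^2] (true value: 2 for Laplace(0,1)).
est = np.sum(w * xs**2) / np.sum(w)
print(est)   # ~2.0
```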

no code implementations • 10 Oct 2016 • Jialei Wang, Jason D. Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro

Sketching techniques have become popular for scaling up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, while still maintaining the statistical power of big data.

no code implementations • 25 May 2016 • Michael. I. Jordan, Jason D. Lee, Yun Yang

CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation and Bayesian inference.

no code implementations • NeurIPS 2016 • Rong Ge, Jason D. Lee, Tengyu Ma

Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems.

no code implementations • 16 Feb 2016 • Jason D. Lee, Max Simchowitz, Michael. I. Jordan, Benjamin Recht

We show that gradient descent converges to a local minimizer, almost surely with random initialization.

no code implementations • 10 Feb 2016 • Qiang Liu, Jason D. Lee, Michael. I. Jordan

We derive a new discrepancy statistic for measuring differences between two probability distributions based on combining Stein's identity with reproducing kernel Hilbert space theory.
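
The resulting statistic is the kernelized Stein discrepancy; below is a one-dimensional numpy sketch with an RBF kernel and a standard Gaussian target, both illustrative choices. Note that it only needs the target's score function, not its normalizing constant.

```python
import numpy as np

def ksd_sq(xs, score, h=1.0):
    """V-statistic estimate of the squared kernelized Stein discrepancy
    between the sample xs and the target with the given score function,
    using the 1-d RBF kernel k(x,y) = exp(-(x-y)^2 / (2 h^2))."""
    d = xs[:, None] - xs[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    s = score(xs)
    u = (s[:, None] * s[None, :] * k          # s(x) s(y) k(x,y)
         + s[:, None] * (d / h**2) * k        # s(x) d_y k(x,y)
         + s[None, :] * (-d / h**2) * k       # s(y) d_x k(x,y)
         + k * (1 / h**2 - d**2 / h**4))      # d_x d_y k(x,y)
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                  # grad log density of N(0, 1)
print(ksd_sq(rng.normal(0, 1, 500), score))   # small: sample matches target
print(ksd_sq(rng.normal(2, 1, 500), score))   # larger: sample mismatched
```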

no code implementations • NeurIPS 2015 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data.

no code implementations • 25 Nov 2015 • Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, Michael. I. Jordan

For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$.

no code implementations • 13 Oct 2015 • Yuchen Zhang, Jason D. Lee, Michael. I. Jordan

The sample complexity and the time complexity of the presented method are polynomial in the input dimension and in $(1/\epsilon,\log(1/\delta), F(k, L))$, where $F(k, L)$ is a function depending on $(k, L)$ and on the activation function, independent of the number of neurons.

no code implementations • 27 Jul 2015 • Jason D. Lee, Qihang Lin, Tengyu Ma, Tianbao Yang

We also prove a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper.

no code implementations • 30 Jun 2015 • Jason D. Lee

We present the Condition-on-Selection method, which allows for valid selective inference, and study its application to the lasso and several other selection algorithms.

no code implementations • 14 Mar 2015 • Jason D. Lee, Yuekai Sun, Qiang Liu, Jonathan E. Taylor

We devise a one-shot approach to distributed sparse regression in the high-dimensional setting.

1 code implementation • NeurIPS 2014 • Austin R. Benson, Jason D. Lee, Bartek Rajwa, David F. Gleich

We demonstrate the efficacy of these algorithms on terabyte-sized synthetic matrices and real-world matrices from scientific computing and bioinformatics.

no code implementations • NeurIPS 2014 • Jason D. Lee, Jonathan E. Taylor

We develop a framework for post model selection inference, via marginal screening, in linear regression.

no code implementations • NeurIPS 2013 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.

no code implementations • NeurIPS 2013 • Jason D. Lee, Ran Gilad-Bachrach, Rich Caruana

In the mixture models problem it is assumed that there are $K$ distributions $\theta_{1},\ldots,\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients.

no code implementations • 25 Nov 2013 • Jason D. Lee, Dennis L. Sun, Yuekai Sun, Jonathan E. Taylor

We develop a general approach to valid inference after model selection.

no code implementations • 31 May 2013 • Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Regularized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.

1 code implementation • 7 Jun 2012 • Jason D. Lee, Yuekai Sun, Michael A. Saunders

We generalize Newton-type methods for minimizing smooth functions to handle a sum of two convex functions: a smooth function and a nonsmooth function with a simple proximal mapping.
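
A minimal numpy sketch of this composite setup, shown in its identity-scaled (proximal gradient) special case for the lasso; the paper's Newton-type methods instead use Hessian-based scaling in the proximal step, which this sketch omits for brevity.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal mapping of t * ||.||_1: the "simple prox" in the framework.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_lasso(X, y, lam, step, iters=500):
    """Minimize g(w) + h(w) with g(w) = 0.5 ||Xw - y||^2 (smooth) and
    h(w) = lam * ||w||_1 (nonsmooth, simple prox). This is the identity-
    scaled special case; Newton-type variants use a scaled prox."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(100)
step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / L with L = ||X||_2^2
print(prox_grad_lasso(X, y, lam=5.0, step=step)[:5])
```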

no code implementations • 22 May 2012 • Jason D. Lee, Trevor J. Hastie

We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning.

no code implementations • NeurIPS 2010 • Jason D. Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan R. Salakhutdinov

The max-norm was proposed as a convex matrix regularizer by Srebro et al (2004) and was shown to be empirically superior to the trace-norm for collaborative filtering problems.
