Search Results for author: Peter L. Bartlett

Found 82 papers, 4 papers with code

Sharpness-Aware Minimization and the Edge of Stability

1 code implementation 21 Sep 2023 Philip M. Long, Peter L. Bartlett

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value.
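
As a rough illustration of the quantity involved (not code from the paper), the sketch below tracks the operator norm of the loss Hessian (the "sharpness") of a small two-layer network trained by full-batch GD, estimating it by power iteration on finite-difference Hessian-vector products and comparing it with the $2/\eta$ threshold. The toy data, architecture, and step size are arbitrary choices, and whether the sharpness actually reaches $2/\eta$ depends on the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                     # toy inputs
y = np.sin(X @ rng.normal(size=4))               # toy regression targets
p = np.concatenate([rng.normal(size=64) / 4, rng.normal(size=16) / 4])   # flattened (W1, w2)

def unpack(p):
    return p[:64].reshape(4, 16), p[64:]

def loss(p):
    W1, w2 = unpack(p)
    return 0.5 * np.mean((np.tanh(X @ W1) @ w2 - y) ** 2)

def grad(p, eps=1e-5):
    g = np.zeros_like(p)                         # central finite-difference gradient
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (loss(p + e) - loss(p - e)) / (2 * eps)
    return g

def sharpness(p, iters=30, eps=1e-4):
    # Power iteration on finite-difference Hessian-vector products.
    v = rng.normal(size=p.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        Hv = (grad(p + eps * v) - grad(p - eps * v)) / (2 * eps)
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return v @ ((grad(p + eps * v) - grad(p - eps * v)) / (2 * eps))

eta = 0.3
for t in range(301):
    p = p - eta * grad(p)
    if t % 100 == 0:
        print(f"step {t:3d}  loss {loss(p):.4f}  sharpness {sharpness(p):.2f}  2/eta {2 / eta:.2f}")
```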

Greedy Convex Ensemble

1 code implementation 9 Oct 2019 Tan Nguyen, Nan Ye, Peter L. Bartlett

Theoretically, we first consider whether we can use linear, instead of convex, combinations, and obtain generalization results similar to existing ones for learning from a convex hull.

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

no code implementations ICML 2018 Peter L. Bartlett, David P. Helmbold, Philip M. Long

We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = ... = \Theta_L = I$ has excess loss bounded by a small enough constant.

Best of many worlds: Robust model selection for online supervised learning

no code implementations 22 May 2018 Vidya Muthukumar, Mitas Ray, Anant Sahai, Peter L. Bartlett

We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings.

Model Selection

Sharp convergence rates for Langevin dynamics in the nonconvex setting

no code implementations 4 May 2018 Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, Michael I. Jordan

We study the problem of sampling from a distribution $p^*(x) \propto \exp\left(-U(x)\right)$, where the function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, but potentially nonconvex inside this ball.
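
A minimal sketch of the kind of sampler analyzed in this line of work is the unadjusted Langevin algorithm below, run on a toy double-well potential that is nonconvex near the origin and strongly convex far from it; the potential, step size, and iteration counts are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):
    # U(x) = 0.25 * ||x||^4 - 0.5 * ||x||^2, so grad U(x) = (||x||^2 - 1) * x:
    # nonconvex near the origin, strongly convex far away from it.
    return (x @ x - 1.0) * x

eta = 1e-3                          # step size (illustrative)
x = rng.normal(size=2)
samples = []
for k in range(100_000):
    x = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.normal(size=2)   # Euler-Maruyama step
    if k > 20_000:                  # discard burn-in
        samples.append(x.copy())

samples = np.array(samples)
print("empirical mean (target is symmetric, so ~0):", samples.mean(axis=0))
```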

Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

no code implementations 13 Apr 2018 Peter L. Bartlett, Steven N. Evans, Philip M. Long

This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant.

Online learning with kernel losses

no code implementations 27 Feb 2018 Aldo Pacchiano, Niladri S. Chatterji, Peter L. Bartlett

We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm.

On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo

no code implementations ICML 2018 Niladri S. Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L. Bartlett, Michael I. Jordan

We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion.

Alternating minimization for dictionary learning: Local Convergence Guarantees

no code implementations NeurIPS 2017 Niladri S. Chatterji, Peter L. Bartlett

We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem.

Dictionary Learning

Underdamped Langevin MCMC: A non-asymptotic analysis

no code implementations 12 Jul 2017 Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, Michael I. Jordan

We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave.
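
For illustration only, the sketch below runs a plain Euler discretization of underdamped Langevin dynamics on a Gaussian target, to show the position/velocity structure of the sampler; the paper analyzes a sharper discretization, and the friction, step size, and target here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma_inv = np.array([[2.0, 0.5], [0.5, 1.0]])   # precision of a Gaussian target (smooth, strongly log-concave)
grad_U = lambda x: Sigma_inv @ x                 # U(x) = 0.5 * x^T Sigma_inv x

gamma, eta = 2.0, 1e-2                           # friction and step size (illustrative)
x, v = np.zeros(2), np.zeros(2)
xs = []
for k in range(100_000):
    # Plain Euler step for dv = (-gamma v - grad U(x)) dt + sqrt(2 gamma) dB,  dx = v dt
    v += -eta * gamma * v - eta * grad_U(x) + np.sqrt(2 * gamma * eta) * rng.normal(size=2)
    x += eta * v
    if k > 20_000:
        xs.append(x.copy())

print("empirical covariance:\n", np.cov(np.array(xs).T))
print("target covariance:\n", np.linalg.inv(Sigma_inv))
```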

FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods

no code implementations 26 May 2016 Xiang Cheng, Farbod Roosta-Khorasani, Stefan Palombo, Peter L. Bartlett, Michael W. Mahoney

We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions.

Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

no code implementations 8 Mar 2017 Peter L. Bartlett, Nick Harvey, Chris Liaw, Abbas Mehrabian

We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function.

Acceleration and Averaging in Stochastic Mirror Descent Dynamics

no code implementations 19 Jul 2017 Walid Krichene, Peter L. Bartlett

We discuss the interaction between the parameters of the dynamics (learning rate and averaging weights) and the covariation of the noise process, and show, in particular, how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.

Recovery Guarantees for One-hidden-layer Neural Networks

no code implementations ICML 2017 Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, Inderjit S. Dhillon

For activation functions that are also smooth, we show $\mathit{local~linear~convergence}$ guarantees of gradient descent under a resampling rule.

Hit-and-Run for Sampling and Planning in Non-Convex Spaces

no code implementations 19 Oct 2016 Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek

We propose the Hit-and-Run algorithm for planning and sampling problems in non-convex spaces.
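
A minimal sketch of a Hit-and-Run step with a membership oracle is given below: draw a random direction, restrict attention to the chord through the current point, and move to a point chosen uniformly from the feasible part of that chord. The toy non-convex set and the grid discretization of the chord are simplifying assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def member(x):
    # Toy non-convex set: the unit box with a disk of radius 0.5 removed from its centre.
    return bool((np.abs(x) <= 1.0).all() and (x @ x >= 0.25))

def hit_and_run_step(x, radius=2.0, grid=801):
    d = rng.normal(size=x.size)
    d /= np.linalg.norm(d)                      # uniform random direction
    ts = np.linspace(-radius, radius, grid)     # discretized chord through x
    feasible = [t for t in ts if member(x + t * d)]
    if not feasible:                            # nothing feasible found along this chord: stay put
        return x
    return x + rng.choice(feasible) * d         # uniform over the feasible chord points

x = np.array([0.9, 0.9])
for _ in range(1000):
    x = hit_and_run_step(x)
print("final sample:", x, "in set:", member(x))
```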

Linear Programming for Large-Scale Markov Decision Problems

no code implementations 27 Feb 2014 Yasin Abbasi-Yadkori, Peter L. Bartlett, Alan Malek

We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost.

Bounding Embeddings of VC Classes into Maximum Classes

no code implementations 29 Jan 2014 J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett

The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings would extend the known compression schemes to all VC classes.

Learning Theory

A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption

no code implementations 1 Oct 2018 Peter L. Bartlett, Victor Gabillon, Michal Valko

The difficulty of optimization is measured in terms of 1) the amount of \emph{noise} $b$ of the function evaluation and 2) the local smoothness, $d$, of the function.

Gen-Oja: A Two-time-scale approach for Streaming CCA

no code implementations 20 Nov 2018 Kush Bhatia, Aldo Pacchiano, Nicolas Flammarion, Peter L. Bartlett, Michael I. Jordan

In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting.

Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems

no code implementations 20 Dec 2018 Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, Martin J. Wainwright

We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback.
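
As a hedged illustration of the derivative-free approach (not the paper's exact estimator), the sketch below runs two-point zeroth-order policy search on a toy discrete-time LQR instance: the cost of a static feedback gain is evaluated by simulation, and the gradient is approximated from random perturbations. The dynamics, horizon, smoothing radius, and step size are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])          # dynamics x_{t+1} = A x_t + B u_t
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), 0.1 * np.eye(1)
x0s = rng.normal(size=(10, 2))                  # fixed evaluation initial states

def cost(K, horizon=50):
    total = 0.0
    for x0 in x0s:
        x = x0.copy()
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
    return total / len(x0s)

K = np.zeros((1, 2))                            # static feedback u = -K x
r, lr = 0.05, 1e-4                              # smoothing radius and step size (illustrative)
for t in range(400):
    U = rng.normal(size=K.shape)
    U /= np.linalg.norm(U)                      # random search direction
    g_hat = (cost(K + r * U) - cost(K - r * U)) / (2 * r) * U   # two-point estimate
    K -= lr * g_hat
    if t % 100 == 0:
        print(f"iter {t:3d}  simulated cost {cost(K):.2f}")
```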

Horizon-Independent Minimax Linear Regression

no code implementations NeurIPS 2018 Alan Malek, Peter L. Bartlett

We consider online linear regression: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error.
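
The protocol itself is easy to write down; in the sketch below a simple ridge-style follow-the-regularized-leader predictor stands in for the paper's minimax strategy (it is not that strategy), and the data stream is a noisy linear model chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 3, 500, 1.0
A = lam * np.eye(d)            # running regularized Gram matrix
b = np.zeros(d)                # running sum of y_t * x_t
total_loss = 0.0
w_true = rng.normal(size=d)    # the "adversary" here is just a noisy linear model

for t in range(T):
    x_t = rng.normal(size=d)                   # adversary reveals a covariate
    y_hat = np.linalg.solve(A, b) @ x_t        # learner predicts a real value
    y_t = w_true @ x_t + 0.1 * rng.normal()    # adversary reveals the label
    total_loss += (y_hat - y_t) ** 2           # learner suffers the squared error
    A += np.outer(x_t, x_t)                    # update the ridge statistics
    b += y_t * x_t

print("cumulative squared loss:", total_loss)
```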

regression

Gen-Oja: Simple & Efficient Algorithm for Streaming Generalized Eigenvector Computation

no code implementations NeurIPS 2018 Kush Bhatia, Aldo Pacchiano, Nicolas Flammarion, Peter L. Bartlett, Michael I. Jordan

In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting.

Alternating minimization for dictionary learning with random initialization

no code implementations NeurIPS 2017 Niladri Chatterji, Peter L. Bartlett

However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term).

Dictionary Learning

Acceleration and Averaging in Stochastic Descent Dynamics

no code implementations NeurIPS 2017 Walid Krichene, Peter L. Bartlett

We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions.

Adaptive Averaging in Accelerated Descent Dynamics

no code implementations NeurIPS 2016 Walid Krichene, Alexandre Bayen, Peter L. Bartlett

This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate $\eta(t)$, and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights $w(t)$.

Efficient Minimax Strategies for Square Loss Games

no code implementations NeurIPS 2014 Wouter M. Koolen, Alan Malek, Peter L. Bartlett

We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance.

Density Estimation

How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal

no code implementations NeurIPS 2013 Jacob Abernethy, Peter L. Bartlett, Rafael Frongillo, Andre Wibisono

We consider a popular problem in finance, option pricing, through the lens of an online learning game between Nature and an Investor.

Information-theoretic lower bounds on the oracle complexity of convex optimization

no code implementations NeurIPS 2009 Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

The extensive use of convex optimization in machine learning and statistics makes it critical to understand the fundamental computational limits of learning and estimation.

Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

no code implementations NeurIPS 2007 Ambuj Tewari, Peter L. Bartlett

OLP is closely related to an algorithm proposed by Burnetas and Katehakis, with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of its regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm.

Large-Scale Markov Decision Problems via the Linear Programming Dual

no code implementations 6 Jan 2019 Yasin Abbasi-Yadkori, Peter L. Bartlett, Xi Chen, Alan Malek

Moreover, we propose an efficient algorithm, scaling with the size of the subspace but not the state space, that is able to find a policy with low excess loss relative to the best policy in this class.

Quantitative Weak Convergence for Discrete Stochastic Processes

no code implementations 3 Feb 2019 Xiang Cheng, Peter L. Bartlett, Michael I. Jordan

In this paper, we establish quantitative convergence in $W_2$ for a family of Langevin-like stochastic processes that includes stochastic gradient descent and related gradient-based algorithms.

Testing Markov Chains without Hitting

no code implementations 6 Feb 2019 Yeshwanth Cherapanamjeri, Peter L. Bartlett

We study the problem of identity testing of Markov chains.

Fast Mean Estimation with Sub-Gaussian Rates

1 code implementation 6 Feb 2019 Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett

We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^4+n^2d)$ for $n$ i.i.d. samples and that has error bounds matching the sub-Gaussian case.

OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits

no code implementations 24 May 2019 Niladri S. Chatterji, Vidya Muthukumar, Peter L. Bartlett

We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information.

Multi-Armed Bandits

Langevin Monte Carlo without smoothness

no code implementations 30 May 2019 Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, Peter L. Bartlett

Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant.

Benign Overfitting in Linear Regression

no code implementations 26 Jun 2019 Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler

Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction.
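
The object studied here, the minimum-norm interpolating least-squares solution, can be illustrated directly; in the toy sketch below (dimensions, covariance, and noise are arbitrary choices) it fits noisy training labels exactly while its test error is reported for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000
scales = 1.0 / np.sqrt(1 + np.arange(d))        # decaying feature scales (illustrative covariance)
w_star = np.zeros(d)
w_star[:5] = 1.0                                # sparse "true" signal

X = rng.normal(size=(n, d)) * scales
y = X @ w_star + 0.5 * rng.normal(size=n)       # noisy training labels

w_mn = X.T @ np.linalg.solve(X @ X.T, y)        # minimum-norm interpolating solution (d > n)

X_test = rng.normal(size=(1000, d)) * scales
print("train error:", np.mean((X @ w_mn - y) ** 2))                      # ~0: perfect fit to noisy data
print("test  error:", np.mean((X_test @ w_mn - X_test @ w_star) ** 2))
```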

regression

Stochastic Gradient and Langevin Processes

no code implementations ICML 2020 Xiang Cheng, Dong Yin, Peter L. Bartlett, Michael I. Jordan

We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation.

Bayesian Robustness: A Nonasymptotic Viewpoint

no code implementations 27 Jul 2019 Kush Bhatia, Yi-An Ma, Anca D. Dragan, Peter L. Bartlett, Michael I. Jordan

We study the problem of robustly estimating the posterior distribution for the setting where observed data can be contaminated with potentially adversarial outliers.

Binary Classification, regression

High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm

no code implementations 28 Aug 2019 Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan

We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with log-concave and smooth densities.

An Efficient Sampling Algorithm for Non-smooth Composite Potentials

no code implementations 1 Oct 2019 Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, Peter L. Bartlett

We consider the problem of sampling from a density of the form $p(x) \propto \exp(-f(x)- g(x))$, where $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a smooth and strongly convex function and $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a convex and Lipschitz function.

Infinite-Horizon Policy-Gradient Estimation

no code implementations 3 Jun 2011 Jonathan Baxter, Peter L. Bartlett

In this paper we introduce GPOMDP, a simulation-based algorithm for generating a {\em biased} estimate of the gradient of the {\em average reward} in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies.
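
A compact sketch of a GPOMDP-style estimator is given below: a discounted eligibility trace of policy score functions is combined with observed rewards to form a running (biased) estimate of the gradient of the average reward. The two-state chain, softmax tabular policy, and discount $\beta$ are illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))          # tabular softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    # Toy controlled chain: action 0 tends to stay, action 1 tends to switch states.
    stay = 0.9 if a == 0 else 0.2
    s_next = s if rng.random() < stay else 1 - s
    reward = 1.0 if s_next == 1 else 0.0         # being in state 1 is rewarded
    return s_next, reward

def gpomdp_estimate(T=100_000, beta=0.95):
    s = 0
    z = np.zeros_like(theta)                     # discounted eligibility trace of score functions
    delta = np.zeros_like(theta)                 # running gradient estimate
    for t in range(T):
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        score = -np.outer(np.eye(n_states)[s], p)   # d/d theta of log pi(a | s)
        score[s, a] += 1.0
        s, r = step(s, a)
        z = beta * z + score
        delta += (r * z - delta) / (t + 1)       # running average of reward * trace
    return delta

print("estimated (biased) average-reward gradient:\n", gpomdp_estimate())
```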

Hebbian Synaptic Modifications in Spiking Neurons that Learn

no code implementations 17 Nov 2019 Peter L. Bartlett, Jonathan Baxter

In this paper, we derive a new model of synaptic plasticity, based on recent algorithms for reinforcement learning (in which an agent attempts to learn appropriate actions to maximize its long-term average reward).

Reinforcement Learning (RL)

Sampling for Bayesian Mixture Models: MCMC with Polynomial-Time Mixing

no code implementations 11 Dec 2019 Wenlong Mou, Nhat Ho, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan

We study the problem of sampling from the power posterior distribution in Bayesian Gaussian mixture models, a robust version of the classical posterior.

Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms

no code implementations 1 Feb 2020 Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed.

Self-Distillation Amplifies Regularization in Hilbert Space

no code implementations NeurIPS 2020 Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett

Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another.
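
One round of self-distillation in a Hilbert-space (kernel ridge) setting can be sketched as follows: fit kernel ridge regression to noisy labels, then refit the same model to its own fitted values and observe the additional shrinkage. The kernel, bandwidth, and regularization below are illustrative, and this is only a toy version of the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + 0.3 * rng.normal(size=n)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # RBF Gram matrix (bandwidth illustrative)
lam = 0.1

def krr_fit(targets):
    alpha = np.linalg.solve(K + lam * np.eye(n), targets)
    return K @ alpha                                 # fitted values on the training inputs

f1 = krr_fit(y)        # teacher: fit the noisy labels
f2 = krr_fit(f1)       # student: fit the teacher's own predictions (self-distillation)

print("||f1||:", np.linalg.norm(f1), " ||f2||:", np.linalg.norm(f2))
print("student is a further-shrunk version of the teacher:", np.linalg.norm(f2) < np.linalg.norm(f1))
```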

Knowledge Distillation, L2 Regularization

On Thompson Sampling with Langevin Algorithms

no code implementations ICML 2020 Eric Mazumdar, Aldo Pacchiano, Yi-An Ma, Peter L. Bartlett, Michael I. Jordan

The resulting approximate Thompson sampling algorithm has logarithmic regret and its computational complexity does not scale with the time horizon of the algorithm.

Thompson Sampling

On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration

no code implementations 9 Apr 2020 Wenlong Mou, Chris Junchi Li, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan

When the matrix $\bar{A}$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity.
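
A minimal sketch of the recursion in question, fixed-step-size linear stochastic approximation with Polyak-Ruppert averaging for solving $\bar{A}\theta = \bar{b}$ from noisy observations, is given below; the matrices, noise level, and step size are illustrative, with $\bar{A}$ chosen positive definite so the recursion is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
A_bar = np.array([[2.0, 0.3], [0.1, 1.0]])
b_bar = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A_bar, b_bar)

eta, T = 0.05, 50_000
theta = np.zeros(2)
theta_avg = np.zeros(2)
for t in range(1, T + 1):
    A_t = A_bar + 0.5 * rng.normal(size=(2, 2))       # noisy observation of A_bar
    b_t = b_bar + 0.5 * rng.normal(size=2)            # noisy observation of b_bar
    theta = theta - eta * (A_t @ theta - b_t)         # fixed-step-size LSA iterate
    theta_avg += (theta - theta_avg) / t              # Polyak-Ruppert running average

print("last iterate error    :", np.linalg.norm(theta - theta_star))
print("averaged iterate error:", np.linalg.norm(theta_avg - theta_star))
```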

Optimal Robust Linear Regression in Nearly Linear Time

no code implementations 16 Jul 2020 Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, Peter L. Bartlett

We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$ (with $X \in \mathbb{R}^d$ and $\epsilon$ independent), in which an $\eta$ fraction of the samples have been adversarially corrupted.

regression

Failures of model-dependent generalization bounds for least-norm interpolation

no code implementations 16 Oct 2020 Peter L. Bartlett, Philip M. Long

We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data.

Generalization Bounds, Learning Theory

Optimal Mean Estimation without a Variance

no code implementations 24 Nov 2020 Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, Michael I. Jordan

Concretely, given a sample $\mathbf{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $\mu$ which satisfies the following \emph{weak-moment} assumption for some ${\alpha \in [0, 1]}$: \begin{equation*} \forall \|v\| = 1: \mathbb{E}_{X \thicksim \mathcal{D}}[\lvert \langle X - \mu, v\rangle \rvert^{1 + \alpha}] \leq 1, \end{equation*} and given a target failure probability, $\delta$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n, d,\delta$.

When does gradient descent with logistic loss find interpolating two-layer networks?

no code implementations 4 Dec 2020 Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss.

Binary Classification

When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

no code implementations 9 Feb 2021 Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence.

Deep learning: a statistical viewpoint

no code implementations 16 Mar 2021 Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin

We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting.

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

no code implementations 17 Mar 2021 Lin Chen, Bruno Scherrer, Peter L. Bartlett

In this regime, for any $q\in[\gamma^{2}, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$.

Off-policy evaluation

Agnostic learning with unknown utilities

no code implementations 17 Apr 2021 Kush Bhatia, Peter L. Bartlett, Anca D. Dragan, Jacob Steinhardt

This raises an interesting question whether learning is even possible in our setup, given that obtaining a generalizable estimate of utility $u^*$ might not be possible from finitely many samples.

Preference learning along multiple criteria: A game-theoretic perspective

no code implementations NeurIPS 2020 Kush Bhatia, Ashwin Pananjady, Peter L. Bartlett, Anca D. Dragan, Martin J. Wainwright

Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences.

Autonomous Driving

Adversarial Examples in Multi-Layer Random ReLU Networks

no code implementations NeurIPS 2021 Peter L. Bartlett, Sébastien Bubeck, Yeshwanth Cherapanamjeri

We consider the phenomenon of adversarial examples in ReLU networks with independent Gaussian parameters.

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

no code implementations 25 Aug 2021 Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data.

Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

no code implementations 23 Dec 2021 Wenlong Mou, Ashwin Pananjady, Martin J. Wainwright, Peter L. Bartlett

We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms.

Model Selection

Optimal variance-reduced stochastic approximation in Banach spaces

no code implementations 21 Jan 2022 Wenlong Mou, Koulik Khamaru, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan

We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space.

Q-Learning

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

no code implementations 11 Feb 2022 Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent.

Random Feature Amplification: Feature Learning and Generalization in Neural Networks

no code implementations 15 Feb 2022 Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

We consider data with binary labels that are generated by an XOR-like function of the input features.

Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency

no code implementations 26 Sep 2022 Wenlong Mou, Martin J. Wainwright, Peter L. Bartlett

The problem of estimating a linear functional based on observational data is canonical in both the causal inference and bandit literatures.

Causal Inference

The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima

no code implementations 4 Oct 2022 Peter L. Bartlett, Philip M. Long, Olivier Bousquet

We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems.
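
The SAM update, as commonly stated, can be sketched in a few lines: perturb the weights by $\rho$ times the normalized gradient, evaluate the gradient at the perturbed point, and apply it at the original point. The quadratic toy loss and hyperparameters below are illustrative, not the paper's setting.

```python
import numpy as np

H = np.diag([10.0, 1.0])                 # toy quadratic loss 0.5 * w^T H w
loss = lambda w: 0.5 * w @ H @ w
grad = lambda w: H @ w

w = np.array([1.0, 1.0])
rho, eta = 0.05, 0.05                    # perturbation radius and step size (illustrative)
for t in range(100):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step to the "sharp" neighbor
    w = w - eta * grad(w + eps)                   # SAM step: gradient taken at the perturbed point
    if t % 25 == 0:
        print(f"iter {t:3d}  loss {loss(w):.5f}")
```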

Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

no code implementations 13 Oct 2022 Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu

In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data.

Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency

no code implementations 16 Jan 2023 Wenlong Mou, Peng Ding, Martin J. Wainwright, Peter L. Bartlett

When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function.

regression

Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

no code implementations 2 Mar 2023 Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization.

Prediction, Learning, Uniform Convergence, and Scale-sensitive Dimensions

no code implementations 21 Apr 2023 Peter L. Bartlett, Philip M. Long

We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension.

Trained Transformers Learn Linear Models In-Context

no code implementations 16 Jun 2023 Ruiqi Zhang, Spencer Frei, Peter L. Bartlett

We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts.

In-Context Learning, regression

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

no code implementations 12 Oct 2023 Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett

Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters.
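
For concreteness, the sketch below constructs the kind of in-context linear-regression prompts typically used in this line of work: each pretraining task is a random weight vector, and a prompt is a sequence of $(x, y)$ examples from that task followed by a query whose label must be predicted from context alone. The dimensions, context length, task count, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, context_len, n_tasks = 8, 16, 1000

def sample_prompt():
    w_task = rng.normal(size=d)                          # a fresh linear-regression task
    X = rng.normal(size=(context_len + 1, d))            # context inputs plus one query input
    y = X @ w_task + 0.1 * rng.normal(size=context_len + 1)
    prompt = np.concatenate([X, y[:, None]], axis=1)     # rows are (x_i, y_i) pairs
    prompt[-1, -1] = 0.0                                 # hide the query label from the model
    return prompt, y[-1]                                 # the model should predict y[-1] in context

prompts, targets = zip(*(sample_prompt() for _ in range(n_tasks)))
print("pretraining set:", len(prompts), "prompts of shape", prompts[0].shape)
```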

In-Context Learning, regression

On the Statistical Properties of Generative Adversarial Models for Low Intrinsic Data Dimension

no code implementations 28 Jan 2024 Saptarshi Chakraborty, Peter L. Bartlett

In this paper, we attempt to bridge the gap between the theory and practice of GANs and their bidirectional variant, Bi-directional GANs (BiGANs), by deriving statistical guarantees on the estimated densities in terms of the intrinsic dimension of the data and the latent space.

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

no code implementations 22 Feb 2024 Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett

We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component.

In-Context Learning

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

no code implementations 24 Feb 2024 Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates.
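
The phenomenon is easy to reproduce in a toy example: the sketch below runs GD with a deliberately large constant step size on logistic loss over synthetic linearly separable data, where the loss is typically non-monotone in the early iterations before it eventually decreases. The data construction and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
w_star = np.array([1.0, 1.0]) / np.sqrt(2)
X = rng.normal(size=(n, d))
X += np.sign(X @ w_star)[:, None] * w_star   # push points off the boundary: margin >= 1
y = np.sign(X @ w_star)                      # linearly separable labels by construction

def sigmoid(z):
    ez = np.exp(-np.abs(z))                  # numerically stable sigmoid
    return np.where(z >= 0, 1.0 / (1.0 + ez), ez / (1.0 + ez))

def loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))     # logistic loss

def grad(w):
    return -(X.T @ (sigmoid(-y * (X @ w)) * y)) / n

eta = 50.0                                   # deliberately large constant step size
w = np.zeros(d)
for t in range(60):
    w -= eta * grad(w)
    if t < 8 or t % 10 == 0:
        print(f"iter {t:2d}  loss {loss(w):.4f}")        # typically non-monotone early, then decreasing
```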

General Classification

A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data

no code implementations 24 Feb 2024 Saptarshi Chakraborty, Peter L. Bartlett

To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen.
