no code implementations • 7 Nov 2024 • Usman Anwar, Johannes von Oswald, Louis Kirsch, David Krueger, Spencer Frei
This work investigates the vulnerability of in-context learning in transformers to hijacking attacks, focusing on the setting of linear regression tasks.
no code implementations • 10 Oct 2024 • Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi
The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks.
1 code implementation • 2 Oct 2024 • Spencer Frei, Gal Vardi
Transformers have the capacity to act as supervised learning algorithms: by properly encoding a set of labeled training ("in-context") examples and an unlabeled test example into an input sequence of vectors of the same dimension, the forward pass of the transformer can produce predictions for that unlabeled test example.
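The snippet below is a minimal sketch of this encoding for in-context linear regression, assuming the common layout in which each labeled example occupies one column of dimension $d+1$ and the query's label slot is left at zero; the paper's exact embedding may differ.

```python
import numpy as np

def build_icl_prompt(X_context, y_context, x_query):
    """Stack labeled in-context examples and one unlabeled query into a single
    token matrix of width n+1, each column a vector of dimension d+1."""
    n, d = X_context.shape
    tokens = np.zeros((d + 1, n + 1))
    tokens[:d, :n] = X_context.T        # features of the labeled examples
    tokens[d, :n] = y_context           # their labels, stored in the last row
    tokens[:d, n] = x_query             # query features
    tokens[d, n] = 0.0                  # label slot left empty for the query
    return tokens

# toy usage: 5 in-context examples in dimension 3
rng = np.random.default_rng(0)
X, beta = rng.normal(size=(5, 3)), rng.normal(size=3)
prompt = build_icl_prompt(X, X @ beta, rng.normal(size=3))
print(prompt.shape)  # (4, 6): columns of dimension d+1, n+1 tokens
```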
no code implementations • 31 Mar 2024 • Neil Mallinar, Austin Zane, Spencer Frei, Bin Yu
We follow our analysis with empirical studies that show these beneficial and malignant covariate shifts for linear interpolators on real image data, and for fully-connected neural networks in settings where the input data dimension is larger than the training sample size.
no code implementations • 4 Oct 2023 • Zhiwei Xu, Yutong Wang, Spencer Frei, Gal Vardi, Wei Hu
Second, they can undergo a period of classical, harmful overfitting -- achieving a perfect fit to training data with near-random performance on test data -- before transitioning ("grokking") to near-optimal generalization later in training.
no code implementations • 6 Aug 2023 • Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu
On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting.
no code implementations • 16 Jun 2023 • Ruiqi Zhang, Spencer Frei, Peter L. Bartlett
We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts.
no code implementations • 2 Mar 2023 • Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro
Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush-Kuhn-Tucker (KKT) conditions for margin maximization.
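For reference, a minimal LaTeX sketch of margin maximization and its KKT conditions in the simplest (linear classifier) case; the networks studied in the paper generalize this, and the statement below is background rather than the paper's exact formulation.

```latex
% Max-margin problem for a linear classifier on data (x_i, y_i), y_i \in \{\pm 1\}:
\min_{w}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 \ \ \forall i.
% KKT conditions: nonnegative multipliers, stationarity, complementary slackness.
\exists\, \lambda_i \ge 0:\quad
w = \sum_i \lambda_i y_i x_i,
\qquad
\lambda_i \bigl( y_i \langle w, x_i \rangle - 1 \bigr) = 0 \ \ \forall i.
```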
no code implementations • 13 Oct 2022 • Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data.
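As a quick numerical illustration of why near-orthogonality is common in high dimension (an illustrative Gaussian example, not the paper's exact data model): pairwise cosine similarities of i.i.d. isotropic inputs concentrate around zero at rate roughly $1/\sqrt{d}$.

```python
import numpy as np

# Illustration: i.i.d. Gaussian inputs become nearly orthogonal as the
# dimension d grows, since pairwise cosine similarities concentrate near 0.
rng = np.random.default_rng(0)
for d in (10, 100, 10_000):
    X = rng.normal(size=(50, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-normalize rows
    G = X @ X.T                                      # cosine similarities
    off_diag = np.abs(G[~np.eye(50, dtype=bool)])
    print(f"d={d:>6}: max |cos| between distinct points = {off_diag.max():.3f}")
```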
no code implementations • 15 Feb 2022 • Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett
We consider data with binary labels that are generated by an XOR-like function of the input features.
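A minimal sketch of one XOR-like labeling rule, assuming labels given by the product of the signs of two orthogonal feature projections; the paper's exact cluster construction may differ, but this captures the non-linearly-separable structure.

```python
import numpy as np

# Illustrative XOR-like labels: +1 when two feature projections agree in sign,
# -1 otherwise.  No linear classifier can separate these labels.
rng = np.random.default_rng(0)
d, n = 20, 1000
mu1, mu2 = np.eye(d)[0], np.eye(d)[1]          # two orthogonal directions
X = rng.normal(size=(n, d))
y = np.sign(X @ mu1) * np.sign(X @ mu2)        # XOR of the two sign patterns
print(np.mean(y == 1))                          # roughly balanced classes
```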
no code implementations • 11 Feb 2022 • Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett
Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent.
no code implementations • 25 Jun 2021 • Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu
We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension.
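A minimal sketch of the iterative self-training scheme described above: initialize at the pseudolabeler and, in each round, refit on pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$. The logistic loss and plain gradient descent in the inner loop are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def self_train(beta_pl, X_unlabeled, rounds=10, lr=0.1, steps=200):
    """Sketch of iterative self-training from a pseudolabeler beta_pl.
    Each round pseudolabels the unlabeled data with the current classifier,
    then fits those pseudolabels by gradient descent on the logistic loss
    (the surrogate loss here is an illustrative choice)."""
    beta = beta_pl.copy()
    for _ in range(rounds):
        y_hat = np.sign(X_unlabeled @ beta)          # pseudolabels sgn(<beta_t, x>)
        for _ in range(steps):                        # refit to the pseudolabels
            margins = y_hat * (X_unlabeled @ beta)
            grad = -(X_unlabeled.T @ (y_hat / (1 + np.exp(margins)))) / len(y_hat)
            beta -= lr * grad
    return beta
```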
no code implementations • NeurIPS 2021 • Spencer Frei, Quanquan Gu
We further show that many existing guarantees for neural networks trained by gradient descent can be unified through proxy convexity and proxy PL inequalities.
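For background, the classical Polyak-Lojasiewicz (PL) inequality that the proxy terminology builds on is sketched below; the precise proxy-convexity and proxy-PL definitions from the paper are not reproduced here.

```latex
% Classical Polyak--Lojasiewicz (PL) inequality for an objective F with
% infimum F^*: the gradient norm controls suboptimality, which yields
% linear convergence of gradient descent without convexity.
\frac{1}{2}\,\|\nabla F(w)\|^2 \;\ge\; \mu\,\bigl(F(w) - F^*\bigr)
\qquad \text{for all } w, \text{ for some } \mu > 0.
```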
no code implementations • 19 Apr 2021 • Difan Zou, Spencer Frei, Quanquan Gu
To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.
1 code implementation • 4 Jan 2021 • Spencer Frei, Yuan Cao, Quanquan Gu
We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent (SGD) following an arbitrary initialization.
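A minimal sketch of this model class, assuming online SGD on the logistic loss with a fixed second layer; the initialization scale, loss, and layer details here are illustrative choices rather than the paper's exact setup.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def train_sgd(X, y, m=128, lr=0.01, epochs=5, alpha=0.1, seed=0):
    """One-hidden-layer leaky ReLU network trained by SGD on the logistic
    loss for labels y in {-1, +1}; details are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))   # arbitrary init
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)      # fixed outer layer
    for _ in range(epochs):
        for i in rng.permutation(n):                       # one example per step
            pre = W @ X[i]
            f = a @ leaky_relu(pre, alpha)                  # network output
            g = -y[i] / (1.0 + np.exp(y[i] * f))            # d(logistic loss)/d(f)
            dact = np.where(pre > 0, 1.0, alpha)            # leaky ReLU derivative
            W -= lr * g * (a * dact)[:, None] * X[i][None, :]
    return W, a
```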
no code implementations • 1 Oct 2020 • Spencer Frei, Yuan Cao, Quanquan Gu
We analyze the properties of gradient descent on convex surrogates for the zero-one loss for the agnostic learning of linear halfspaces.
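A short note on what a convex surrogate for the zero-one loss looks like; the logistic loss shown here is one standard choice and need not be the specific surrogate analyzed in the paper.

```latex
% Zero-one loss on the margin z = y <w, x> versus a convex surrogate:
\ell_{0\text{-}1}(z) = \mathbf{1}\{z \le 0\},
\qquad
\ell_{\mathrm{logistic}}(z) = \log\bigl(1 + e^{-z}\bigr),
\qquad
\ell_{0\text{-}1}(z) \;\le\; \tfrac{1}{\log 2}\,\ell_{\mathrm{logistic}}(z).
```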
no code implementations • NeurIPS 2020 • Spencer Frei, Yuan Cao, Quanquan Gu
In the agnostic PAC learning setting, where no assumption on the relationship between the labels $y$ and the input $x$ is made, if the optimal population risk is $\mathsf{OPT}$, we show that gradient descent achieves population risk $O(\mathsf{OPT})+\epsilon$ in polynomial time and sample complexity when $\sigma$ is strictly increasing.
no code implementations • NeurIPS 2019 • Spencer Frei, Yuan Cao, Quanquan Gu
The skip-connections used in residual networks have become a standard architectural choice in deep learning due to the improved training stability and generalization they provide, although there has been limited theoretical understanding of this improvement.
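A minimal sketch of the skip-connection structure in question: a residual block computes x + g(x), so it reduces to the identity map whenever the residual branch g is (near) zero. The block below is illustrative, not the paper's exact architecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Basic residual block: the skip connection adds the input back onto a
    small two-layer residual branch, so the block is near the identity map
    whenever W1 or W2 is close to zero."""
    return x + W2 @ relu(W1 @ x)

# toy usage in dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1, W2 = rng.normal(size=(16, 8)) * 0.1, rng.normal(size=(8, 16)) * 0.1
print(np.linalg.norm(residual_block(x, W1, W2) - x))  # small: near-identity at small init
```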