Search Results for author: Jingfeng Wu

Found 25 papers, 6 papers with code

How Does Critical Batch Size Scale in Pre-training?

no code implementations 29 Oct 2024 Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

Training large-scale models under given resources requires careful design of parallelism strategies.


Context-Scaling versus Task-Scaling in In-Context Learning

no code implementations 16 Oct 2024 Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin

While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling.

In-Context Learning

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

no code implementations 12 Jun 2024 Yuhang Cai, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett

The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase.

Scaling Laws in Linear Regression: Compute, Parameters, and Data

no code implementations 12 Jun 2024 Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow.


Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

no code implementations 24 Feb 2024 Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates.

General Classification

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

no code implementations 22 Feb 2024 Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component.

In-Context Learning

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

no code implementations 23 Nov 2023 Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu

Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.


How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

no code implementations 12 Oct 2023 Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett

Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters.

In-Context Learning, regression

Private Federated Frequency Estimation: Adapting to the Hardness of the Instance

no code implementations NeurIPS 2023 Jingfeng Wu, Wennan Zhu, Peter Kairouz, Vladimir Braverman

For single-round FFE, it is known that count sketching is nearly information-theoretically optimal for achieving the fundamental accuracy-communication trade-offs [Chen et al., 2022].

Fixed Design Analysis of Regularization-Based Continual Learning

no code implementations 17 Mar 2023 Haoran Li, Jingfeng Wu, Vladimir Braverman

We consider a continual learning (CL) problem with two linear regression tasks in the fixed design setting, where the feature vectors are assumed fixed and the labels are assumed to be random variables.

Continual Learning

Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

no code implementations 3 Mar 2023 Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

On the other hand, we provide some negative results for stochastic gradient descent (SGD) for ReLU regression with symmetric Bernoulli data: if the model is well-specified, the excess risk of SGD is provably no better than that of GLM-tron ignoring constant factors, for each problem instance; and in the noiseless case, GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant risk in expectation.

regression, Vocal Bursts Intensity Prediction

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

no code implementations 3 Aug 2022 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data.

regression, Transfer Learning

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

no code implementations 7 Mar 2022 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

no code implementations 12 Oct 2021 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems.


Gap-Dependent Unsupervised Exploration for Reinforcement Learning

1 code implementation 11 Aug 2021 Jingfeng Wu, Vladimir Braverman, Lin F. Yang

In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly quadratic saving in terms of $\epsilon$.

reinforcement-learning, Reinforcement Learning, +1

The Benefits of Implicit Regularization from SGD in Least Squares Problems

no code implementations NeurIPS 2021 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches.


Lifelong Learning with Sketched Structural Regularization

no code implementations 17 Apr 2021 Haoran Li, Aditya Krishnan, Jingfeng Wu, Soheil Kolouri, Praveen K. Pilly, Vladimir Braverman

In practice and due to computational constraints, most SR methods crudely approximate the importance matrix by its diagonal.

Continual Learning, Permuted-MNIST

Benign Overfitting of Constant-Stepsize SGD for Linear Regression

no code implementations 23 Mar 2021 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors).


Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

1 code implementation NeurIPS 2021 Jingfeng Wu, Vladimir Braverman, Lin F. Yang

We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and a reward function is the inner product of a preference vector with pre-specified multi-objective reward functions.

Multi-Objective Reinforcement Learning, reinforcement-learning

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate

no code implementations ICLR 2021 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu

Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory.

Learning Theory

Obtaining Adjustable Regularization for Free via Iterate Averaging

1 code implementation ICML 2020 Jingfeng Wu, Vladimir Braverman, Lin F. Yang

In sum, we obtain adjustable regularization for free for a large class of optimization problems and resolve an open question raised by Neu and Rosasco.

Open-Ended Question Answering

On the Noisy Gradient Descent that Generalizes as SGD

1 code implementation ICML 2020 Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, Zhanxing Zhu

The gradient noise of SGD is considered to play a central role in the observed strong generalization abilities of deep learning.

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects

no code implementations ICLR 2019 Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, Jinwen Ma

Along this line, we theoretically study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics.

Tangent-Normal Adversarial Regularization for Semi-supervised Learning

1 code implementation CVPR 2019 Bing Yu, Jingfeng Wu, Jinwen Ma, Zhanxing Zhu

The proposed TNAR is composed of two complementary parts, the tangent adversarial regularization (TAR) and the normal adversarial regularization (NAR).


The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

1 code implementation ICLR 2019 Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, Jinwen Ma

Along this line, we study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics.
