Search Results for author: Difan Zou

Found 38 papers, 4 papers with code

Improving Group Robustness on Spurious Correlation Requires Preciser Group Inference

no code implementations 22 Apr 2024 Yujin Han, Difan Zou

GIC trains a spurious attribute classifier based on two key properties of spurious correlations: (1) high correlation between spurious attributes and true labels, and (2) variability in this correlation between datasets with different group distributions.

Attribute

The Dog Walking Theory: Rethinking Convergence in Federated Learning

no code implementations 18 Apr 2024 Kun Zhai, Yifeng Gao, Xingjun Ma, Difan Zou, Guangnan Ye, Yu-Gang Jiang

In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research.

Federated Learning

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

no code implementations 2 Apr 2024 Xingwu Chen, Difan Zou

Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization.

Memorization

On the Benefits of Over-parameterization for Out-of-Distribution Generalization

no code implementations 26 Mar 2024 Yifan Hao, Yong Lin, Difan Zou, Tong Zhang

We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss.

Out-of-Distribution Generalization

Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems

no code implementations 13 Mar 2024 Junwei Su, Difan Zou, Chuan Wu

In this paper, we study the generalization performance of SGD with preconditioning for the least squares problem.

regression
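
No code accompanies this paper; purely to illustrate the setting, here is a minimal sketch of preconditioned SGD on a least squares objective. The diagonal preconditioner below is a placeholder choice, not the one analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

# Placeholder positive-definite preconditioner: inverse diagonal second moments.
P = np.diag(1.0 / np.mean(X ** 2, axis=0))

w = np.zeros(d)
eta = 0.01
for t in range(5000):
    i = rng.integers(n)                      # sample one example
    grad = (X[i] @ w - y[i]) * X[i]          # per-example least squares gradient
    w -= eta * P @ grad                      # preconditioned SGD update

print("parameter error:", np.linalg.norm(w - w_star))
```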

An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling

no code implementations 10 Mar 2024 Xunpeng Huang, Hanze Dong, Difan Zou, Tong Zhang

Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion is able to converge at a dimension-independent rate for strongly log-concave target distributions.

Towards Robust Graph Incremental Learning on Evolving Graphs

no code implementations 20 Feb 2024 Junwei Su, Difan Zou, Zijun Zhang, Chuan Wu

We provide a formal formulation and analysis of the problem, and propose a novel regularization-based technique called Structural-Shift-Risk-Mitigation (SSRM) to mitigate the impact of structural shift on catastrophic forgetting in the inductive NGIL problem.

Incremental Learning

PRES: Toward Scalable Memory-Based Dynamic Graph Neural Networks

no code implementations 6 Feb 2024 Junwei Su, Difan Zou, Chuan Wu

Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts.

Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo

no code implementations 12 Jan 2024 Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang

Specifically, DMC follows the reverse SDE of a diffusion process that transforms the target distribution to the standard Gaussian, utilizing a non-parametric score estimation.

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

no code implementations 12 Oct 2023 Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett

Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters.

In-Context Learning regression

Less is More: On the Feature Redundancy of Pretrained Models When Transferring to Few-shot Tasks

no code implementations 5 Oct 2023 Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song

Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data, that is, training a linear classifier upon frozen features extracted from the pretrained model.

Feature Importance
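
As the snippet notes, linear probing amounts to fitting a linear classifier on frozen features. A minimal sketch, assuming the features have already been extracted by some pretrained encoder (random arrays stand in for them here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random arrays stand in for features extracted from a frozen pretrained encoder.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 512))
train_labels = rng.integers(0, 5, size=100)
test_feats = rng.normal(size=(50, 512))
test_labels = rng.integers(0, 5, size=50)

# Linear probing: train only a linear classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
print("probe accuracy:", probe.score(test_feats, test_labels))
```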

Benign Overfitting in Two-Layer ReLU Convolutional Neural Networks for XOR Data

no code implementations 3 Oct 2023 Xuran Meng, Difan Zou, Yuan Cao

Modern deep learning models are usually highly over-parameterized so that they can overfit the training data.

The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks

no code implementations 20 Jun 2023 Yuan Cao, Difan Zou, Yuanzhi Li, Quanquan Gu

We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.

Binary Classification

Per-Example Gradient Regularization Improves Learning Signals from Noisy Data

no code implementations 31 Mar 2023 Xuran Meng, Yuan Cao, Difan Zou

In this paper, we explore the per-example gradient regularization (PEGR) and present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations.

Memorization
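
A minimal sketch of the general idea of per-example gradient regularization, penalizing the squared norm of each example's gradient alongside the usual loss; the toy model, the loop-based per-example gradients, and the coefficient lam are illustrative choices, not the paper's setup.

```python
import torch

# Toy model, data, and coefficient lam -- illustrative placeholders only.
torch.manual_seed(0)
model = torch.nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
lam = 0.01

opt.zero_grad()
total = torch.zeros(())
for xi, yi in zip(x, y):
    loss_i = torch.nn.functional.cross_entropy(model(xi[None]), yi[None])
    # Per-example gradient, kept in the graph so its squared norm is differentiable.
    grads = torch.autograd.grad(loss_i, model.parameters(), create_graph=True)
    penalty = sum(g.pow(2).sum() for g in grads)
    total = total + loss_i + lam * penalty
(total / len(x)).backward()
opt.step()
```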

The Benefits of Mixup for Feature Learning

no code implementations 15 Mar 2023 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data).

Data Augmentation
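
For reference, the Mixup training scheme analyzed here augments each batch by convexly combining random pairs of examples and their one-hot labels; a minimal sketch, where the Beta parameter alpha is a placeholder hyperparameter:

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=2.0, rng=None):
    """Standard Mixup: convex-combine a batch with a shuffled copy of itself,
    mixing inputs and one-hot labels with the same Beta-distributed weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]
```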

Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

no code implementations 3 Mar 2023 Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

On the other hand, we provide some negative results for stochastic gradient descent (SGD) for ReLU regression with symmetric Bernoulli data: if the model is well-specified, the excess risk of SGD is provably no better than that of GLM-tron ignoring constant factors, for each problem instance; and in the noiseless case, GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant risk in expectation.

regression Vocal Bursts Intensity Prediction
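
For context, GLM-tron (the classical baseline the snippet compares SGD against) performs a perceptron-style update using the known activation. A rough sketch for ReLU regression follows; it shows the classical algorithm, not this paper's analysis, and classical guarantees assume bounded features (e.g. ||x_i|| <= 1).

```python
import numpy as np

def glmtron_relu(X, y, iters=200):
    """Perceptron-style GLM-tron updates for ReLU regression:
    w <- w + mean_i (y_i - relu(<w, x_i>)) * x_i,
    i.e. no derivative of the activation is used."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        preds = np.maximum(X @ w, 0.0)
        w = w + X.T @ (y - preds) / len(y)
    return w
```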

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

no code implementations 3 Aug 2022 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data.

regression Transfer Learning

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

no code implementations 7 Mar 2022 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

no code implementations 12 Oct 2021 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems.

regression

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

no code implementations 25 Aug 2021 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.

Image Classification

The Benefits of Implicit Regularization from SGD in Least Squares Problems

no code implementations NeurIPS 2021 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches.

regression

Self-training Converts Weak Learners to Strong Learners in Mixture Models

no code implementations 25 Jun 2021 Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu

We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension.

Binary Classification
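
A minimal sketch of the iterative self-training loop described above, with logistic-loss gradient updates standing in for the paper's exact update rule; the pseudolabeler beta_pl, the learning rate, and the number of rounds are placeholders.

```python
import numpy as np

def self_train(beta_pl, X_unlabeled, rounds=20, lr=0.5):
    """Iterative self-training sketch: start from a weak pseudolabeler beta_pl and
    repeatedly refit on the pseudolabels sign(<beta_t, x>) that it assigns."""
    beta = np.array(beta_pl, dtype=float)
    n = len(X_unlabeled)
    for _ in range(rounds):
        y_pseudo = np.sign(X_unlabeled @ beta)                  # current pseudolabels
        margins = np.clip(y_pseudo * (X_unlabeled @ beta), -30, 30)
        # One gradient step on the logistic loss with the pseudolabels; the paper's
        # exact update rule and sample schedule are not reproduced here.
        grad = -(X_unlabeled.T @ (y_pseudo / (1.0 + np.exp(margins)))) / n
        beta = beta - lr * grad
    return beta
```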

Provable Robustness of Adversarial Training for Learning Halfspaces with Noise

no code implementations 19 Apr 2021 Difan Zou, Spencer Frei, Quanquan Gu

To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.

Classification General Classification +1

Benign Overfitting of Constant-Stepsize SGD for Linear Regression

no code implementations 23 Mar 2021 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors).

regression
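
For a concrete picture of the setting, here is a minimal sketch of one-pass, constant-stepsize SGD with tail iterate averaging on a synthetic linear regression problem; the dimensions, stepsize, and averaging window are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 200
X = rng.normal(size=(n, d)) / np.sqrt(d)      # rows have roughly unit norm
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.2                                     # constant stepsize (placeholder value)
tail_sum, tail_count = np.zeros(d), 0
for t in range(n):                            # single pass over the data
    w -= eta * (X[t] @ w - y[t]) * X[t]
    if t >= n // 2:                           # average the second half of the iterates
        tail_sum += w
        tail_count += 1
w_avg = tail_sum / tail_count

print("excess risk of averaged iterate:", np.mean((X @ (w_avg - w_star)) ** 2))
```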

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate

no code implementations ICLR 2021 Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu

Understanding the algorithmic bias of \emph{stochastic gradient descent} (SGD) is one of the key challenges in modern machine learning and deep learning theory.

Learning Theory

Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling

no code implementations 19 Oct 2020 Difan Zou, Pan Xu, Quanquan Gu

We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
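
As a point of reference only, the basic SGLD update analyzed here replaces the exact gradient in a Langevin step with a minibatch estimate; a minimal sketch, where grad_log_p_minibatch is a hypothetical user-supplied stochastic gradient of the log-density:

```python
import numpy as np

def sgld_sample(grad_log_p_minibatch, theta0, data, step=1e-3, iters=5000,
                batch_size=32, rng=None):
    """Minimal SGLD sketch: noisy gradient step plus injected Gaussian noise.

    grad_log_p_minibatch(theta, batch) is assumed to return an unbiased stochastic
    estimate of the gradient of log p(theta | data); this interface is an
    assumption of the sketch, not anything prescribed by the paper."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    samples = []
    for _ in range(iters):
        batch = data[rng.integers(len(data), size=batch_size)]
        grad = grad_log_p_minibatch(theta, batch)
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal(size=theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```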

Improving Adversarial Robustness Requires Revisiting Misclassified Examples

1 code implementation ICLR 2020 Yisen Wang, Difan Zou, Jin-Feng Yi, James Bailey, Xingjun Ma, Quanquan Gu

In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training.

Adversarial Robustness

On the Global Convergence of Training Deep Linear ResNets

no code implementations ICLR 2020 Difan Zou, Philip M. Long, Quanquan Gu

We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d, k$ are the input and output dimensions respectively.

Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction

1 code implementation NeurIPS 2019 Difan Zou, Pan Xu, Quanquan Gu

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice.

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

no code implementations ICLR 2021 Zixiang Chen, Yuan Cao, Difan Zou, Quanquan Gu

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees.

Open-Ended Question Answering

Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks

1 code implementation NeurIPS 2019 Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu

Original full-batch GCN training requires calculating the representation of all the nodes in the graph per GCN layer, which incurs high computation and memory costs.

Node Classification
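
The paper's layer-dependent importance sampling addresses this by keeping only a fixed number of nodes per layer, sampled with probabilities driven by the normalized adjacency restricted to the nodes already selected above. A rough LADIES-style sketch; the dense matrices and the exact sampling distribution below are simplifications, not the released implementation.

```python
import numpy as np

def layerwise_sample(adj_norm, out_nodes, num_sample, rng=None):
    """Sample the input nodes for one GCN layer, given the nodes needed above.

    adj_norm is a dense normalized adjacency matrix (dense only to keep the sketch
    short); each candidate node is sampled with probability proportional to the
    squared norm of its column restricted to the rows the upper layer needs."""
    rng = rng or np.random.default_rng()
    rows = adj_norm[out_nodes]                       # only the rows needed above
    prob = (rows ** 2).sum(axis=0)
    prob = prob / prob.sum()
    k = min(num_sample, int((prob > 0).sum()))
    in_nodes = rng.choice(adj_norm.shape[1], size=k, replace=False, p=prob)
    # Reweight the sampled block so the layer's matrix product stays roughly unbiased.
    block = rows[:, in_nodes] / (k * prob[in_nodes])
    return in_nodes, block
```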

Laplacian Smoothing Stochastic Gradient Markov Chain Monte Carlo

1 code implementation 2 Nov 2019 Bao Wang, Difan Zou, Quanquan Gu, Stanley Osher

As an important Markov Chain Monte Carlo (MCMC) method, the stochastic gradient Langevin dynamics (SGLD) algorithm has achieved great success in Bayesian learning and posterior sampling.

An Improved Analysis of Training Over-parameterized Deep Neural Networks

no code implementations NeurIPS 2019 Difan Zou, Quanquan Gu

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks.

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

no code implementations 21 Nov 2018 Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data.

Binary Classification

Stochastic Variance-Reduced Hamilton Monte Carlo Methods

no code implementations ICML 2018 Difan Zou, Pan Xu, Quanquan Gu

We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly log-concave distribution.

Stochastic Optimization

Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently

no code implementations 11 Dec 2017 Yaodong Yu, Difan Zou, Quanquan Gu

We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity.

Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

no code implementations NeurIPS 2018 Pan Xu, Jinghui Chen, Difan Zou, Quanquan Gu

Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime.
