no code implementations • 29 Oct 2024 • Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, Sham Kakade

Training large-scale models under given resources requires careful design of parallelism strategies.

no code implementations • 24 Oct 2024 • Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou

Benign overfitting refers to the phenomenon in which over-parameterized neural networks fit the training data perfectly yet still generalize well to unseen data.

no code implementations • 3 Oct 2024 • Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang

Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition.

no code implementations • 8 Aug 2024 • Xingwu Chen, Lei Zhao, Difan Zou

Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood.

no code implementations • 18 Jun 2024 • Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang

In this work, we aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.

no code implementations • 17 Jun 2024 • Shi Yan, Yan Liang, Huayu Zhang, Le Zheng, Difan Zou, Binglu Wang

By integrating the evolutionary correlations across global states into the bidirectional recursion, an explainable Bayesian recurrent neural smoother (EBRNS) is proposed for offline, data-assisted fixed-interval state smoothing.

no code implementations • 15 Jun 2024 • Chenyang Zhang, Difan Zou, Yuan Cao

Adam has become one of the most favored optimizers in deep learning problems.

1 code implementation • 4 Jun 2024 • Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, Difan Zou, Yisong Yue, Ziniu Hu

Given a desired behavior expressed in a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations.

no code implementations • 30 May 2024 • Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data paired with conditions, such as image-text and image-class pairs.

no code implementations • 28 May 2024 • Chengxing Xie, Difan Zou

Recent studies have highlighted their proficiency in some simple tasks like writing and coding through various reasoning strategies.

no code implementations • 27 May 2024 • Xunpeng Huang, Difan Zou, Yi-An Ma, Hanze Dong, Tong Zhang

Stochastic gradients have been widely integrated into Langevin-based methods to improve their scalability and efficiency in solving large-scale sampling problems.
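The core mechanic is easy to sketch: replace the full-data gradient of the potential with an unbiased minibatch estimate inside the Langevin update. Below is a minimal, self-contained toy (a 1-D Gaussian posterior with a closed-form mean; all variable names and parameter choices are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian posterior for a Gaussian mean: with a flat prior and unit
# observation noise, the potential is U(theta) = sum_i (theta - x_i)^2 / 2,
# so the exact posterior is N(mean(x), 1/n).
n, batch = 1000, 100
x = rng.normal(2.0, 1.0, size=n)

def stoch_grad(theta):
    """Unbiased minibatch estimate of grad U(theta) = n * (theta - mean(x))."""
    idx = rng.integers(0, n, size=batch)
    return (n / batch) * np.sum(theta - x[idx])

eta = 1e-4          # step size
theta, samples = 0.0, []
for t in range(20000):
    # Langevin step: noisy gradient descent plus injected Gaussian noise,
    # so the iterates sample from the posterior rather than merely optimize.
    theta = theta - eta * stoch_grad(theta) + np.sqrt(2 * eta) * rng.normal()
    if t > 5000:    # discard burn-in
        samples.append(theta)

print(abs(np.mean(samples) - x.mean()))  # small: chain centers on the posterior mean
```

Only a minibatch of the data is touched per step, which is exactly what makes Langevin-based methods scale to large datasets.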

no code implementations • 26 May 2024 • Xunpeng Huang, Difan Zou, Hanze Dong, Yi Zhang, Yi-An Ma, Tong Zhang

To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs.
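As a toy illustration of this discretization step (not the paper's method): for a 1-D Gaussian data distribution, the score of every intermediate marginal is available in closed form, so the probability-flow ODE can be integrated with plain Euler steps and no learned model. All parameter choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D VP diffusion with Gaussian data N(mu0, sigma0^2): under the forward
# SDE dx = -x/2 dt + dW, every marginal p_t is Gaussian with a closed-form
# score, so we can discretize the reverse (probability-flow) ODE directly.
mu0, sigma0, T, n_steps = 3.0, 0.5, 10.0, 2000
h = T / n_steps

def score(x, t):
    """Exact score of the marginal p_t of the forward diffusion."""
    a = np.exp(-t)                          # alpha_bar(t)
    m = np.sqrt(a) * mu0                    # marginal mean
    v = a * sigma0**2 + (1.0 - a)           # marginal variance
    return -(x - m) / v

x = rng.normal(0.0, 1.0, size=5000)         # start from the (approximate) prior
t = T
for _ in range(n_steps):                    # Euler steps on the reverse ODE
    drift = -0.5 * x - 0.5 * score(x, t)    # dx/dt = f(x) - (g^2/2) * score
    x = x - h * drift                       # integrate backward in time
    t -= h
print(x.mean(), x.std())  # approximately mu0 and sigma0
```

Trained samplers such as DDPM/DDIM follow the same recipe, except the exact score above is replaced by a learned network and the step schedule is tuned.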

1 code implementation • 22 Apr 2024 • Yujin Han, Difan Zou

GIC trains a spurious attribute classifier based on two key properties of spurious correlations: (1) high correlation between spurious attributes and true labels, and (2) variability in this correlation between datasets with different group distributions.

no code implementations • 18 Apr 2024 • Kun Zhai, Yifeng Gao, Xingjun Ma, Difan Zou, Guangnan Ye, Yu-Gang Jiang

In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research.

no code implementations • 2 Apr 2024 • Xingwu Chen, Difan Zou

Specifically, we design a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization.

no code implementations • 26 Mar 2024 • Yifan Hao, Yong Lin, Difan Zou, Tong Zhang

We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss.

no code implementations • 13 Mar 2024 • Junwei Su, Difan Zou, Chuan Wu

In this paper, we study the generalization performance of SGD with preconditioning for the least squares problem.

no code implementations • 10 Mar 2024 • Xunpeng Huang, Hanze Dong, Difan Zou, Tong Zhang

Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion converges dimension-independently for strongly log-concave target distributions.

no code implementations • 20 Feb 2024 • Junwei Su, Difan Zou, Zijun Zhang, Chuan Wu

We provide a formal formulation and analysis of the problem, and propose a novel regularization-based technique, Structural-Shift-Risk-Mitigation (SSRM), to mitigate the impact of structural shift on catastrophic forgetting in inductive NGIL.

no code implementations • 6 Feb 2024 • Junwei Su, Difan Zou, Chuan Wu

Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts.

no code implementations • 12 Jan 2024 • Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang

Specifically, DMC follows the reverse SDE of a diffusion process that transforms the target distribution to the standard Gaussian, utilizing a non-parametric score estimation.

no code implementations • 26 Oct 2023 • Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou

In view of this finding, we call such a phenomenon "benign oscillation".

no code implementations • 12 Oct 2023 • Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett

Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters.

no code implementations • 5 Oct 2023 • Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song

Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data, that is, training a linear classifier on frozen features extracted from the pretrained model.
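A minimal sketch of linear probing, with a random projection standing in for the frozen pretrained encoder (all names, shapes, and the toy data here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear probing: the encoder's weights stay frozen; only a linear classifier
# is fit on top of the extracted features.
d_in, d_feat, n = 20, 64, 400
W_frozen = rng.normal(size=(d_in, d_feat))          # stands in for the encoder
y = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, d_in)) + 2.0 * y[:, None]   # class-separated inputs

feats = np.maximum(x @ W_frozen, 0.0)               # frozen features (no gradient flows here)

# Ridge-regression probe: closed-form linear classifier on the frozen features.
A = feats.T @ feats + 1e-2 * np.eye(d_feat)
w = np.linalg.solve(A, feats.T @ (2.0 * y - 1.0))   # targets in {-1, +1}
acc = np.mean((feats @ w > 0) == (y == 1))
print(acc)  # high accuracy on this separable toy problem
```

The closed-form ridge fit is just one convenient choice of probe; logistic regression trained by gradient descent is equally common.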

no code implementations • 3 Oct 2023 • Xuran Meng, Difan Zou, Yuan Cao

Modern deep learning models are usually highly over-parameterized so that they can overfit the training data.

no code implementations • 20 Jun 2023 • Yuan Cao, Difan Zou, Yuanzhi Li, Quanquan Gu

We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.

no code implementations • 31 Mar 2023 • Xuran Meng, Yuan Cao, Difan Zou

In this paper, we explore the per-example gradient regularization (PEGR) and present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations.

no code implementations • 15 Mar 2023 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data).

no code implementations • 3 Mar 2023 • Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

On the other hand, we provide some negative results for stochastic gradient descent (SGD) for ReLU regression with symmetric Bernoulli data: if the model is well-specified, the excess risk of SGD is provably no better than that of GLM-tron ignoring constant factors, for each problem instance; and in the noiseless case, GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant risk in expectation.

no code implementations • 7 Feb 2023 • Junwei Su, Difan Zou, Chuan Wu

Recent research has witnessed a surge of interest in applying Graph Neural Networks (GNNs) to Node-wise Graph Continual Learning (NGCL), a practical yet challenging paradigm involving the continual training of a GNN on node-related tasks.

no code implementations • 3 Aug 2022 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data.

no code implementations • 7 Mar 2022 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.

no code implementations • 12 Oct 2021 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems.

no code implementations • 25 Aug 2021 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.

no code implementations • NeurIPS 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches.

no code implementations • 25 Jun 2021 • Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu

We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension.
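The flavor of such an iterative self-training scheme can be sketched on a symmetric Gaussian mixture, where refitting on one's own pseudolabels is a hard-EM-style update; this is an illustrative simplification under assumed toy data, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Gaussian mixture: x = y * mu + noise, y in {-1, +1}.
# Start from a weak pseudolabeler and repeatedly (1) pseudolabel unlabeled
# data with sgn(<beta, x>), (2) refit beta on those pseudolabels.
d, n = 10, 10000
mu = np.zeros(d); mu[0] = 2.0                  # class mean direction, ||mu|| = 2
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + rng.normal(size=(n, d))  # unlabeled mixture samples

beta = np.zeros(d); beta[0], beta[1] = 0.4, 0.9    # weak initial pseudolabeler
cos0 = beta[0] / np.linalg.norm(beta)              # initial alignment with mu

for _ in range(15):
    pseudo = np.sign(x @ beta)                 # pseudolabels sgn(<beta, x>)
    beta = (pseudo[:, None] * x).mean(axis=0)  # refit on own pseudolabels
    beta /= np.linalg.norm(beta)

cos_final = beta[0]                            # alignment after self-training
print(cos0, cos_final)  # alignment with the Bayes direction improves
```

Because the pseudolabeler starts with nontrivial (better-than-chance) accuracy, each round of pseudolabel-and-refit pulls the classifier closer to the Bayes-optimal direction, mirroring the error-threshold condition in the theorem.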

no code implementations • 19 Apr 2021 • Difan Zou, Spencer Frei, Quanquan Gu

To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.

no code implementations • 23 Mar 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors).

no code implementations • ICLR 2021 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu

Understanding the algorithmic bias of \emph{stochastic gradient descent} (SGD) is one of the key challenges in modern machine learning and deep learning theory.

no code implementations • 19 Oct 2020 • Difan Zou, Pan Xu, Quanquan Gu

We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.

2 code implementations • ICLR 2020 • Yisen Wang, Difan Zou, Jin-Feng Yi, James Bailey, Xingjun Ma, Quanquan Gu

In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training.

no code implementations • ICLR 2020 • Difan Zou, Philip M. Long, Quanquan Gu

We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network suffices to guarantee the global convergence of GD/SGD, where $d$ and $k$ are the input and output dimensions, respectively.

1 code implementation • NeurIPS 2019 • Difan Zou, Pan Xu, Quanquan Gu

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice.

no code implementations • ICLR 2021 • Zixiang Chen, Yuan Cao, Difan Zou, Quanquan Gu

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees.

1 code implementation • NeurIPS 2019 • Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu

Original full-batch GCN training requires calculating the representation of all the nodes in the graph per GCN layer, which brings in high computation and memory costs.
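The per-layer cost is visible in a minimal full-batch GCN layer, H' = ReLU(Â H W), which touches every node in the graph at every layer (toy random graph; all names and sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# One full-batch GCN layer: H' = ReLU(A_hat @ H @ W). The A_hat @ H product
# involves all N nodes, which is the per-layer computation/memory cost the
# abstract refers to; sampling-based methods avoid materializing it in full.
n_nodes, d_in, d_out = 100, 16, 8
A = (rng.random((n_nodes, n_nodes)) < 0.05).astype(float)
A = np.maximum(A, A.T)                        # symmetrize the random graph
A = A + np.eye(n_nodes)                       # add self-loops

deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))       # symmetric normalization D^-1/2 A D^-1/2

H = rng.normal(size=(n_nodes, d_in))          # input node features
W = rng.normal(size=(d_in, d_out))
H_next = np.maximum(A_hat @ H @ W, 0.0)       # representations for every node
print(H_next.shape)  # (100, 8)
```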

1 code implementation • 2 Nov 2019 • Bao Wang, Difan Zou, Quanquan Gu, Stanley Osher

As an important Markov Chain Monte Carlo (MCMC) method, the stochastic gradient Langevin dynamics (SGLD) algorithm has achieved great success in Bayesian learning and posterior sampling.

no code implementations • NeurIPS 2019 • Difan Zou, Quanquan Gu

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks.

no code implementations • 21 Nov 2018 • Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data.

no code implementations • ICML 2018 • Difan Zou, Pan Xu, Quanquan Gu

We propose a fast stochastic Hamiltonian Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution.
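For reference, a minimal full-gradient HMC loop on a 2-D Gaussian target looks as follows; the paper's contribution is a stochastic-gradient variant, so treat this purely as an illustrative baseline sketch (target and parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# HMC on a strongly log-concave target, here N(0, diag(1, 4)):
# leapfrog-integrate Hamiltonian dynamics, then Metropolis-correct.
inv_cov = np.array([1.0, 0.25])                # diagonal precision of the target

def grad_U(q):                                 # gradient of the potential U = -log p
    return inv_cov * q

def leapfrog(q, p, eps, n_leap):
    p = p - 0.5 * eps * grad_U(q)              # half step for momentum
    for _ in range(n_leap - 1):
        q = q + eps * p                        # full step for position
        p = p - eps * grad_U(q)                # full step for momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)              # final half step
    return q, p

def hamiltonian(q, p):
    return 0.5 * np.sum(inv_cov * q * q) + 0.5 * np.sum(p * p)

q, samples = np.zeros(2), []
for _ in range(3000):
    p = rng.normal(size=2)                     # resample momentum each iteration
    q_new, p_new = leapfrog(q, p, eps=0.2, n_leap=10)
    if np.log(rng.random()) < hamiltonian(q, p) - hamiltonian(q_new, p_new):
        q = q_new                              # Metropolis accept/reject
    samples.append(q.copy())

samples = np.array(samples)
print(samples.var(axis=0))  # roughly (1, 4), the target variances
```

A stochastic-gradient variant would replace `grad_U` with a minibatch estimate, at the cost of extra bias that the paper's analysis has to control.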

no code implementations • 11 Dec 2017 • Yaodong Yu, Difan Zou, Quanquan Gu

We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity.

no code implementations • NeurIPS 2018 • Pan Xu, Jinghui Chen, Difan Zou, Quanquan Gu

Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime.
