no code implementations • ICML 2020 • Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, Ali Jadbabaie
Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions.
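One natural way to formalize the sentence above (a sketch; the paper's exact definition may instead use the Goldstein subdifferential) is

$$x \text{ is } (\delta, \epsilon)\text{-stationary} \iff \exists\, y \text{ with } \|y - x\| \le \delta \text{ and } \min_{g \in \partial f(y)} \|g\| \le \epsilon,$$

where $\partial f$ denotes a generalized (Clarke) subdifferential; for smooth $f$ this recovers the usual condition $\|\nabla f\| \le \epsilon$.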
no code implementations • 15 Feb 2024 • Xiang Cheng, Jingzhao Zhang, Suvrit Sra
We study the task of efficiently sampling from a Gibbs distribution $d\pi^* = e^{-h} \, d\mathrm{vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice.
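For illustration, here is a minimal numpy sketch of one such geometric Langevin step on the unit sphere, where the exponential map has a closed form; the potential `h` and step size `eta` are placeholder choices, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def h_grad(x):
    # Placeholder potential h(x) = <a, x>; its Riemannian gradient is the
    # tangential projection of the Euclidean gradient a.
    a = np.array([1.0, 0.0, 0.0])
    return a - np.dot(a, x) * x

def exp_map(x, v):
    # Exponential map on the unit sphere: follow the geodesic from x along v.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def langevin_step(x, eta=0.01):
    # Gaussian noise in the ambient space, projected to the tangent space at x.
    xi = rng.standard_normal(3)
    xi_tan = xi - np.dot(xi, x) * x
    v = -eta * h_grad(x) + np.sqrt(2.0 * eta) * xi_tan
    return exp_map(x, v)

x = np.array([0.0, 0.0, 1.0])
for _ in range(1000):
    x = langevin_step(x)
print(x, np.linalg.norm(x))  # iterate stays on the sphere up to round-off
```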
1 code implementation • 22 Oct 2023 • Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models.
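As a rough illustration of the trade-off (a generic local-SGD simulation, not necessarily this paper's scheme), workers below take `K` local steps between synchronizations instead of averaging gradients at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, workers, K, eta, rounds = 10, 4, 8, 0.05, 50

def stoch_grad(w):
    # Noisy gradient of the toy objective f(w) = 0.5 * ||w||^2.
    return w + 0.1 * rng.standard_normal(dim)

w_global = rng.standard_normal(dim)
for _ in range(rounds):                  # one communication per round
    local_models = []
    for _ in range(workers):
        w = w_global.copy()
        for _ in range(K):               # K local SGD steps, no communication
            w -= eta * stoch_grad(w)
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)  # synchronize by model averaging

print(np.linalg.norm(w_global))
```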
no code implementations • 16 Aug 2023 • Pengkun Yang, Jingzhao Zhang
We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases quickly, whereas in the second phase, the error depends exponentially on the data dimension and decreases slowly.
no code implementations • 26 Jun 2023 • Lesi Chen, Yaohua Ma, Jingzhao Zhang
Designing efficient algorithms for bilevel optimization is challenging because the lower-level problem defines a feasibility set implicitly via another optimization problem.
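Concretely, the generic bilevel problem takes the form below, where the feasible lower-level solutions $y^*(x)$ are defined only implicitly by the inner $\arg\min$:

$$\min_{x} \; f\big(x, y^*(x)\big) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y} \; g(x, y).$$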
no code implementations • NeurIPS 2023 • Xiang Cheng, Bohan Wang, Jingzhao Zhang, Yusong Zhu
However, on the theory side, MCMC algorithms suffer from a slow mixing rate when $\pi(x)$ is non-log-concave.
no code implementations • 19 Mar 2023 • Peiyuan Zhang, Jiaye Teng, Jingzhao Zhang
Our paper examines this observation by providing excess risk lower bounds for GD and SGD in two realizable settings: (1) $\eta T = \mathcal{O}(n)$, and (2) $\eta T = \Omega(n)$, where $n$ is the size of the dataset.
no code implementations • 2 Jan 2023 • Lesi Chen, Jing Xu, Jingzhao Zhang
Bilevel optimization reveals the inner structure of otherwise opaque optimization problems, such as hyperparameter tuning, neural architecture search, and meta-learning.
no code implementations • 28 Sep 2022 • Jing Dong, Jingwei Li, Baoxiang Wang, Jingzhao Zhang
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
no code implementations • 1 Jun 2022 • Kaiyue Wen, Jiaye Teng, Jingzhao Zhang
Studies on benign overfitting provide insights for the success of overparameterized deep learning models.
no code implementations • 3 Apr 2022 • Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra
Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$.
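The role of the $2/L$ threshold is easy to see numerically: for the 1-D quadratic $f(w) = \tfrac{L}{2} w^2$ (an $L$-smooth cost), gradient descent contracts iff $|1 - \eta L| < 1$, i.e. $\eta < 2/L$. A minimal check:

```python
L = 4.0  # smoothness constant of f(w) = (L/2) * w**2, so f'(w) = L * w

def run_gd(eta, steps=50, w=1.0):
    for _ in range(steps):
        w -= eta * L * w  # gradient descent: w <- (1 - eta * L) * w
    return w

print(run_gd(eta=0.4))   # eta < 2/L = 0.5: converges toward 0
print(run_gd(eta=0.6))   # eta > 2/L: |1 - eta*L| > 1, diverges
```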
no code implementations • 13 Feb 2022 • Peiyuan Zhang, Jingzhao Zhang, Suvrit Sra
Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable.
no code implementations • 28 Jan 2022 • Haowei He, Jingzhao Zhang, Yanan Wang, Benben Jiang, Shaobo Huang, Chen Wang, Yang Zhang, Gengang Xiong, Xuebing Han, Dongxu Guo, Guannan He, Minggao Ouyang
In addition to demonstrating how existing deep learning algorithms can be applied to this task, we further develop an algorithm that exploits the data structure of battery systems.
no code implementations • 12 Oct 2021 • Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks.
1 code implementation • NeurIPS 2021 • Xinran Gu, Kaixuan Huang, Jingzhao Zhang, Longbo Huang
In this case, the convergence of popular FL algorithms such as FedAvg is severely degraded by straggling devices.
no code implementations • NeurIPS 2021 • Haochuan Li, Yi Tian, Jingzhao Zhang, Ali Jadbabaie
We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable.
no code implementations • 5 Feb 2021 • Tiancheng Yu, Yi Tian, Jingzhao Zhang, Suvrit Sra
To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games and our theoretical guarantees are near-optimal.
no code implementations • 1 Jan 2021 • Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie
In particular, standard results on optimal convergence rates for stochastic optimization assume either there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses.
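For reference, the uniform bound referred to here is typically stated as a condition on the stochastic gradients $g(x;\xi)$ of the following form (a standard textbook version, not necessarily the paper's exact assumption):

$$\mathbb{E}_\xi \,\|g(x;\xi) - \nabla f(x)\|^2 \le \sigma^2 \quad \text{for all } x.$$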
1 code implementation • NeurIPS 2023 • Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, Masashi Sugiyama
Weight decay is a simple yet powerful regularization technique that has been widely used in training deep neural networks (DNNs).
1 code implementation • ICLR 2021 • Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra
The label shift problem refers to the supervised learning setting where the train and test label distributions do not match.
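A common baseline for label shift (plain importance weighting, not the distributionally robust method this paper develops) reweights each example's loss by $w(y) = p_{\text{test}}(y)/p_{\text{train}}(y)$:

```python
import numpy as np

# Hypothetical class priors; in practice p_test must be estimated, e.g. via
# confusion matrices on held-out data (black-box shift estimation).
p_train = np.array([0.7, 0.2, 0.1])
p_test = np.array([0.4, 0.3, 0.3])
w = p_test / p_train                     # per-class importance weights

labels = np.array([0, 0, 1, 2, 1])       # toy minibatch labels
per_example_loss = np.array([0.5, 1.2, 0.3, 0.9, 0.4])
reweighted = (w[labels] * per_example_loss).mean()
print(reweighted)
```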
no code implementations • 28 Sep 2020 • Tianxing He, Jingzhao Zhang, Zhiming Zhou, James R. Glass
The exposure bias problem refers to the incrementally distorted generation induced by the training-generation discrepancy in teacher-forcing training of auto-regressive neural network language models (LMs).
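To make the training/generation discrepancy concrete, here is a toy sketch (a hypothetical bigram "LM", not the paper's models): during teacher forcing the model always conditions on gold prefixes, while at generation time it conditions on its own, possibly off-distribution, samples.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5
logits_table = rng.standard_normal((V, V))  # toy bigram "LM": row = last token

def next_dist(last_token):
    z = np.exp(logits_table[last_token])
    return z / z.sum()

gold = [1, 3, 2, 4]

# Teacher forcing: every prediction conditions on the gold previous token.
tf_loss = -sum(np.log(next_dist(gold[t - 1])[gold[t]])
               for t in range(1, len(gold)))

# Free-running generation: each step conditions on the model's own sample,
# so early mistakes compound -- the source of exposure bias.
tok, generated = gold[0], [gold[0]]
for _ in range(3):
    tok = rng.choice(V, p=next_dist(tok))
    generated.append(int(tok))
print(tf_loss, generated)
```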
no code implementations • 8 Jun 2020 • Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie
We study the oracle complexity of gradient-based methods for stochastic approximation problems.
no code implementations • 10 Feb 2020 • Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, Suvrit Sra
In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds.
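For context, $f$ is Hadamard directionally differentiable at $x$ in direction $d$ when the limit below exists; it is the uniformity over perturbed directions $d' \to d$ that makes the chain rule go through (a standard definition, stated here as a sketch):

$$f'(x; d) = \lim_{t \downarrow 0,\; d' \to d} \frac{f(x + t d') - f(x)}{t}.$$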
no code implementations • NeurIPS 2020 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models.
no code implementations • 25 Sep 2019 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models.
1 code implementation • ICLR 2020 • Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie
We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks.
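The clipping operation in question is the usual norm rescaling; a minimal sketch of one clipped-SGD step (generic, with a placeholder threshold `c`):

```python
import numpy as np

def clip_grad(g, c=1.0):
    # Rescale g to norm at most c; leave it unchanged if already small.
    norm = np.linalg.norm(g)
    return g * min(1.0, c / norm) if norm > 0 else g

def clipped_sgd_step(w, grad_fn, eta=0.1, c=1.0):
    return w - eta * clip_grad(grad_fn(w), c)

# Toy usage: a steep quartic whose raw gradients are huge far from 0.
grad_fn = lambda w: 4.0 * w**3
w = np.array([3.0])
for _ in range(100):
    w = clipped_sgd_step(w, grad_fn)
print(w)
```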
no code implementations • EMNLP 2021 • Tianxing He, Jingzhao Zhang, Zhiming Zhou, James Glass
Exposure bias has been regarded as a central problem for auto-regressive language models (LMs).
no code implementations • 13 Dec 2018 • Lu Mi, Macheng Shen, Jingzhao Zhang
This project report compares some known GAN and VAE models proposed prior to 2017.
no code implementations • 10 Nov 2018 • Jingzhao Zhang, Hongyi Zhang, Suvrit Sra
We study smooth stochastic optimization problems on Riemannian manifolds.
no code implementations • NeurIPS 2018 • Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie
We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method.
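A well-known ODE of this kind is the Su-Boyd-Candès continuous limit $\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0$; below is a direct forward-discretization sketch (step size and potential are illustrative, and this is not necessarily the discretization the paper analyzes):

```python
import numpy as np

grad_f = lambda x: x          # toy potential f(x) = 0.5 * x**2
h = 0.05                      # discretization step
x, v, t = np.array([5.0]), np.array([0.0]), 1.0

for _ in range(2000):
    # Semi-implicit Euler step for  x'' + (3/t) x' + grad f(x) = 0.
    v += h * (-(3.0 / t) * v - grad_f(x))
    x += h * v
    t += h
print(x)                      # decays toward the minimizer at 0
```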