no code implementations • 1 Mar 2024 • Toki Tahmid Inan, Mingrui Liu, Amarda Shehu
Our investigation encompasses a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping framework.
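As a rough illustration of the Basin Hopping framework referenced above (a generic sketch, not the paper's proposed algorithms), a basin-hopping loop alternates local gradient descent with random perturbations and keeps the lowest basin found; the test function, step sizes, and greedy acceptance rule here are all illustrative choices:

```python
import random

def basin_hopping(f, grad, x0, n_hops=50, step=4.0, lr=0.01, descent_iters=200, seed=0):
    """Greedy basin hopping: local gradient descent, then a random
    perturbation ("hop"); keep the new basin only if it is lower."""
    rng = random.Random(seed)

    def local_min(x):
        for _ in range(descent_iters):
            x -= lr * grad(x)
        return x

    best = local_min(x0)
    for _ in range(n_hops):
        candidate = local_min(best + rng.uniform(-step, step))
        if f(candidate) < f(best):   # accept only downhill hops (greedy variant)
            best = candidate
    return best

# Tilted double well: local minimum near x = 2, global minimum near x = -2.
f = lambda x: (x * x - 4) ** 2 + 0.5 * x
g = lambda x: 4 * x * (x * x - 4) + 0.5

x_star = basin_hopping(f, g, x0=1.9)   # plain descent from 1.9 stalls near +2
```

Plain gradient descent from x0 = 1.9 gets trapped in the shallower basin near +2; the hops let the search escape to the global basin near -2.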
1 code implementation • 17 Jan 2024 • Jie Hao, Xiaochuan Gong, Mingrui Liu
When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle.
no code implementations • 2 Oct 2023 • Yunwen Lei, Tao Sun, Mingrui Liu
We show both minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.
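To make the local SGD protocol concrete (a minimal sketch on a toy noiseless least-squares problem, not the paper's setting), each worker runs a few SGD steps on its own shard between communication rounds, and the server averages the models; the shard sizes, learning rate, and round counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0])
# Four workers, each holding its own 50-sample shard of a noiseless linear model.
shards = []
for _ in range(4):
    X = rng.standard_normal((50, 3))
    shards.append((X, X @ w_true))

def local_sgd(shards, lr=0.05, rounds=50, local_steps=5, seed=1):
    """Local SGD: each worker runs `local_steps` SGD steps on its shard,
    then the server averages the models (one communication round)."""
    step_rng = np.random.default_rng(seed)
    w = np.zeros(3)
    for _ in range(rounds):
        local_models = []
        for X, y in shards:
            wi = w.copy()
            for _ in range(local_steps):
                i = step_rng.integers(len(y))      # sample one data point
                g = (X[i] @ wi - y[i]) * X[i]      # least-squares gradient
                wi -= lr * g
            local_models.append(wi)
        w = np.mean(local_models, axis=0)          # communicate and average
    return w

w_out = local_sgd(shards)
```

Minibatch SGD would instead average the per-worker gradients every step; local SGD trades communication frequency for extra local computation.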
1 code implementation • 14 Feb 2023 • Michael Crawshaw, Yajie Bao, Mingrui Liu
In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting.
1 code implementation • 15 Nov 2022 • Zheng Wang, Mingrui Liu, Cheng Long, Qianru Zhang, Jiangneng Li, Chunyan Miao
The DeepSEI model incorporates two networks, a deep network and a recurrent network, which extract features of the mobility records from three aspects, namely spatiality, temporality, and activity, the former at a coarse level and the latter at a detailed level.
no code implementations • 23 Aug 2022 • Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun Zhuang
We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
no code implementations • 17 Jul 2022 • Yajie Bao, Michael Crawshaw, Shan Luo, Mingrui Liu
This paper investigates a class of composite optimization and statistical recovery problems in the FL setting, whose loss function consists of a data-dependent smooth loss and a non-smooth regularizer.
no code implementations • 27 May 2022 • Kaiyi Ji, Mingrui Liu, Yingbin Liang, Lei Ying
Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations.
1 code implementation • 10 May 2022 • Mingrui Liu, Zhenxun Zhuang, Yunwen Lei, Chunyang Liao
Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
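A minimal sketch of the server-side pattern in question (not the paper's algorithm): each machine reports a stochastic gradient, the server averages them, clips the averaged gradient, and steps. The toy objective ||w||^4, noise level, and clipping threshold are all illustrative assumptions; ||w||^4 is a standard example whose Hessian is unbounded, so unclipped SGD with this step size would diverge from a distant start:

```python
import numpy as np

def clipped_distributed_sgd(w, n_machines=8, lr=0.1, clip=1.0, steps=150, seed=0):
    """Average per-machine stochastic gradients of f(w) = ||w||^4 on a server,
    clip the average to norm `clip`, then take a gradient step."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grads = [4 * (w @ w) * w + 0.05 * rng.standard_normal(w.shape)
                 for _ in range(n_machines)]       # per-machine noisy gradients
        g = np.mean(grads, axis=0)                 # server-side averaging
        norm = np.linalg.norm(g)
        if norm > clip:
            g *= clip / norm                       # clip the averaged gradient
        w = w - lr * g
    return w

w_out = clipped_distributed_sgd(np.array([3.0, -3.0]))
```

Averaging before clipping also shrinks the gradient noise by roughly the square root of the number of machines, which is the kind of parallel benefit the question above concerns.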
1 code implementation • 31 Jan 2022 • Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona
First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$.
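The distinction above can be sketched in a few lines (a simplified single-step illustration, not the paper's analysis): Adam-$\ell_2$ folds the decay term $wd \cdot w$ into the gradient before the moment estimates, while the AdamW-style update applies the decay directly to the weights after the adaptive step, a proximal-style shrinkage. All hyperparameter values here are illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1,
              decoupled=False):
    """One Adam-style step. decoupled=False folds wd*w into the gradient
    (Adam-l2); decoupled=True applies the decay directly to w after the
    adaptive step (the AdamW, proximal-style update)."""
    if not decoupled:
        g = g + wd * w                    # l2 penalty enters the moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w               # closed-form shrinkage toward 0
    return w, m, v

w0, g0 = np.ones(2), np.array([0.1, -0.2])
w_l2, _, _ = adam_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
w_dec, _, _ = adam_step(w0, g0, np.zeros(2), np.zeros(2), t=1, decoupled=True)
```

Even on one step with identical inputs the two rules produce different iterates, because in Adam-$\ell_2$ the decay term is rescaled by the adaptive denominator.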
no code implementations • 2 Dec 2021 • Wei Zhang, Mingrui Liu, Yu Feng, Xiaodong Cui, Brian Kingsbury, Yuhai Tu
We conduct extensive studies over 18 state-of-the-art DL models/tasks and demonstrate that DPSGD often converges in cases where SSGD diverges for large learning rates in the large batch setting.
Automatic Speech Recognition (ASR) +1
no code implementations • NeurIPS 2021 • Yunwen Lei, Mingrui Liu, Yiming Ying
We develop a novel high-probability generalization bound for uniformly stable algorithms that incorporates variance information for better generalization. Based on this bound, we establish the first nonsmooth learning algorithm to achieve almost optimal high-probability and dimension-independent generalization bounds in linear time.
no code implementations • 21 Oct 2021 • Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung
Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.
Automatic Speech Recognition (ASR) +1
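The ring communication pattern mentioned above can be illustrated with a single synchronous gossip step (a generic sketch of ring averaging, not the ADPSGD variants studied in the paper): each worker averages its model with its two ring neighbors, and repeated mixing drives all workers to consensus at the global average without any worker ever communicating beyond its neighbors:

```python
import numpy as np

def ring_mix(models):
    """One synchronous gossip step on a fixed ring: every worker averages its
    model with its two ring neighbors (a doubly stochastic mixing matrix)."""
    n = len(models)
    return [(models[(i - 1) % n] + models[i] + models[(i + 1) % n]) / 3.0
            for i in range(n)]

# Four workers start from different models; repeated mixing reaches consensus
# at the average (here 1.5) using only neighbor-to-neighbor communication.
models = [np.array([float(i)]) for i in range(4)]
for _ in range(60):
    models = ring_mix(models)
```

The mixing matrix is doubly stochastic, so the average of all models is preserved at every step; asynchronous and randomized variants change when and with whom each worker mixes, not this invariant.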
no code implementations • 27 Feb 2021 • Mingrui Liu, Francesco Orabona
This means that the convergence speed does not improve even if the algorithm starts from the optimal solution; hence, it is oblivious to the initialization.
no code implementations • 13 Feb 2021 • Xiaoyu Li, Mingrui Liu, Francesco Orabona
In this paper, we focus on the convergence rate of the last iterate of SGDM.
no code implementations • 1 Jan 2021 • Wei Zhang, Mingrui Liu, Yu Feng, Brian Kingsbury, Yuhai Tu
We conduct extensive studies over 12 state-of-the-art DL models/tasks and demonstrate that DPSGD consistently outperforms SSGD in the large batch setting; and DPSGD converges in cases where SSGD diverges for large learning rates.
Automatic Speech Recognition (ASR) +1
no code implementations • 24 Nov 2020 • Mingrui Liu, Wei Zhang, Francesco Orabona, Tianbao Yang
As a result, Adam$^+$ requires little parameter tuning, like Adam, yet it enjoys a provable convergence guarantee.
Automatic Speech Recognition (ASR) +4
1 code implementation • ICML 2020 • Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, Tianbao Yang
In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model.
no code implementations • 4 Feb 2020 • Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant, Asynchronous Decentralized Parallel SGD (AD-PSGD), are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.
no code implementations • ICLR 2020 • Yunhui Guo, Mingrui Liu, Yandong Li, Liqiang Wang, Tianbao Yang, Tajana Rosing
We evaluate the effectiveness of traditional attack methods such as FGSM and PGD. The results show that A-GEM retains strong continual learning ability in the presence of adversarial examples in the memory, and that simple defense techniques such as label smoothing can further alleviate the adversarial effects.
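For reference, FGSM itself is a one-line perturbation rule: move the input by a fixed step in the sign of the input-gradient of the loss. A minimal sketch on a linear logistic model (the weights, input, and epsilon below are illustrative, not from the paper):

```python
import numpy as np

def fgsm(x, y, w, eps):
    """FGSM on a linear logistic model: perturb x by eps in the sign of the
    input-gradient of the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))   # predicted P(y = 1)
    grad_x = (p - y) * w                 # d(logistic loss)/dx
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0])                # toy classifier: predict 1 if x @ w > 0
x = np.array([1.0, 1.0])                 # clean point, correctly scored positive
x_adv = fgsm(x, y=1.0, w=w, eps=0.6)     # adversarial copy of x
```

PGD iterates essentially this same step several times with a projection back into the epsilon-ball around the clean input.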
no code implementations • ICLR 2020 • Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, Tianbao Yang
Then we propose an adaptive variant of OSG named Optimistic Adagrad (OAdagrad) and reveal an \emph{improved} adaptive complexity $O\left(\epsilon^{-\frac{2}{1-\alpha}}\right)$, where $\alpha$ characterizes the growth rate of the cumulative stochastic gradient and $0\leq \alpha\leq 1/2$.
no code implementations • NeurIPS 2020 • Mingrui Liu, Wei Zhang, Youssef Mroueh, Xiaodong Cui, Jerret Ross, Tianbao Yang, Payel Das
Despite recent progress on decentralized algorithms for training deep neural networks, it remains unclear whether it is possible to train GANs in a decentralized manner.
1 code implementation • NeurIPS 2020 • Yunhui Guo, Mingrui Liu, Tianbao Yang, Tajana Rosing
This view leads to two improved schemes for episodic memory based lifelong learning, called MEGA-I and MEGA-II.
no code implementations • 25 Sep 2019 • Yunhui Guo, Mingrui Liu, Tianbao Yang, Tajana Rosing
In this paper, we introduce a novel and effective lifelong learning algorithm, called MixEd stochastic GrAdient (MEGA), which allows deep neural networks to acquire the ability of retaining performance on old tasks while learning new tasks.
no code implementations • ICLR 2020 • Mingrui Liu, Zhuoning Yuan, Yiming Ying, Tianbao Yang
In this paper, we consider stochastic AUC maximization problem with a deep neural network as the predictive model.
no code implementations • NeurIPS 2018 • Mingrui Liu, Zhe Li, Xiaoyu Wang, Jin-Feng Yi, Tianbao Yang
Negative curvature descent (NCD) method has been utilized to design deterministic or stochastic algorithms for non-convex optimization aiming at finding second-order stationary points or local minima.
no code implementations • NeurIPS 2018 • Xiaoxuan Zhang, Mingrui Liu, Xun Zhou, Tianbao Yang
To advance OFO, we propose an efficient online algorithm based on simultaneously learning a posterior probability of class and learning an optimal threshold by minimizing a stochastic strongly convex function with unknown strong convexity parameter.
no code implementations • 24 Oct 2018 • Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang
In this paper, we consider first-order convergence theory and algorithms for solving a class of non-convex non-concave min-max saddle-point problems, whose objective function is weakly convex in the variables of minimization and weakly concave in the variables of maximization.
no code implementations • 4 Oct 2018 • Hassan Rafique, Mingrui Liu, Qihang Lin, Tianbao Yang
Min-max problems have broad applications in machine learning, including learning with non-decomposable loss and learning with robustness to data distribution.
no code implementations • ICML 2018 • Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, Tianbao Yang
In this paper, we consider statistical learning with AUC (area under ROC curve) maximization in the classical stochastic setting where one random data drawn from an unknown distribution is revealed at each iteration for updating the model.
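Since AUC is a pairwise quantity, a simple stochastic scheme (a generic sketch, not the paper's one-sample algorithm) samples one positive/negative pair per step and descends a pairwise squared surrogate; the toy data and learning rate here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy binary data: positives around (1, 1), negatives around (-1, -1).
X = np.vstack([rng.standard_normal((100, 2)) + 1.0,
               rng.standard_normal((100, 2)) - 1.0])
y = np.array([1] * 100 + [0] * 100)

def pairwise_auc_sgd(X, y, lr=0.02, steps=4000, seed=1):
    """Stochastic pairwise AUC surrogate: at each step, sample one
    positive/negative pair and descend 0.5 * (1 - w.(xp - xn))^2."""
    pair_rng = np.random.default_rng(seed)
    pos, neg = X[y == 1], X[y == 0]
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        d = pos[pair_rng.integers(len(pos))] - neg[pair_rng.integers(len(neg))]
        w += lr * (1.0 - w @ d) * d    # push positive scores above negatives
    return w

def auc(w, X, y):
    """Empirical AUC: fraction of positive/negative pairs ranked correctly."""
    s = X @ w
    return np.mean(s[y == 1][:, None] > s[y == 0][None, :])

w = pairwise_auc_sgd(X, y)
```

The classical one-sample-per-iteration setting in the abstract is harder precisely because a fresh sample does not come paired, which motivates the min-max reformulations used in this line of work.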
no code implementations • NeurIPS 2018 • Mingrui Liu, Xiaoxuan Zhang, Lijun Zhang, Rong Jin, Tianbao Yang
Error bound conditions (EBC) are properties that characterize the growth of an objective function when a point is moved away from the optimal set.
no code implementations • NeurIPS 2017 • Mingrui Liu, Tianbao Yang
Recent studies have shown that the proximal gradient (PG) method and the accelerated gradient method (APG) with restarting can enjoy linear convergence under a condition weaker than strong convexity, namely a quadratic growth condition (QGC).
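A minimal sketch of the restarting idea (a generic fixed-frequency restart of Nesterov's method, not the paper's scheme; the quadratic test problem and restart period are illustrative): run accelerated gradient steps, and periodically reset the momentum to the current iterate, which is what recovers a linear rate under quadratic growth:

```python
import numpy as np

def apg_with_restart(grad, x0, L, iters=1000, restart_every=50):
    """Nesterov's accelerated gradient with fixed-frequency restarting:
    the momentum sequence is reset every `restart_every` iterations."""
    x = x0.copy()
    y = x0.copy()
    t = 1.0
    for k in range(1, iters + 1):
        x_next = y - grad(y) / L                   # gradient step at extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
        if k % restart_every == 0:                 # restart: forget momentum
            y, t = x.copy(), 1.0
    return x

A = np.diag([1.0, 100.0])                          # ill-conditioned quadratic, min at 0
x_out = apg_with_restart(lambda v: A @ v, np.array([1.0, 1.0]), L=100.0)
```

With a proximal step in place of the plain gradient step, the same restart schedule yields the PG/APG variants discussed in the abstract.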
no code implementations • NeurIPS 2017 • Yi Xu, Mingrui Liu, Qihang Lin, Tianbao Yang
The novelty of the proposed scheme lies in its adaptivity to a local sharpness property of the objective function, which marks the key difference from previous adaptive schemes that adjust the penalty parameter per iteration based on certain conditions on the iterates.
no code implementations • 25 Oct 2017 • Mingrui Liu, Tianbao Yang
In this paper, we study stochastic non-convex optimization with non-convex random functions.
no code implementations • 25 Sep 2017 • Mingrui Liu, Tianbao Yang
To the best of our knowledge, the proposed stochastic algorithm is the first one that converges to a second-order stationary point in {\it high probability} with a time complexity independent of the sample size and almost linear in dimensionality.
no code implementations • 23 Nov 2016 • Mingrui Liu, Tianbao Yang
Recent studies have shown that the proximal gradient (PG) method and the accelerated gradient method (APG) with restarting can enjoy linear convergence under a condition weaker than strong convexity, namely a quadratic growth condition (QGC).