1 code implementation • 26 Feb 2024 • Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear.
no code implementations • 23 Oct 2023 • Tao Sun, Congliang Chen, Peng Qiao, Li Shen, Xinwang Liu, Dongsheng Li
Sign-based stochastic methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates.
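As a minimal sketch of what such an update looks like (illustrative only, not the specific algorithm analyzed in the paper): only the sign of each stochastic gradient coordinate enters the step, its magnitude is discarded.

import numpy as np

def sign_sgd_step(params, stoch_grad, lr=0.01):
    # Sign-based step: keep only coordinate-wise signs of the stochastic gradient.
    return params - lr * np.sign(stoch_grad)

# Toy usage on f(x) = 0.5 * ||x||^2 with a noisy gradient estimate.
x = np.array([2.0, -1.5, 0.3])
for _ in range(50):
    g = x + 0.1 * np.random.randn(3)   # stochastic gradient of the quadratic
    x = sign_sgd_step(x, g)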
no code implementations • 20 Aug 2022 • Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo
We point out that there is a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$, while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$.
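The order of quantifiers is the crux; a rough sketch of the two settings (our paraphrase, with the usual condition $\beta_1 < \sqrt{\beta_2}$ assumed for the divergence result, not a formal statement from either paper): the theory of Reddi et al. (2018) says that for every $(\beta_1, \beta_2)$ with $\beta_1 < \sqrt{\beta_2}$ there exists a problem $f$ on which Adam diverges, whereas in practice one fixes $f$ first and then asks for which region of $(\beta_1, \beta_2)$ Adam converges on that $f$.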
no code implementations • 28 May 2022 • Congliang Chen, Li Shen, Wei Liu, Zhi-Quan Luo
Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models.
no code implementations • 14 Jan 2021 • Congliang Chen, Li Shen, Fangyu Zou, Wei Liu
Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to diverge even in simple convex settings via a few simple counterexamples.
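One counterexample of this flavor (a sketch in the spirit of Reddi et al., 2018; the exact constants and hyperparameter conditions are omitted here) is the online convex problem
$$f_t(x) = \begin{cases} C\,x, & t \bmod 3 = 1,\\ -x, & \text{otherwise}, \end{cases} \qquad x \in [-1, 1],\ C > 2,$$
whose long-run optimum is $x = -1$, yet Adam with suitable constant $(\beta_1, \beta_2)$ can be driven toward $x = +1$.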
no code implementations • 29 Apr 2020 • Congliang Chen, Li Shen, Hao-Zhi Huang, Wei Liu
In this paper, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model.
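A minimal simulation of the parameter-server pattern (our sketch, with an Adam-style server-side update; the worker count, loss, and constants are illustrative, not the specific method or proof setting of the paper):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adaptive (Adam-style) update applied by the server to the shared parameters.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
target = np.array([1.0, 2.0, 3.0, 4.0])
for t in range(1, 201):
    # Each "worker" computes a noisy gradient of 0.5 * ||w - target||^2 on its data shard.
    worker_grads = [(w - target) + 0.1 * np.random.randn(4) for _ in range(8)]
    g = np.mean(worker_grads, axis=0)        # server aggregates the workers' gradients
    w, m, v = adam_step(w, g, m, v, t)       # server updates and broadcasts the parameters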
no code implementations • 10 Aug 2018 • Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu
Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, etc.
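A hedged sketch of the shared template behind this class (a momentum buffer combined with a coordinate-wise adaptive step size; the function name, modes, and constants are illustrative, not the paper's exact scheme):

import numpy as np

def adaptive_momentum_step(w, g, m, acc, mode="rmsprop",
                           lr=0.01, beta=0.9, rho=0.99, eps=1e-8):
    # Momentum buffer shared by the whole family of methods.
    m = beta * m + g
    # The second-moment accumulator is what distinguishes the members:
    if mode == "adagrad":
        acc = acc + g ** 2                       # AdaGrad: sum of all squared gradients
    else:
        acc = rho * acc + (1 - rho) * g ** 2     # RMSProp/Adam-style moving average
    return w - lr * m / (np.sqrt(acc) + eps), m, acc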
1 code implementation • CVPR 2018 • Shuyang Gu, Congliang Chen, Jing Liao, Lu Yuan
We theoretically prove that our new reshuffle-based style loss connects the global and local style losses used, respectively, by most parametric and non-parametric neural style transfer methods.