Search Results for author: Congliang Chen

Found 8 papers, 2 papers with code

Why Transformers Need Adam: A Hessian Perspective

1 code implementation • 26 Feb 2024 • Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear.

Rethinking SIGN Training: Provable Nonconvex Acceleration without First- and Second-Order Gradient Lipschitz

no code implementations • 23 Oct 2023 • Tao Sun, Congliang Chen, Peng Qiao, Li Shen, Xinwang Liu, Dongsheng Li

Sign-based stochastic methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates.
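To make the sign-based update concrete, here is a minimal NumPy sketch of a signSGD-style step; the toy quadratic loss and step size are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sign_sgd_step(params, grad, lr=0.05):
    """One sign-based update: each coordinate moves by a fixed step
    in the direction of -sign(gradient); magnitudes are discarded."""
    return params - lr * np.sign(grad)

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([2.0, -3.0, 0.5])
for _ in range(100):
    x = sign_sgd_step(x, grad=x)
print(x)  # each coordinate ends within one step size (0.05) of zero
```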

Adam Can Converge Without Any Modification On Update Rules

no code implementations • 20 Aug 2022 • Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo

We point out that there is a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$, while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$.
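For context, here is a minimal NumPy sketch of the Adam iteration whose hyperparameters $(\beta_1, \beta_2)$ are discussed above; the constants shown are the usual defaults and are not taken from the paper.

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: beta1 weights the running mean of gradients,
    beta2 weights the running mean of squared gradients."""
    m = beta1 * m + (1 - beta1) * grad        # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```

Fixing the problem first and then tuning $(\beta_1, \beta_2)$ amounts to asking which regions of this two-parameter space make the iteration above converge for that fixed problem.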

Efficient-Adam: Communication-Efficient Distributed Adam

no code implementations • 28 May 2022 • Congliang Chen, Li Shen, Wei Liu, Zhi-Quan Luo

Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models.

Quantization

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

no code implementations • 14 Jan 2021 • Congliang Chen, Li Shen, Fangyu Zou, Wei Liu

Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to diverge even in the simple convex setting via a few simple counterexamples.

Stochastic Optimization

Quantized Adam with Error Feedback

no code implementations • 29 Apr 2020 • Congliang Chen, Li Shen, Hao-Zhi Huang, Wei Liu

In this paper, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model.

Quantization
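As a rough illustration of the error-feedback idea in the title, here is a NumPy sketch that pairs a simple 1-bit (sign-plus-scale) quantizer with a locally stored residual; the quantizer choice and variable names are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def quantize_with_error_feedback(update, residual):
    """Compress an update to sign bits plus one scale, and keep the
    quantization error locally so it is re-added on the next round."""
    compensated = update + residual            # fold in leftover error
    scale = np.mean(np.abs(compensated))       # one scalar per message
    quantized = scale * np.sign(compensated)   # what is sent to the server
    residual = compensated - quantized         # error kept on the worker
    return quantized, residual

# Toy usage: the residual carries whatever quantization dropped.
rng = np.random.default_rng(0)
residual = np.zeros(4)
for _ in range(3):
    update = rng.normal(size=4)
    sent, residual = quantize_with_error_feedback(update, residual)
```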

A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration

no code implementations • 10 Aug 2018 • Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu

Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, etc.

Stochastic Optimization
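To illustrate the common structure behind these methods, here is a NumPy sketch in which only the aggregation of past squared gradients changes between an AdaGrad-style sum and an RMSProp-style exponential average; this two-case split is a standard simplification, not the paper's exact weighted-aggregation framework.

```python
import numpy as np

def adaptive_step(params, grad, accum, lr=1e-2, mode="adagrad",
                  beta2=0.999, eps=1e-8):
    """One adaptive step: the per-coordinate scaling 1/sqrt(accum) is
    shared; only how 'accum' aggregates squared gradients differs."""
    if mode == "adagrad":
        accum = accum + grad ** 2                        # unweighted sum
    else:  # "rmsprop"-style
        accum = beta2 * accum + (1 - beta2) * grad ** 2  # exponential average
    params = params - lr * grad / (np.sqrt(accum) + eps)
    return params, accum
```

Adding a momentum buffer on top of either aggregation yields Adam-like accelerated variants.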

Arbitrary Style Transfer with Deep Feature Reshuffle

1 code implementation • CVPR 2018 • Shuyang Gu, Congliang Chen, Jing Liao, Lu Yuan

We theoretically prove that our new style loss based on reshuffle connects the global and local style losses used by most parametric and non-parametric neural style transfer methods, respectively.

Style Transfer
