AdaGrad

AdaGrad is a stochastic optimization method that adapts the learning rate to each parameter. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the past gradients computed for $\theta_{i}$:

$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}}\, g_{t, i} $$

Here $g_{t, i}$ is the gradient of the objective with respect to $\theta_{i}$ at time step $t$, $G_{t}$ is a diagonal matrix whose entry $G_{t, ii}$ is the sum of the squares of the past gradients with respect to $\theta_{i}$ up to time step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the effective learning rate to shrink and eventually become infinitesimally small, at which point the method stops making meaningful updates.

Image credit: Alec Radford
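To make the per-parameter scaling concrete, here is a minimal NumPy sketch of the update rule above on a toy quadratic objective. The function name `adagrad_step`, the hyperparameter defaults, and the toy problem are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad update for a parameter vector theta.

    G accumulates the element-wise squared gradients, so parameter i is
    updated with its own effective step size lr / sqrt(G_ii + eps),
    matching the update rule above. (Illustrative sketch.)
    """
    G = G + grad ** 2          # the accumulator only grows, which is what shrinks the step size over time
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G

# Toy usage: a quadratic whose two coordinates produce gradients of very
# different magnitudes, mimicking frequent vs. infrequent features.
A = np.diag([10.0, 0.1])
theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)
for _ in range(1000):
    grad = A @ theta           # gradient of 0.5 * theta^T A theta
    theta, G = adagrad_step(theta, grad, G)

print("theta:", theta)
print("effective step sizes:", 0.01 / np.sqrt(G + 1e-8))
```

The coordinate with the consistently large gradients accumulates a larger $G_{ii}$ and therefore ends up with the smaller effective step size, which is the per-parameter adaptation described above; it also illustrates why the effective learning rate keeps shrinking as training continues.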

Latest Papers

PAPER | AUTHORS | DATE
Dimension Independence in Unconstrained Private ERM via Adaptive Preconditioning
Peter Kairouz, Mónica Ribero, Keith Rush, Abhradeep Thakurta
2020-08-14
A High Probability Analysis of Adaptive SGD with Momentum
Xiaoyu Li, Francesco Orabona
2020-07-28
Corner Proposal Network for Anchor-free, Two-stage Object Detection
Kaiwen Duan, Lingxi Xie, Honggang Qi, Song Bai, Qingming Huang, Qi Tian
2020-07-27
Adaptive Gradient Methods for Constrained Convex Optimization
Alina Ene, Huy L. Nguyen, Adrian Vladu
2020-07-17
Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs
Xunpeng Huang, Hao Zhou, Runxin Xu, Zhe Wang, Lei Li
2020-06-12
Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)
Sharan Vaswani, Frederik Kunstner, Issam Laradji, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien
2020-06-11
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, Michael W. Mahoney
2020-06-01
On the Convergence of Adam and Adagrad
Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier
2020-03-05
Stagewise Enlargement of Batch Size for SGD-based Learning
Shen-Yi Zhao, Yin-Peng Xie, Wu-Jun Li
2020-02-26
Adaptive Online Learning with Varying Norms
Ashok Cutkosky
2020-02-10
Revisiting the Generalization of Adaptive Gradient Methods
Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang
2020-01-01
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, Tianbao Yang
2019-12-26
Second-order Information in First-order Optimization Methods
Yuzheng Hu, Licong Lin, Shange Tang
2019-12-20
Parameter Continuation Methods for the Optimization of Deep Neural Networks
Harsh Nilesh Pathak, Randy Clinton Paffenroth
2019-12-16
Memory Efficient Adaptive Optimization
Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer
2019-12-01
Adaptive Gradient Descent for Convex and Non-Convex Stochastic Optimization
Aleksandr Ogaltsov, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, Vladimir Spokoiny
2019-11-19
An Adaptive and Momental Bound Method for Stochastic Learning
Jianbang Ding, Xuancheng Ren, Ruixuan Luo, Xu Sun
2019-10-27
Implementation of a modified Nesterov's Accelerated quasi-Newton Method on Tensorflow
S. Indrapriyadarsini, Shahrzad Mahboubi, Hiroshi Ninomiya, Hideki Asai
2019-10-21
Adaptive Step Sizes in Variance Reduction via Regularization
Bingcong Li, Georgios B. Giannakis
2019-10-15
diffGrad: An Optimization Method for Convolutional Neural Networks
Shiv Ram Dubey, Soumendu Chakraborty, Swalpa Kumar Roy, Snehasis Mukherjee, Satish Kumar Singh, Bidyut Baran Chaudhuri
2019-09-12
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
2019-09-11
Meta-descent for Online, Continual Prediction
Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, Martha White
2019-07-17
Augmenting Self-attention with Persistent Memory
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin
2019-07-02
Adaptively Preconditioned Stochastic Gradient Langevin Dynamics
Chandrasekaran Anirudh Bhardwaj
2019-06-10
The Implicit Bias of AdaGrad on Separable Data
Qian Qian, Xiaoyuan Qian
2019-06-09
AdaOja: Adaptive Learning Rates for Streaming PCA
Amelia Henriksen, Rachel Ward
2019-05-28
Hyper-Regularization: An Adaptive Choice for the Learning Rate in Gradient Descent
Guangzeng Xie, Hao Jin, Dachao Lin, Zhihua Zhang
2019-05-01
Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun
2019-02-26
Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network
Xiaoxia Wu, Simon S. Du, Rachel Ward
2019-02-19
A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise
Francis Bach, Kfir Y. Levy
2019-02-05
Compressing Gradient Optimizers via Count-Sketches
Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava
2019-02-01
Memory-Efficient Adaptive Optimization
Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer
2019-01-30
A Sufficient Condition for Convergences of Adam and RMSProp
Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu
2018-11-23
Practical Bayesian Learning of Neural Networks via Adaptive Optimisation Methods
Samuel Kessler, Arnold Salas, Vincent W. C. Tan, Stefan Zohren, Stephen Roberts
2018-11-08
Riemannian Adaptive Optimization Methods
Gary Bécigneul, Octavian-Eugen Ganea
2018-10-01
Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions
Zaiyi Chen, Zhuoning Yuan, Jinfeng Yi, Bowen Zhou, Enhong Chen, Tianbao Yang
2018-08-20
Weighted AdaGrad with Unified Momentum
Fangyu Zou, Li Shen, Zequn Jie, Ju Sun, Wei Liu
2018-08-10
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong
2018-08-08
SADAGRAD: Strongly Adaptive Stochastic Gradient Methods
Zaiyi Chen, Yi Xu, Enhong Chen, Tianbao Yang
2018-07-01
AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization
Rachel Ward, Xiaoxia Wu, Leon Bottou
2018-06-05
On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
Xiaoyu Li, Francesco Orabona
2018-05-21
Block Mean Approximation for Efficient Second Order Optimization
Yao Lu, Mehrtash Harandi, Richard Hartley, Razvan Pascanu
2018-04-16
Shampoo: Preconditioned Stochastic Tensor Optimization
Vineet Gupta, Tomer Koren, Yoram Singer
2018-02-26
LSH-SAMPLING BREAKS THE COMPUTATIONAL CHICKEN-AND-EGG LOOP IN ADAPTIVE STOCHASTIC GRADIENT ESTIMATION
Beidi Chen, Yingchen Xu, Anshumali Shrivastava
2018-01-01
Improving Generalization Performance by Switching from Adam to SGD
Nitish Shirish Keskar, Richard Socher
2017-12-20
AdaBatch: Efficient Gradient Aggregation Rules for Sequential and Parallel Stochastic Gradient Methods
Alexandre Défossez, Francis Bach
2017-11-06
Why ADAGRAD Fails for Online Topic Modeling
You Lu, Jeffrey Lund, Jordan Boyd-Graber
2017-09-01
A Unified Approach to Adaptive Regularization in Online and Stochastic Optimization
Vineet Gupta, Tomer Koren, Yoram Singer
2017-06-20
YellowFin and the Art of Momentum Tuning
Jian Zhang, Ioannis Mitliagkas
2017-06-12
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht
2017-05-23
Efficient Parallel Translating Embedding For Knowledge Graphs
Denghui Zhang, Manling Li, Yantao Jia, Yuanzhuo Wang, Xueqi Cheng
2017-03-30
Improving Neural Language Models with a Continuous Cache
Edouard Grave, Armand Joulin, Nicolas Usunier
2016-12-13
Scalable Adaptive Stochastic Optimization Using Random Projections
Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M. Buhmann, Nicolai Meinshausen
2016-11-21
Relativistic Monte Carlo
Xiaoyu Lu, Valerio Perrone, Leonard Hasenclever, Yee Whye Teh, Sebastian J. Vollmer
2016-09-14
CompAdaGrad: A Compressed, Complementary, Computationally-Efficient Adaptive Gradient Method
Nishant A. Mehta, Alistair Rendell, Anish Varghese, Christfried Webers
2016-09-12
Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization
Changyou Chen, David Carlson, Zhe Gan, Chunyuan Li, Lawrence Carin
2015-12-25
Speed learning on the fly
Pierre-Yves Massé, Yann Ollivier
2015-11-08
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs
Nitish Shirish Keskar, Albert S. Berahas
2015-11-04
Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro
2015-06-08
Dropout Training as Adaptive Regularization
Stefan Wager, Sida Wang, Percy Liang
2013-07-04
