AdaGrad

AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the past gradients for $\theta_{i}$:

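$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} g_{t, i} $$

Here $g_{t, i}$ is the gradient of the objective with respect to $\theta_{i}$ at time step $t$, $G_{t}$ is a diagonal matrix whose entry $G_{t, ii}$ is the sum of the squares of the gradients with respect to $\theta_{i}$ up to time step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.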
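The update rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `adagrad_step`, the `accum` array (standing in for the diagonal of $G_{t}$, that is, the running sum of squared gradients per parameter), the value of `eps`, and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Apply one AdaGrad update; return the new parameters and accumulator."""
    # Diagonal of G_t: running sum of squared gradients, one entry per parameter.
    accum = accum + grad ** 2
    # Each parameter gets its own step size lr / sqrt(G_t,ii + eps).
    theta = theta - lr / np.sqrt(accum + eps) * grad
    return theta, accum

# Toy usage (hypothetical objective): f(theta) = 0.5 * sum(theta**2),
# whose gradient is simply theta.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for _ in range(100):
    theta, accum = adagrad_step(theta, theta, accum, lr=0.5)
```

Keeping the accumulator outside the function makes the per-parameter state explicit: two parameters that receive the same gradient at some step can still take different step sizes if their gradient histories differ.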

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations use a default value of $0.01$ and leave it unchanged. Its main weakness is the accumulation of squared gradients in the denominator. Since every added term is non-negative, the accumulated sum keeps growing during training, causing the effective learning rate to shrink and eventually become infinitesimally small.
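The shrinking step size is easy to see numerically. The sketch below assumes a single parameter whose gradient stays at a constant, hypothetical value of 1.0, so the accumulated sum after $t$ steps is simply $t$; the value of `eps` is likewise an assumption. Under these conditions the effective learning rate decays roughly as $\eta / \sqrt{t}$.

```python
import numpy as np

eta, eps = 0.01, 1e-8        # eta is the default quoted above; eps is an assumed value
g = 1.0                      # hypothetical constant gradient for one parameter

steps = np.arange(1, 10001)
accum = steps * g ** 2       # sum of squared gradients after t steps
effective_lr = eta / np.sqrt(accum + eps)

# Print the effective learning rate at a few time steps.
for t in (1, 100, 10000):
    print(t, effective_lr[t - 1])
```

With these assumed values the effective rate falls by a factor of about 10 between step 1 and step 100, and by another factor of about 10 by step 10,000. Real gradients vary in magnitude, so the decay in practice is less regular, but the accumulated sum never decreases.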

Image: Alec Radford

Tasks

Task	Papers	Share
Language Modelling	13	11.71%
BIG-bench Machine Learning	7	6.31%
Image Classification	6	5.41%
Text Generation	4	3.60%
Translation	4	3.60%
Self-Supervised Learning	3	2.70%
Federated Learning	3	2.70%
Machine Translation	3	2.70%
Continual Learning	2	1.80%

Categories

Stochastic Optimization

Large Batch Optimization