AdaBound

Introduced by Luo et al. in Adaptive Gradient Methods with Dynamic Bound of Learning Rate

AdaBound is a variant of the Adam stochastic optimizer designed to be more robust to extreme learning rates. It applies dynamic bounds to the learning rates: the lower and upper bounds are initialized to zero and infinity respectively, and both converge smoothly to a constant final step size. AdaBound therefore behaves like an adaptive method at the beginning of training and gradually, smoothly transitions to SGD (or SGD with momentum) as the time step increases.
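The behaviour of the two bounds can be illustrated with a small sketch. The closed forms below follow the bound functions reported in the AdaBound paper, written in the parameterization used by common implementations; the names `lower_bound`, `upper_bound`, `final_lr` and `gamma`, and their default values, are assumptions made for illustration and are not defined on this page.

```python
import math


def lower_bound(t, final_lr=0.1, gamma=1e-3):
    """Lower bound: 0 at t = 0, rising toward final_lr as t grows."""
    return final_lr * (1.0 - 1.0 / (gamma * t + 1.0))


def upper_bound(t, final_lr=0.1, gamma=1e-3):
    """Upper bound: infinite at t = 0, falling toward final_lr as t grows."""
    if t == 0:
        return math.inf
    return final_lr * (1.0 + 1.0 / (gamma * t))


# Print the interval [lower, upper] at a few time steps to see it narrow.
for t in (0, 1, 1_000, 1_000_000):
    print(t, lower_bound(t), upper_bound(t))
```

With these assumed forms the interval starts as (0, infinity), so early updates are effectively unclipped, and it tightens around `final_lr` late in training, which is where the SGD-like behaviour comes from.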

$$ g_{t} = \nabla{f}_{t}\left(x_{t}\right) $$

$$ m_{t} = \beta_{1t}m_{t-1} + \left(1-\beta_{1t}\right)g_{t} $$

$$ v_{t} = \beta_{2}v_{t-1} + \left(1-\beta_{2}\right)g_{t}^{2} \text{ and } V_{t} = \text{diag}\left(v_{t}\right) $$

$$ \hat{\eta}_{t} = \text{Clip}\left(\alpha/\sqrt{V_{t}}, \eta_{l}\left(t\right), \eta_{u}\left(t\right)\right) \text{ and } \eta_{t} = \hat{\eta}_{t}/\sqrt{t} $$

$$ x_{t+1} = \Pi_{\mathcal{F}, \text{diag}\left(\eta_{t}^{-1}\right)}\left(x_{t} - \eta_{t} \odot m_{t} \right) $$

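Here $\alpha$ is the initial step size, $\eta_{l}$ and $\eta_{u}$ are the lower and upper bound functions respectively, $\odot$ denotes element-wise multiplication, and $\Pi_{\mathcal{F}, \text{diag}\left(\eta_{t}^{-1}\right)}$ is the weighted projection onto the feasible set $\mathcal{F}$.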
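The update equations above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: $\beta_{1t}$ is held constant at `beta1`, the feasible set is taken to be unconstrained so the projection is the identity, a small `eps` is added to the denominator for numerical stability, and the bound functions use the same assumed forms and hypothetical parameters (`final_lr`, `gamma`) as common implementations. The function name `adabound_step` is also hypothetical.

```python
import math


def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(value, high))


def adabound_step(x, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                  final_lr=0.1, gamma=1e-3, eps=1e-8):
    """One AdaBound update at step t >= 1; returns the new (x, m, v) lists.

    Assumptions: constant beta1, unconstrained feasible set (identity
    projection), and eps added to sqrt(v) to avoid division by zero.
    """
    # Assumed bound functions eta_l(t) and eta_u(t).
    eta_l = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    eta_u = final_lr * (1.0 + 1.0 / (gamma * t))
    new_x, new_m, new_v = [], [], []
    for x_i, g_i, m_i, v_i in zip(x, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g_i        # first moment m_t
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i  # second moment v_t
        eta_hat = clip(alpha / (math.sqrt(v_i) + eps), eta_l, eta_u)
        eta = eta_hat / math.sqrt(t)                   # eta_t = eta_hat_t / sqrt(t)
        new_x.append(x_i - eta * m_i)                  # x_{t+1}, no projection
        new_m.append(m_i)
        new_v.append(v_i)
    return new_x, new_m, new_v


# Usage: minimize f(x) = sum(x_i^2), whose gradient is 2 * x.
x, m, v = [3.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    grad = [2.0 * x_i for x_i in x]
    x, m, v = adabound_step(x, grad, m, v, t)
loss = sum(x_i * x_i for x_i in x)
print(loss)
```

The sketch follows the equations as written here, including the $1/\sqrt{t}$ scaling of the clipped step size; practical implementations may differ in details such as bias correction.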

Source: Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Tasks

Task	Papers	Share
Vocal Bursts Type Prediction	1	8.33%
Bilevel Optimization	1	8.33%
Benchmarking	1	8.33%
General Classification	1	8.33%
Management	1	8.33%
Text Categorization	1	8.33%
Text Classification	1	8.33%
Clustering	1	8.33%
Image Classification	1	8.33%

Categories

Stochastic Optimization