Domain-independent Dominance of Adaptive Methods

From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks.
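The coupling between learning rate and adaptability mentioned above can be illustrated with a simplified, single-parameter Adam-style update. This is only a sketch and not the AvaGrad algorithm: it assumes that "adaptability" is governed by Adam's eps term, it ignores momentum and bias correction, and the function name `adam_like_step` is hypothetical.

```python
import math

def adam_like_step(grad, v, lr, eps):
    """Simplified Adam-style update for one parameter.

    Assumption: no momentum and no bias correction, so the update is
    lr * grad / (sqrt(v) + eps), with v the second-moment estimate.
    """
    return lr * grad / (math.sqrt(v) + eps)

grad = 1e-3
v = grad ** 2  # second-moment estimate under a constant gradient

# Small eps: the update is fully adaptive, so its size is close to lr
# no matter how small the gradient is.
adaptive = adam_like_step(grad, v, lr=1e-3, eps=1e-8)

# Large eps (eps >> sqrt(v)): the update approaches plain SGD with an
# effective learning rate of lr / eps.
sgd_like = adam_like_step(grad, v, lr=1e-3, eps=1.0)
sgd_equiv = (1e-3 / 1.0) * grad

# Raising eps tenfold with lr held fixed shrinks the step about tenfold...
shrunk = adam_like_step(grad, v, lr=1e-3, eps=10.0)

# ...while scaling lr together with eps keeps the step roughly unchanged.
rescaled = adam_like_step(grad, v, lr=1e-2, eps=10.0)
```

In this toy setting, changing eps alone also changes the step size, so a search over eps with a fixed learning rate conflates the two effects. That is one way to read why tuning them jointly, or decoupling them, can matter.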

Venue: CVPR 2021
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | AdaShift | Accuracy | 81.12 | # 2 |
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | AdamW | Accuracy | 79.87 | # 5 |
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | Adam (eps-adjusted) | Accuracy | 81.04 | # 3 |
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | SGD | Accuracy | 80.95 | # 4 |
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | AdaBound | Accuracy | 77.24 | # 6 |
| Stochastic Optimization | CIFAR-100 WRN-28-10 - 200 Epochs | AvaGrad | Accuracy | 81.24 | # 1 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | AdaShift | Accuracy | 95.92 | # 4 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | SGD | Accuracy | 96.14 | # 3 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | AvaGrad | Accuracy | 96.2 | # 2 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | AdaBound | Accuracy | 94.6 | # 6 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | AdamW | Accuracy | 95.89 | # 5 |
| Stochastic Optimization | CIFAR-10 WRN-28-10 - 200 Epochs | Adam (eps-adjusted) | Accuracy | 96.36 | # 1 |
| Stochastic Optimization | ImageNet ResNet-50 - 90 Epochs | AvaGrad | Top 1 Accuracy | 76.51 | # 1 |
| Stochastic Optimization | ImageNet ResNet-50 - 90 Epochs | AdaBound | Top 1 Accuracy | 72.01 | # 4 |
| Stochastic Optimization | ImageNet ResNet-50 - 90 Epochs | SGD | Top 1 Accuracy | 75.99 | # 2 |
| Stochastic Optimization | ImageNet ResNet-50 - 90 Epochs | AdamW | Top 1 Accuracy | 72.9 | # 3 |
| Stochastic Optimization | Penn Treebank (Character Level) 3x1000 LSTM - 500 Epochs | AdaBound | Bit per Character (BPC) | 2.863 | # 4 |
| Stochastic Optimization | Penn Treebank (Character Level) 3x1000 LSTM - 500 Epochs | AdaShift | Bit per Character (BPC) | 1.274 | # 3 |
| Stochastic Optimization | Penn Treebank (Character Level) 3x1000 LSTM - 500 Epochs | AdamW | Bit per Character (BPC) | 1.23 | # 2 |
| Stochastic Optimization | Penn Treebank (Character Level) 3x1000 LSTM - 500 Epochs | AvaGrad | Bit per Character (BPC) | 1.175 | # 1 |

Methods