A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes. Our results establish new, stronger baselines for future comparisons at these batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.
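To make the phrase "layer-wise normalization" concrete, here is a minimal sketch contrasting a plain heavy-ball momentum step with a LARS-style step for a single layer. It is an illustration under simplifying assumptions, not the exact update used in the paper or in any library: weight decay, learning-rate warmup, and the trust coefficient found in published LARS variants are omitted, and the function names, default hyperparameters, and `eps` term are hypothetical.

```python
import numpy as np


def heavy_ball_step(w, g, v, lr=0.1, momentum=0.9):
    """Plain heavy-ball momentum: the same global learning rate for every layer."""
    v_new = momentum * v + g
    return w - lr * v_new, v_new


def lars_style_step(w, g, v, lr=0.1, momentum=0.9, eps=1e-9):
    """Heavy-ball momentum with a layer-wise trust ratio (simplified sketch).

    The gradient of this one layer is rescaled by ||w|| / ||g||, so the size
    of the step tracks the weight norm rather than the raw gradient norm.
    Assumption: no weight decay and no trust coefficient, for brevity.
    """
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    # Fall back to an unscaled step if either norm is zero.
    if w_norm > 0 and g_norm > 0:
        trust_ratio = w_norm / (g_norm + eps)
    else:
        trust_ratio = 1.0
    v_new = momentum * v + trust_ratio * g
    return w - lr * v_new, v_new
```

In this sketch, multiplying a layer's gradient by a constant leaves the LARS-style step essentially unchanged (up to `eps`), whereas the heavy-ball step scales with the gradient. The paper's claim is that well-tuned standard optimizers can match LARS and LAMB at large batch sizes without this extra normalization.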

NeurIPS 2021

Results from the Paper


 Ranked #1 on Question Answering on SQuAD1.1 (Hardware Burden metric)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | ResNet-50 MLPerf v0.7 - 2512 steps | Top 1 Accuracy | 75.92% | # 532 |
| | | | Hardware Burden | None | # 1 |
| | | | Operations per network pass | None | # 1 |
| Question Answering | SQuAD1.1 | BERT-Large 32k batch size with AdamW | F1 | 91.58 | # 30 |
| | | | Hardware Burden | None | # 1 |
| | | | Operations per network pass | None | # 1 |

Methods