# Revisiting the Generalization of Adaptive Gradient Methods

A commonplace belief in the machine learning community is that using adaptive gradient methods hurts generalization. We re-examine this belief both theoretically and experimentally, in light of insights and trends from recent years... We revisit some previous oft-cited experiments and theoretical accounts in more depth, and provide a new set of experiments in larger-scale, state-of-the-art settings. We conclude that with proper tuning, the improved training performance of adaptive optimizers does not in general carry an overfitting penalty, especially in contemporary deep learning. Finally, we synthesize a user's guide'' to adaptive optimizers, including some proposed modifications to AdaGrad to mitigate some of its empirical shortcomings. read more

PDF Abstract

## Code Add Remove Mark official

No code implementations yet. Submit your code now