Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced by asynchrony.
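To make that asynchrony noise concrete, here is a minimal Hogwild-style sketch (a generic illustration under assumed toy settings, not the system described above): several threads update a shared parameter vector without locking, so each gradient may be computed against parameters that other workers have already moved.

```python
import threading
import numpy as np

d = 10
rng = np.random.default_rng(0)
w_true = rng.normal(size=d)
w = np.zeros(d)              # shared parameters, updated without locks
lr = 0.01

def worker(seed, steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = local_rng.normal(size=d)
        y = x @ w_true
        grad = (x @ w - y) * x   # gradient of the squared error on one sample
        w[:] -= lr * grad        # racy read-modify-write: other workers may
                                 # have moved w since grad was computed

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to optimum:", np.linalg.norm(w - w_true))
```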
Adaptive optimization methods such as AdaGrad, RMSProp, and Adam have been proposed to accelerate training by applying an element-wise scaling term to the learning rate.
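As a concrete instance of such element-wise scaling, the sketch below implements Adam's update rule with its published default hyperparameters; it is a minimal illustration, not any particular library's API, and the toy problem at the end is invented.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; the effective step size is rescaled
    element-wise by a running estimate of the gradient's magnitude."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # element-wise scaled step
    return w, m, v

# Toy usage: minimize 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0, 0.5])
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    w, m, v = adam_step(w, w - target, m, v, t)
```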
Partial warm restarts are also gaining popularity in gradient-based optimization, where they improve the rate of convergence of accelerated gradient schemes on ill-conditioned functions.
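A popular concrete schedule of this kind is cosine annealing with warm restarts (SGDR). The sketch below assumes that schedule, using the SGDR paper's notation: T_0 is the length of the first run and T_mult the factor by which run lengths grow after each restart.

```python
import math

def sgdr_lr(t, lr_min=0.0, lr_max=0.1, T_0=10, T_mult=2):
    """Learning rate at epoch t: cosine decay within each run,
    then a warm restart back to lr_max; run lengths grow by T_mult."""
    T_i = T_0
    while t >= T_i:          # locate the run that epoch t falls in
        t -= T_i
        T_i *= T_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T_i))

# With T_0=10 and T_mult=2, restarts occur at t = 10 and t = 30.
schedule = [sgdr_lr(t) for t in range(40)]
```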
In this paper we present the greedy step averaging (GSA) method, a parameter-free stochastic optimization algorithm for a variety of machine learning problems.
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence.
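A toy version of that recipe, assuming plain SGD, a 1/t decay, and a crude step-norm convergence test (the problem, noise level, and thresholds here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)
lr0, tol = 0.5, 1e-5

for t in range(1, 100_000):
    grad = w + 0.1 * rng.normal(size=w.shape)  # noisy gradient of 0.5*||w||^2
    step = (lr0 / t) * grad                    # decaying learning rate
    w -= step
    if np.linalg.norm(step) < tol:             # crude convergence criterion
        break

print(f"stopped at step {t}, ||w|| = {np.linalg.norm(w):.4f}")
```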
The wealth of structured (e.g. Wikidata) and unstructured data about the world available today presents an incredible opportunity for tomorrow's Artificial Intelligence.
First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to their efficient per-iteration complexity.
Extensive empirical results in this paper show that ASGD (accelerated SGD) yields performance gains over the heavy ball method (HB), Nesterov's accelerated gradient (NAG), and SGD.