no code implementations • 23 Feb 2024 • Kento Imaizumi, Hideaki Iiduka
In particular, previous numerical results indicated that, for SGD with a constant learning rate, the number of iterations needed for training decreases as the batch size increases, while the stochastic first-order oracle (SFO) complexity needed for training is minimized at a critical batch size and increases once the batch size exceeds that size.
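To make the shape of these curves concrete, here is a minimal sketch under an assumed step-count model N(b) = A·b/(b − B), which is decreasing and convex in the batch size b; the constants A and B and the search range are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical step-count model (illustrative constants, not the paper's):
# N(b) = A * b / (b - B) steps to train with batch size b > B.
A, B = 1000.0, 8.0
b = np.arange(9, 513)          # candidate batch sizes
steps = A * b / (b - B)        # monotone decreasing and convex in b
sfo = b * steps                # SFO complexity = batch size x steps

print(b[np.argmin(sfo)])       # critical batch size; equals 2 * B = 16 here
```

For this model the SFO complexity b·N(b) = A·b²/(b − B) is minimized at b = 2B, so the step count keeps falling with larger batches while the SFO cost bottoms out at the critical batch size and rises beyond it.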
no code implementations • 4 Feb 2024 • Naoki Sato, Hideaki Iiduka
While stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, a theoretical explanation for this is lacking.
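As a reference point for the method being analyzed, a minimal heavy-ball sketch of SGD with momentum follows; the toy objective, noise level, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])      # parameters of a toy problem
v = np.zeros_like(w)           # velocity buffer
lr, beta = 0.1, 0.9            # illustrative hyperparameters

for _ in range(200):
    g = w + 0.1 * rng.standard_normal(2)  # noisy gradient of f(w) = ||w||^2 / 2
    v = beta * v + g                      # heavy-ball momentum accumulation
    w = w - lr * v                        # parameter update
print(w)                                  # near the minimizer (0, 0), up to noise
```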
no code implementations • 15 Nov 2023 • Naoki Sato, Hideaki Iiduka
The graduated optimization approach is a heuristic method for finding globally optimal solutions for nonconvex functions and has been theoretically analyzed in several studies.
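A minimal sketch of the graduated-optimization idea, assuming Gaussian smoothing with a gradually sharpened noise schedule and a hypothetical one-dimensional nonconvex objective; it is not the construction analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy nonconvex objective f(x) = x**2 + 3*sin(2*x); its derivative:
f_grad = lambda x: 2 * x + 6 * np.cos(2 * x)

x = 3.0                                   # starts in the basin of a poor local minimum
for sigma in [2.0, 1.0, 0.5, 0.25, 0.0]:  # decreasing smoothing schedule
    for _ in range(300):
        u = sigma * rng.standard_normal(64)
        g = np.mean(f_grad(x + u))        # Monte Carlo gradient of the smoothed function
        x -= 0.05 * g
print(x)  # ends near the global minimizer (about -0.66), not the local one near x = 2
```

The heavily smoothed early stages see an almost-convex landscape and pull the iterate toward the global basin; each later stage refines the solution on a less smoothed surrogate.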
no code implementations • 25 Jul 2023 • Yuki Tsukada, Hideaki Iiduka
Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, it decreases as the batch size increases.
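A minimal sketch of the Armijo backtracking rule referenced above, on an assumed toy loss; in the SGD setting, `loss` and the gradient would be evaluated on the same mini-batch.

```python
import numpy as np

def armijo_lr(loss, w, g, lr0=1.0, c=1e-4, rho=0.5):
    # Backtrack until the sufficient-decrease (Armijo) condition holds:
    #   loss(w - lr * g) <= loss(w) - c * lr * ||g||^2
    lr, fw, gg = lr0, loss(w), g @ g
    while loss(w - lr * g) > fw - c * lr * gg:
        lr *= rho
    return lr

# Toy usage on an assumed mini-batch loss f(w) = 0.5 * ||w||^2.
loss = lambda w: 0.5 * (w @ w)
w = np.array([4.0, -2.0])
for _ in range(20):
    g = w                                  # gradient of the toy loss
    w = w - armijo_lr(loss, w, g) * g
print(w)                                   # near the minimizer (0, 0)
```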
no code implementations • 21 Aug 2022 • Hideaki Iiduka
That is, the numerical results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size that minimizes SFO complexity converges faster than Momentum and stochastic gradient descent (SGD).
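For reference, a standard Adam update in the kind of setting described above (small constant learning rate, beta parameters close to one); the toy problem and values are illustrative assumptions.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Small constant learning rate and betas close to one (illustrative values).
    m = b1 * m + (1 - b1) * g              # first-moment estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, m, v = np.array([1.0, -1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    w, m, v = adam_step(w, w, m, v, t)
print(w)                                   # near the minimizer (0, 0)
```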
no code implementations • 27 Jun 2022 • Hideaki Iiduka
Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition may be unrealistic in practice.
1 code implementation • 28 Mar 2022 • Hiroki Naganuma, Hideaki Iiduka
Since the data distribution is unknown, generative adversarial networks (GANs) formulate this problem as a game between two models, a generator and a discriminator.
1 code implementation • 28 Jan 2022 • Naoki Sato, Hideaki Iiduka
Previous results have shown that a two time-scale update rule (TTUR) using different learning rates, such as different constant rates or different decaying rates, is useful for training generative adversarial networks (GANs) in theory and in practice.
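A minimal TTUR sketch in PyTorch, assuming a toy one-dimensional data distribution and illustrative constant learning rates (larger for the discriminator than for the generator); it shows only the two-time-scale update pattern, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator for a toy 1-D data distribution.
G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

# Two time scales: distinct constant learning rates (illustrative values).
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCELoss()

for step in range(200):
    real = 0.5 * torch.randn(64, 1) + 2.0       # toy "real" samples
    fake = G(torch.randn(64, 2))                # generator samples

    # Discriminator update (faster time scale).
    opt_D.zero_grad()
    loss_D = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    loss_D.backward()
    opt_D.step()

    # Generator update (slower time scale).
    opt_G.zero_grad()
    loss_G = bce(D(fake), torch.ones(64, 1))
    loss_G.backward()
    opt_G.step()
```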
no code implementations • 14 Dec 2021 • Hideaki Iiduka
Numerical evaluations have definitively shown that, for deep learning optimizers such as stochastic gradient descent, momentum, and adaptive methods, the number of steps needed to train a deep neural network halves for each doubling of the batch size and that there is a region of diminishing returns beyond the critical batch size.
no code implementations • 26 Aug 2021 • Hideaki Iiduka
In particular, it is shown theoretically that momentum and Adam-type optimizers can exploit larger optimal batch sizes and reduce the minimum number of steps needed for nonconvex optimization further than the stochastic gradient descent optimizer can.
no code implementations • 2 Apr 2020 • Hiroyuki Sakai, Hideaki Iiduka
This paper proposes a Riemannian adaptive optimization algorithm to optimize the parameters of deep neural networks.
Stochastic Optimization • Optimization and Control • MSC classes: 65K05, 90C25, 57R25 • ACM class: G.1.6
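The paper's algorithm is an adaptive method, which the following does not reproduce; this is only a minimal sketch of the basic Riemannian ingredients (tangent-space projection and retraction) on the unit sphere, with an assumed toy problem.

```python
import numpy as np

def sphere_step(x, egrad, lr):
    rgrad = egrad - (egrad @ x) * x     # project the Euclidean gradient onto T_x S
    y = x - lr * rgrad                  # step in the tangent direction
    return y / np.linalg.norm(y)        # retract back onto the unit sphere

# Toy usage: minimize f(x) = x^T A x over the unit sphere, whose minimizer
# is an eigenvector for the smallest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                             # random symmetric matrix
x = rng.standard_normal(5)
x = x / np.linalg.norm(x)
for _ in range(1000):
    x = sphere_step(x, 2 * A @ x, lr=0.05)
print(x @ A @ x)                        # approaches the smallest eigenvalue of A
```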
1 code implementation • 29 Feb 2020 • Yu Kobayashi, Hideaki Iiduka
This paper proposes a conjugate-gradient-based Adam algorithm that blends Adam with nonlinear conjugate gradient methods, and it provides a convergence analysis.
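A hypothetical sketch of such a blend, building on the standard Adam update sketched earlier: a Fletcher-Reeves-style conjugate direction is fed into Adam's moment estimates in place of the raw gradient. This is an illustrative reconstruction, not the paper's exact algorithm; the function name `cg_adam_step` and all constants are assumptions.

```python
import numpy as np

def cg_adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Hypothetical blend (not the paper's exact method): a Fletcher-Reeves
    # style conjugate direction replaces the raw gradient in Adam's moments.
    m, v, d_prev, g_prev, t = state
    if g_prev is None:
        d = g
    else:
        gamma = (g @ g) / (g_prev @ g_prev + eps)   # Fletcher-Reeves coefficient
        d = g + gamma * d_prev                      # conjugate-gradient-like direction
    t += 1
    m = b1 * m + (1 - b1) * d
    v = b2 * v + (1 - b2) * d * d
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, d, g, t)

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.ones(3)
state = (np.zeros(3), np.zeros(3), None, None, 0)
for _ in range(2000):
    w, state = cg_adam_step(w, w, state)
print(w)                                            # near the minimizer (0, 0, 0)
```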