Inspired by recent research recommending that neural network training start with large learning rates (LRs) to achieve the best generalization, we explore this hypothesis in detail.
In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed effective learning rate (ELR).
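As a minimal sketch of what training directly on the sphere can look like, the following illustrates projected SGD on the unit sphere with a fixed ELR: the gradient is projected onto the tangent space, a step is taken, and the weights are renormalized. The specific update rule, function names, and toy loss below are illustrative assumptions, not necessarily the exact procedure used in the work.

```python
import numpy as np

def sphere_sgd_step(w, grad, elr):
    """One projected SGD step on the unit sphere with a fixed
    effective learning rate (ELR).

    Assumes the loss is scale-invariant in w, so only the direction
    of w matters; at unit norm, `elr` plays the role of lr / ||w||^2.
    """
    # Remove the radial component of the gradient, keeping only the
    # part tangent to the sphere at w (a scale-invariant loss has
    # zero radial derivative anyway, so this is a safety projection).
    tangent_grad = grad - np.dot(grad, w) * w
    # Take the gradient step, then retract back onto the unit sphere.
    w_new = w - elr * tangent_grad
    return w_new / np.linalg.norm(w_new)

# Toy usage: minimize the scale-invariant loss f(w) = (a.w)^2 / ||w||^2.
rng = np.random.default_rng(0)
a = rng.normal(size=10)
w = rng.normal(size=10)
w /= np.linalg.norm(w)
for _ in range(100):
    # Gradient of f at unit-norm w: 2 (a.w) a - 2 (a.w)^2 w.
    g = 2 * np.dot(a, w) * a - 2 * np.dot(a, w) ** 2 * w
    w = sphere_sgd_step(w, g, elr=0.1)
```

Because the loss depends only on the direction of w, fixing the ELR on the sphere decouples the optimization dynamics from the growth of the weight norm that would otherwise occur in ordinary Euclidean training.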
Training neural networks with batch normalization and weight decay has become a common practice in recent years.
Tensor decomposition methods have proven effective in various applications, including compression and acceleration of neural networks.
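To make the compression idea concrete, here is a minimal sketch of its matrix special case: a truncated SVD that factors a fully connected layer's weight into two thinner matrices. Full tensor methods (e.g., CP, Tucker, or tensor-train decompositions) generalize this low-rank idea to higher-order weight tensors; the function and shapes below are illustrative assumptions.

```python
import numpy as np

def compress_dense_layer(W, rank):
    """Factor a dense layer's weight W (out x in) into two thinner
    matrices via truncated SVD, the matrix special case of the
    low-rank idea behind tensor-decomposition compression.

    Replaces the map x -> W @ x (out*in parameters) with
    x -> A @ (B @ x), using rank*(out + in) parameters.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # out x rank, singular values folded in
    B = Vt[:rank, :]            # rank x in
    return A, B

# Toy usage: a 256x512 layer compressed to rank 32 keeps
# 32 * (256 + 512) = 24,576 parameters instead of 131,072.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
A, B = compress_dense_layer(W, rank=32)
x = rng.normal(size=512)
y_approx = A @ (B @ x)  # approximates W @ x
```

The same factorization also accelerates inference, since two thin matrix-vector products cost fewer operations than one dense product when the rank is small.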
Reducing the number of parameters is one of the most important goals in deep learning.