# Stochastic Optimization

301 papers with code • 12 benchmarks • 11 datasets

**Stochastic Optimization** is the task of optimizing certain objective functional by generating and using stochastic random variables. Usually the Stochastic Optimization is an iterative process of generating random variables that progressively finds out the minima or the maxima of the objective functional. Stochastic Optimization is usually applied in the non-convex functional spaces where the usual deterministic optimization such as linear or quadratic programming or their variants cannot be used.

Source: ASOC: An Adaptive Parameter-free Stochastic Optimization Techinique for Continuous Variables

## Libraries

Use these libraries to find Stochastic Optimization models and implementations## Datasets

## Most implemented papers

# Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

# Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training.

# Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches.

# On the Variance of the Adaptive Learning Rate and Beyond

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam.

# Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms.

# Optimizing Neural Networks with Kronecker-factored Approximate Curvature

This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

# SGDR: Stochastic Gradient Descent with Warm Restarts

Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions.

# Averaging Weights Leads to Wider Optima and Better Generalization

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence.

# Deep learning with Elastic Averaging SGD

We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to the improved performance.

# Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups.