In this paper, we empirically study the optimization dynamics of multi-task learning, focusing in particular on the dynamics that govern a collection of tasks with significant data imbalance.
Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can help reduce sensitivity to common hyperparameters.
In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent.
This arises when an approximate gradient is easier to compute than the full gradient (e.g., in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g., in certain reinforcement learning applications or training networks with discrete variables).
We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search.
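To make the idea of combining a surrogate gradient direction with random search concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single surrogate direction and a perturbation covariance of the form sigma^2 * (alpha/n * I + (1 - alpha) * u u^T), where u is the normalized surrogate gradient. The names `guided_es_gradient`, `sphere`, and `run_demo`, the hyperparameter values, and the noisy-surrogate test problem are all hypothetical choices made for illustration.

```python
import math
import random


def guided_es_gradient(f, x, surrogate, sigma=0.1, alpha=0.5, beta=1.0,
                       pairs=8, rng=random):
    """Antithetic random-search gradient estimate biased toward `surrogate`.

    Assumed (illustrative) perturbation covariance:
        sigma^2 * (alpha / n * I + (1 - alpha) * u u^T)
    with u the unit vector along the surrogate gradient.
    """
    n = len(x)
    norm = math.sqrt(sum(s * s for s in surrogate))
    # Unit guiding direction; falls back to plain isotropic search if it vanishes.
    u = [s / norm for s in surrogate] if norm > 0 else [0.0] * n
    grad = [0.0] * n
    for _ in range(pairs):
        z_sub = rng.gauss(0.0, 1.0)  # shared coefficient along u
        eps = [sigma * (math.sqrt(alpha / n) * rng.gauss(0.0, 1.0)
                        + math.sqrt(1.0 - alpha) * z_sub * ui)
               for ui in u]
        x_plus = [xi + e for xi, e in zip(x, eps)]
        x_minus = [xi - e for xi, e in zip(x, eps)]
        # Antithetic pair: finite difference along the sampled perturbation.
        diff = f(x_plus) - f(x_minus)
        for i in range(n):
            grad[i] += eps[i] * diff
    scale = beta / (2.0 * sigma ** 2 * pairs)
    return [scale * g for g in grad]


def sphere(x):
    """Toy objective: squared Euclidean norm."""
    return sum(xi * xi for xi in x)


def run_demo(steps=100, lr=0.2, seed=0):
    """Minimize `sphere` using a noisy surrogate; returns (initial, final) loss."""
    rng = random.Random(seed)
    x = [1.0] * 10
    start = sphere(x)
    for _ in range(steps):
        # Hypothetical surrogate: the true gradient 2x corrupted by Gaussian noise.
        surrogate = [2.0 * xi + rng.gauss(0.0, 0.1) for xi in x]
        g = guided_es_gradient(sphere, x, surrogate, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return start, sphere(x)
```

In this sketch, `alpha` trades off isotropic exploration (alpha = 1) against search confined to the surrogate direction (alpha = 0). In expectation the estimate is the true gradient premultiplied by a positive-definite matrix, so it remains a descent direction even when the surrogate is poor.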
Gradient-based optimization is the foundation of deep learning and reinforcement learning.