Joint Descent: Training and Tuning Simultaneously

1 Jan 2021  ·  Qiuyi Zhang ·

Typically in machine learning, training and tuning are done in an alternating manner: for a fixed set of hyperparameters $y$, we apply gradient descent to our objective $f(x, y)$ over trainable variables $x$ until convergence; then, we apply a tuning step over $y$ to find another promising setting of hyperparameters. Because the full training cycle is completed before a tuning step is applied, the optimization procedure greatly emphasizes the gradient step, which seems justified as first-order methods provides a faster convergence rate. In this paper, we argue that an equal emphasis on training and tuning lead to faster convergence both theoretically and empirically. We present Joint Descent (JD) and a novel theoretical analysis of acceleration via an unbiased gradient estimate to give an optimal iteration complexity of $O(\sqrt{\kappa}n_y\log(n/\epsilon))$, where $\kappa$ is the condition number and $n_y$ is the dimension of $y$. This provably improves upon the naive classical bound and implies that we essentially train for free if we apply equal emphasis on training and tuning steps. Empirically, we observe that an unbiased gradient estimate achieves the best convergence results, supporting our theory.

PDF Abstract
No code implementations yet. Submit your code now



  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here