Large Batch Optimization

Introduced by Tang et al. in 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

1-bit Adam is a stochastic optimization technique: a variant of Adam with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage of training. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts: the variance term $\mathbf{v}$ is no longer updated and is instead used as a fixed preconditioner. During the compression stage, workers communicate the momentum, to which error-compensated 1-bit compression is applied. The momentum is quantized into a 1-bit representation (the sign of each element), and the quantization error is carried over and added back before the next compression. Alongside the sign vector, a scaling factor is computed as $\frac{\text { magnitude of compensated momentum }}{\text { magnitude of quantized momentum }}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. Because each 32-bit or 16-bit value is replaced by a single bit, this 1-bit compression can reduce the communication cost by about $97 \%$ and $94 \%$ compared to the original float32 and float16 training, respectively.
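The compression step described above can be sketched in NumPy as follows. This is a minimal single-worker illustration, not the authors' implementation: the function names `one_bit_compress` and `compression_stage_step`, the use of the L2 norm as the "magnitude", and the hyperparameter defaults are all assumptions made for the example, and no actual communication between workers takes place.

```python
import numpy as np


def one_bit_compress(momentum, error):
    """Error-compensated 1-bit compression (illustrative sketch).

    `error` is the quantization residual carried over from the previous step.
    The L2 norm is assumed as the measure of magnitude.
    """
    # Add back the residual left over from the previous compression.
    compensated = momentum + error
    # 1-bit representation: keep only the sign of each element.
    signs = np.where(compensated >= 0, 1.0, -1.0)
    # Scaling factor: magnitude of compensated / magnitude of quantized vector.
    scale = np.linalg.norm(compensated) / np.linalg.norm(signs)
    compressed = scale * signs
    # Residual to be compensated at the next step.
    new_error = compensated - compressed
    return compressed, new_error


def compression_stage_step(x, grad, m, v_frozen, error,
                           lr=1e-3, beta1=0.9, eps=1e-8):
    """One compression-stage update with a frozen variance term (sketch)."""
    # Momentum update as in Adam; v_frozen is NOT updated in this stage.
    m = beta1 * m + (1 - beta1) * grad
    # In distributed training, this compressed momentum is what is communicated.
    m_hat, error = one_bit_compress(m, error)
    # Frozen variance acts as a fixed preconditioner.
    x = x - lr * m_hat / (np.sqrt(v_frozen) + eps)
    return x, m_hat, error
```

Note that `compressed + new_error` reconstructs the compensated momentum exactly, which is what lets the quantization error be fed back into later steps rather than lost.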
