1-bit Adam is a stochastic optimization technique, a variant of Adam with error-compensated 1-bit compression. It is based on the observation that Adam's variance term becomes stable at an early stage of training. First, vanilla Adam is run for a few epochs as a warm-up. After the warm-up stage, the compression stage starts: the variance term $\mathbf{v}$ is no longer updated and is used as a fixed preconditioner. During the compression stage, the momentum is communicated with error-compensated 1-bit compression: each element is quantized into a 1-bit representation (its sign). Accompanying this vector, a scaling factor is computed as $\frac{\text{magnitude of compensated gradient}}{\text{magnitude of quantized gradient}}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce the communication cost by $97\%$ and $94\%$ compared to the original float32 and float16 training, respectively.
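A minimal single-process sketch of the two stages in NumPy is shown below. The function names (`one_bit_compress`, `one_bit_adam_step`), hyperparameters, and the omission of the actual all-reduce communication are illustrative assumptions, not the paper's reference implementation; in the distributed algorithm, the compressed momentum is what gets exchanged between workers.

```python
import numpy as np

def one_bit_compress(tensor, error):
    """Error-compensated 1-bit compression (illustrative sketch)."""
    compensated = tensor + error                       # add residual from last round
    signs = np.sign(compensated)                       # 1-bit quantization: keep only signs
    # Scaling factor = magnitude of compensated tensor / magnitude of quantized tensor,
    # so the decompressed momentum keeps the magnitude of the uncompressed one.
    scale = np.linalg.norm(compensated) / (np.linalg.norm(signs) + 1e-12)
    decompressed = scale * signs
    error = compensated - decompressed                 # residual carried to the next step
    return signs, scale, error


def one_bit_adam_step(param, grad, m, v, error, step, warmup_steps,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step covering both stages (single worker, no communication)."""
    m = beta1 * m + (1 - beta1) * grad                 # momentum update
    if step < warmup_steps:
        # Warm-up stage: vanilla Adam, variance term still being updated.
        v = beta2 * v + (1 - beta2) * grad ** 2
        update = m
    else:
        # Compression stage: v is frozen and used as a fixed preconditioner;
        # the momentum is compressed to 1 bit with error compensation.
        signs, scale, error = one_bit_compress(m, error)
        update = scale * signs                         # decompressed momentum
        m = update                                     # momentum is replaced by the decompressed value
    param = param - lr * update / (np.sqrt(v) + eps)
    return param, m, v, error
```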
Source: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
| Task | Papers | Share |
|---|---|---|
| Semantic Segmentation | 5 | 4.95% |
| Image Generation | 4 | 3.96% |
| 2D Object Detection | 3 | 2.97% |
| Object Detection | 3 | 2.97% |
| Image Classification | 3 | 2.97% |
| Retrieval | 3 | 2.97% |
| Multi-task Language Understanding | 2 | 1.98% |
| Object | 2 | 1.98% |
| Image Retrieval | 2 | 1.98% |