1-bit Adam is a stochastic optimization technique, a variant of Adam with error-compensated 1-bit compression, based on the finding that Adam's variance term stabilizes at an early stage of training. First, vanilla Adam runs for a few epochs as a warm-up. After the warm-up, the compression stage begins: the variance term $\mathbf{v}$ is frozen and used as a fixed preconditioner, and communication operates on the momentum with error-compensated 1-bit compression. The momentum is quantized to a 1-bit representation (the sign of each element). Alongside the sign vector, a scaling factor is communicated, computed as $\frac{\text{magnitude of compensated momentum}}{\text{magnitude of quantized momentum}}$; this factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce communication volume by $97\%$ and $94\%$ compared to float32 and float16 training, respectively.
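The compression step described above can be sketched as follows. This is a minimal single-worker illustration, not the paper's implementation; the function and variable names are illustrative.

```python
import numpy as np

def onebit_compress(momentum, error):
    """Error-compensated 1-bit compression of a momentum vector (sketch)."""
    compensated = momentum + error              # add residual error from the previous step
    signs = np.sign(compensated)                # 1-bit quantization: keep only the signs
    # Scaling factor = ||compensated|| / ||quantized||, so the compressed
    # momentum has the same magnitude as the compensated momentum.
    scale = np.linalg.norm(compensated) / (np.linalg.norm(signs) + 1e-12)
    compressed = scale * signs                  # what is actually communicated
    new_error = compensated - compressed        # residual carried to the next step
    return compressed, new_error

m = np.array([0.5, -1.2, 0.3, 0.9])
c, e = onebit_compress(m, np.zeros_like(m))
```

With a zero incoming error, the compressed vector has exactly the magnitude of the original momentum, and the compensated momentum can be recovered as `c + e`; over many steps, the carried error keeps the quantization bias from accumulating.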
Source: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
| Task | Papers | Share |
|---|---|---|
| Semantic Segmentation | 5 | 7.69% |
| Retrieval | 3 | 4.62% |
| Image Generation | 3 | 4.62% |
| Image Classification | 2 | 3.08% |
| Image Retrieval | 2 | 3.08% |
| Language Modelling | 2 | 3.08% |
| Large Language Model | 2 | 3.08% |
| Text-to-Image Generation | 2 | 3.08% |
| Video Generation | 2 | 3.08% |