A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties. In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms. We view the quantized gradient of FQT as a stochastic estimator of its full-precision counterpart, the gradient used in quantization-aware training (QAT). We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance. Inspired by these theoretical results, we develop two novel gradient quantizers, and we show that these have smaller variance than the existing per-tensor quantizer. For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline.
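
The unbiasedness claim rests on quantizers that use stochastic rounding, so that the dequantized gradient equals the full-precision gradient in expectation. The NumPy sketch below is a minimal illustration of a per-tensor unbiased quantizer, not the paper's per-sample or block Householder quantizer; the function name, the 5-bit setting, and the min-max scaling are illustrative assumptions.

```python
import numpy as np

def stochastic_round_quantize(x, num_bits=5):
    """Per-tensor quantizer sketch: map x onto a (2**num_bits - 1)-level grid
    and apply stochastic rounding, which is unbiased in expectation."""
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    scaled = (x - lo) / scale                      # values in [0, qmax]
    floor = np.floor(scaled)
    prob_up = scaled - floor                       # fractional part
    q = floor + (np.random.rand(*x.shape) < prob_up)  # stochastic rounding
    return q * scale + lo                          # dequantize back to float

# Empirical check of unbiasedness: the mean of many quantized draws
# should approach the original tensor.
x = np.random.randn(4, 8).astype(np.float32)
est = np.mean([stochastic_round_quantize(x) for _ in range(10000)], axis=0)
print(np.max(np.abs(est - x)))   # small residual, shrinking with more draws
```

The paper's quantizers keep this unbiasedness while reducing the variance that a single per-tensor scale incurs when gradient magnitudes vary widely across samples or blocks.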


Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | PSQ (Chen et al., 2020) | Accuracy | 67.5 | #21 |
| Semantic Textual Similarity | MRPC | PSQ (Chen et al., 2020) | Accuracy | 90.4 | #13 |
| Natural Language Inference | MultiNLI | PSQ (Chen et al., 2020) | Matched | 89.9 | #11 |
| Natural Language Inference | QNLI | PSQ (Chen et al., 2020) | Accuracy | 94.5 | #15 |
| Natural Language Inference | RTE | PSQ (Chen et al., 2020) | Accuracy | 86.8 | #23 |
| Sentiment Analysis | SST-2 Binary classification | PSQ (Chen et al., 2020) | Accuracy | 96.2 | #19 |
| Semantic Textual Similarity | STS Benchmark | PSQ (Chen et al., 2020) | Pearson Correlation | 0.919 | #9 |
