975 papers with code • 9 benchmarks • 16 datasets

Quantization is a promising technique for reducing the computation cost of neural network training by replacing high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).

Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
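
As a rough illustration of the float-to-fixed-point replacement described above, the following is a minimal sketch of symmetric linear quantization to the int8 range in plain Python. It is not the method of the cited paper; the function names `quantize_int8` and `dequantize`, the per-tensor scale, and the choice of the symmetric range [-127, 127] are all assumptions made for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumption: symmetric, per-tensor scaling chosen so that the largest
    magnitude in `values` maps to 127.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero when every value is 0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer, then clamp to the int8-style range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [qi * scale for qi in q]


x = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Largest round-trip error; bounded by about half a quantization step.
max_error = max(abs(a - b) for a, b in zip(x, x_hat))
```

Each value is stored as a small integer plus one shared floating-point scale, so the round-trip error per value stays within about half a quantization step (`scale / 2`). Real frameworks typically add zero points, per-channel scales, or calibration on top of this basic idea.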


Most implemented papers

FastText.zip: Compressing text classification models

facebookresearch/fastText 12 Dec 2016

We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

tensorflow/models CVPR 2018

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

pytorch/fairseq NeurIPS 2020

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

NervanaSystems/distiller 1 Oct 2015

To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

tensorpack/tensorpack 20 Jun 2016

We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients.

Billion-scale similarity search with GPUs

facebookresearch/faiss 28 Feb 2017

Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures.

HAQ: Hardware-Aware Automated Quantization with Mixed Precision

mit-han-lab/once-for-all CVPR 2019

Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

ist-daslab/gptq 31 Oct 2022

In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient.

QLoRA: Efficient Finetuning of Quantized LLMs

artidoro/qlora NeurIPS 2023

Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU.

GLM-130B: An Open Bilingual Pre-trained Model

thudm/glm-130b 5 Oct 2022

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.