Quantization
1032 papers with code • 10 benchmarks • 18 datasets
Quantization is a promising technique for reducing the computation cost of neural network training and inference: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
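In its most common form this means mapping a float32 tensor onto a small integer grid with a scale factor. Below is a minimal sketch, assuming symmetric per-tensor int8 quantization; it is illustrative only and not tied to any particular paper listed on this page.

```python
# Minimal sketch: symmetric per-tensor int8 quantization (illustrative assumption).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 codes with a single per-tensor scale."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 codes."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print("max abs quantization error:", np.abs(x - x_hat).max())
```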
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
Libraries
Use these libraries to find Quantization models and implementations.
Latest papers
Efficient Multi-Vector Dense Retrieval Using Bit Vectors
This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval.
Minimize Quantization Output Error with Bias Compensation
Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output errors that hinder model deployment.
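The title points to compensating that error with a bias term. The sketch below shows the general bias-correction idea, assuming a simple int8 weight quantizer and a synthetic calibration set; it illustrates the concept only and is not the paper's exact algorithm.

```python
# Hedged sketch: generic bias correction after weight quantization
# (general idea only, not this paper's specific method).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)       # float weights
X = rng.standard_normal((256, 128)).astype(np.float32)      # calibration inputs

# Naive symmetric int8 quantization of the weights.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127) * scale

# Output error introduced by quantization ...
err = X @ W.T - X @ W_q.T
# ... and a bias term that cancels its mean over the calibration set.
bias_comp = err.mean(axis=0)

corrected = X @ W_q.T + bias_comp
print("mean |error| before:", np.abs(err).mean())
print("mean |error| after :", np.abs(X @ W.T - corrected).mean())
```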
Transformer based Pluralistic Image Completion with Reduced Information Loss
The indices of quantized pixels are used as tokens for the inputs and prediction targets of the transformer.
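The quantize-then-tokenize step can be pictured as snapping each pixel to its nearest entry in a small codebook and feeding the entry indices to the transformer. A toy sketch of that general idea follows, with an assumed random codebook; it is not this paper's specific pipeline.

```python
# Toy sketch: quantized-pixel indices as transformer token ids
# (generic VQ-token idea, assumed codebook, not the paper's pipeline).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.uniform(0, 1, size=(512, 3)).astype(np.float32)   # 512 RGB codes
image = rng.uniform(0, 1, size=(16, 16, 3)).astype(np.float32)   # toy image

pixels = image.reshape(-1, 3)                                    # (256, 3)
dists = ((pixels[:, None, :] - codebook[None]) ** 2).sum(-1)     # (256, 512)
tokens = dists.argmin(axis=1)                                    # one token id per pixel

print(tokens.shape, tokens[:8])  # these ids serve as transformer inputs / prediction targets
```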
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits.
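The intuition behind rotation-based schemes is that an orthogonal rotation can be inserted into a linear layer without changing its output, while spreading outlier channels so that low-bit quantization loses less information. The toy sketch below uses a random orthogonal matrix and a crude symmetric 4-bit quantizer to show the effect; it is an assumption-laden illustration, not QuaRot's actual Hadamard-based construction.

```python
# Toy sketch: an orthogonal rotation Q spreads outlier channels before 4-bit
# quantization, and is exactly invertible since (x Q)(Q^T W) = x W.
# Illustrative only, not QuaRot's construction.
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((8, d)).astype(np.float32)
x[:, 3] *= 50.0  # an "outlier" channel that dominates the per-tensor scale

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation

def quant_4bit(a):
    """Crude symmetric 4-bit fake quantization."""
    s = np.abs(a).max() / 7.0
    return np.clip(np.round(a / s), -7, 7) * s

err_plain = np.abs(x - quant_4bit(x)).mean()
x_rot = x @ Q
err_rot = np.abs(x_rot - quant_4bit(x_rot)).mean()
print(f"mean 4-bit error without rotation: {err_plain:.4f}")
print(f"mean 4-bit error with rotation:    {err_rot:.4f}")
```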
Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers
Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs.
QNCD: Quantization Noise Correction for Diffusion Models
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity.
The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity.
AffineQuant: Affine Transformation Quantization for Large Language Models
Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training.
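PTQ operates on an already-trained model: a small calibration set is passed through the network to choose quantization scales, with no retraining. The sketch below shows that generic workflow on a single synthetic layer; it is a minimal assumption-based illustration, not AffineQuant's affine-transformation method.

```python
# Minimal sketch of a generic post-training quantization (PTQ) workflow
# (calibrate scales on held-out data, no retraining); not AffineQuant itself.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16)).astype(np.float32)          # "pretrained" weights
calib = rng.standard_normal((128, 16)).astype(np.float32)     # small calibration set

# 1. Calibration: observe the activation range on real(istic) data.
acts = calib @ W.T
act_scale = np.abs(acts).max() / 127.0

# 2. Quantize weights, then simulate int8 activations with the calibrated scale.
w_scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / w_scale), -127, 127) * w_scale
acts_q = np.clip(np.round(calib @ W_q.T / act_scale), -127, 127) * act_scale

print("weight quantization error:    ", np.abs(W - W_q).mean())
print("activation quantization error:", np.abs(acts - acts_q).mean())
```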
NoisyDECOLLE: Robust Local Learning for SNNs on Neuromorphic Hardware
However, mapping these algorithms to neuromorphic systems to unleash their potential can be impaired by various kinds of noise.