Quantization
1003 papers with code • 9 benchmarks • 17 datasets
Quantization is a promising technique for reducing the computation cost of neural network training: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
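As a toy illustration of that idea (not the cited paper's method), the sketch below quantizes a float32 array to int8 with a single symmetric scale and dequantizes it back; the function names and the [-128, 127] range are just the usual int8 convention, assumed for the example.

```python
# Minimal sketch of symmetric int8 quantization of a float32 tensor (illustrative only).
import numpy as np

def quantize_int8(x: np.ndarray):
    # Scale chosen so the largest magnitude maps to the int8 limit 127.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximate float32 tensor from the int8 codes.
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print(np.max(np.abs(x - x_hat)))  # rounding error is on the order of scale / 2
```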
Latest papers
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity.
AffineQuant: Affine Transformation Quantization for Large Language Models
Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training.
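As a rough, generic illustration of what PTQ means (not AffineQuant's affine-transformation method), the sketch below derives an affine scale and zero-point from a small calibration batch and quantizes values to 8 bits with no retraining; all names and shapes are assumptions for the example.

```python
# Generic post-training quantization sketch: calibrate a scale/zero-point from
# observed activations, then quantize without touching the trained weights.
import numpy as np

def calibrate(activations: np.ndarray, n_bits: int = 8):
    # Asymmetric (affine) quantization parameters from the observed min/max range.
    qmin, qmax = 0, 2 ** n_bits - 1
    lo, hi = activations.min(), activations.max()
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.uint8)

calib = np.random.rand(1024).astype(np.float32) * 6.0   # stand-in calibration batch
scale, zp = calibrate(calib)
print(quantize(calib[:5], scale, zp))
```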
Self-Supervised Quantization-Aware Knowledge Distillation
Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models.
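A hedged sketch of how QAT and KD are commonly combined is shown below: fake quantization with a straight-through estimator in the student's forward pass, plus a distillation loss against a full-precision teacher. It assumes PyTorch and generic hyperparameters (temperature T, mixing weight alpha); it is not the paper's exact self-supervised recipe.

```python
# Illustrative combination of quantization-aware training and knowledge distillation.
import torch
import torch.nn.functional as F

def fake_quant(x, n_bits=4):
    # Simulate low-bit rounding in the forward pass; gradients pass through unchanged.
    scale = x.detach().abs().max() / (2 ** (n_bits - 1) - 1)
    q = torch.clamp(torch.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return x + (q - x).detach()  # straight-through estimator

def qat_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label cross-entropy plus soft-label KL against the full-precision teacher.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd
```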
TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks
It achieves 11 SotA results and 4 top-2 results out of 16 tasks across Taxonomy Enrichment, Hypernym Discovery, Taxonomy Construction, and Lexical Entailment.
Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency
We present experiments on two benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency.
Chronos: Learning the Language of Time Series
We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models.
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference.
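To make the setting concrete, the toy sketch below (not GEAR itself) quantizes cached key/value tensors to int8 with per-channel scales, shrinking the cache roughly 4x relative to float32; the shapes and function names are assumptions made only for illustration.

```python
# Toy per-channel int8 quantization of a KV cache during autoregressive decoding.
import numpy as np

def quantize_kv(kv: np.ndarray, n_bits: int = 8):
    # kv: (seq_len, num_heads, head_dim); one scale per head-dim channel.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(kv).max(axis=(0, 1), keepdims=True) / qmax
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

cache = np.random.randn(128, 8, 64).astype(np.float32)   # stand-in cache
q, s = quantize_kv(cache)
print(q.nbytes / cache.nbytes)  # int8 codes take ~1/4 of the float32 footprint
```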
Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference
Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training.
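As a contrast with uniform integer grids, the toy example below rounds values to powers of two (logarithmic quantization), which concentrates levels near zero where most DNN weights lie; it is a generic illustration, not the paper's logarithmic-posit encoding.

```python
# Toy logarithmic (power-of-two) quantization: round the exponent, not the value.
import numpy as np

def log2_quantize(x: np.ndarray, n_bits: int = 4):
    sign = np.sign(x)
    mag = np.abs(x) + 1e-12                      # avoid log of zero
    exp = np.clip(np.round(np.log2(mag)),
                  -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return sign * np.exp2(exp)

w = np.random.randn(5).astype(np.float32) * 0.1
print(w, log2_quantize(w))
```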
QAQ: Quality Adaptive Quantization for LLM KV Cache
The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation.
Behavior Generation with Latent Actions
Unlike language or image generation, decision making requires modeling actions - continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction.
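One common way to make such continuous actions compatible with discrete sequence models is vector quantization; the minimal sketch below maps an action vector to its nearest entry in a toy codebook. The random codebook and two-dimensional actions are purely illustrative assumptions, not the paper's learned latent-action tokenizer.

```python
# Minimal vector-quantization lookup: continuous action -> discrete code index.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 2))        # 16 toy action codes (random here)

def encode(action: np.ndarray) -> int:
    # Index of the nearest code by Euclidean distance.
    return int(np.argmin(np.linalg.norm(codebook - action, axis=1)))

def decode(index: int) -> np.ndarray:
    return codebook[index]

a = np.array([0.3, -1.2])
idx = encode(a)
print(idx, decode(idx))
```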