Model Compression
401 papers with code • 2 benchmarks • 3 datasets
Model Compression has been an actively pursued area of research over the last few years, with the goal of deploying state-of-the-art deep networks on low-power, resource-limited devices without a significant drop in accuracy. Parameter pruning, low-rank factorization, and weight quantization are some of the methods proposed to reduce the size of deep networks.
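As a concrete illustration of one of these techniques, the sketch below factors a dense weight matrix with a truncated SVD so that one large linear layer can be replaced by two thin ones. The function name and the choice of rank are illustrative, not taken from any particular paper.

```python
import torch

def low_rank_factorize(W, rank):
    """Approximate a dense (out x in) weight matrix by two thin factors via
    truncated SVD, replacing one linear layer with an (out x rank) and a
    (rank x in) layer. Illustrative sketch; `rank` is a tuning knob."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded in
    B = Vh[:rank, :]             # (rank, in)
    return A, B                  # W ~= A @ B when rank << min(out, in)
```

The parameter count drops from out·in to rank·(out + in), so the payoff comes from choosing a rank much smaller than either dimension.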
Most implemented papers
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
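The parameter savings come largely from SqueezeNet's Fire module; the PyTorch sketch below is a minimal rendering of that block, with the channel widths left as placeholders chosen by the caller.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: a narrow 1x1 "squeeze" layer feeding mixed
    1x1/3x3 "expand" layers; the small squeeze width cuts the parameter count."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat(
            [self.relu(self.expand1x1(x)), self.relu(self.expand3x3(x))], dim=1
        )
```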
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient.
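A heavily simplified, unblocked sketch of the column-by-column loop is shown below; it assumes a single symmetric per-tensor scale and omits the blocking, per-group scales, and lazy batch updates of the released implementation, so treat it as an illustration of the second-order error-compensation idea rather than the actual algorithm.

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Simplified GPTQ-style pass (illustrative only).
    W: (rows, cols) weight matrix; H: (cols, cols) Hessian proxy ~ 2 * X @ X.T."""
    W = W.clone()
    cols = W.shape[1]
    maxq = 2 ** (bits - 1) - 1
    scale = W.abs().max() / maxq                    # single per-tensor scale
    # dampen the Hessian, invert it, and take the upper Cholesky factor
    H = H + damp * torch.diag(H).mean() * torch.eye(cols, dtype=H.dtype)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    Q = torch.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale), -maxq, maxq) * scale
        Q[:, i] = q
        err = (w - q) / Hinv[i, i]
        # push the quantization error onto the not-yet-quantized columns
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return Q
```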
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Model compression is a critical technique to efficiently deploy neural network models on mobile devices which have limited computation resources and tight power budgets.
Learned Step Size Quantization
Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases.
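A minimal sketch of the learned-step-size idea follows, assuming a PyTorch weight tensor and a learnable scalar step; it uses the gradient-scaling and straight-through-rounding tricks from the paper's formulation, but initialization and per-layer bookkeeping are left out.

```python
import torch

def lsq_quantize(w, step, bits=4):
    """Learned Step Size Quantization, forward pass only (illustrative).
    `step` should be an nn.Parameter, e.g. initialized to 2*|w|.mean()/sqrt(qp);
    gradients reach it through the straight-through estimator below."""
    qn, qp = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    g = 1.0 / (w.numel() * qp) ** 0.5          # gradient scale from the paper
    s = step * g + (step - step * g).detach()  # value = step, grad scaled by g
    v = torch.clamp(w / s, qn, qp)
    v = (torch.round(v) - v).detach() + v      # straight-through rounding
    return v * s
```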
The State of Sparsity in Deep Neural Networks
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet.
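Magnitude pruning, one of the techniques compared in that study, is simple enough to sketch directly; the function below zeroes the smallest-magnitude weights of a single tensor (the in-place masking and the returned mask are our choices, not the paper's API).

```python
import torch

def magnitude_prune(weight, sparsity):
    """Zero out the smallest-magnitude entries of `weight` to reach the target
    sparsity fraction, returning the binary mask (illustrative sketch)."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    weight.data.mul_(mask)   # apply the mask in place
    return mask              # re-apply after each optimizer step to keep sparsity
```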
Ternary Weight Networks
We present memory- and computation-efficient ternary weight networks (TWNs), with weights constrained to +1, 0, and -1.
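A minimal ternarizer in the spirit of the paper's approximate solution is sketched below; the 0.7 threshold factor and the scale computed from the surviving weights follow the paper's rule of thumb, while the rest (single-tensor handling, no per-layer state) is simplified.

```python
import torch

def ternarize(W):
    """Ternary Weight Networks quantizer (approximate closed-form, illustrative).
    Threshold delta = 0.7 * E|W|; alpha = mean magnitude of surviving weights."""
    delta = 0.7 * W.abs().mean()
    mask = (W.abs() > delta).float()
    alpha = (W.abs() * mask).sum() / mask.sum().clamp(min=1)
    return alpha * torch.sign(W) * mask
```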
Model compression via distillation and quantization
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning.
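Quantized distillation couples a low-precision student with a full-precision teacher through the usual distillation objective; the sketch below shows that objective only, and the temperature, weighting, and the assumption that the student's logits come from an already-quantized model are ours.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL at temperature T plus hard-label cross-entropy
    (standard knowledge-distillation loss, illustrative sketch)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```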
To prune, or not to prune: exploring the efficacy of pruning for model compression
Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model.
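The paper's gradual pruning recipe ramps the target sparsity with a cubic schedule, s_t = s_f + (s_i - s_f)(1 - (t - t_0)/(nΔt))^3; a small sketch of that schedule follows, with all hyper-parameter defaults as placeholders.

```python
def sparsity_schedule(step, s_initial=0.0, s_final=0.9, t0=0, n=100, dt=1000):
    """Cubic sparsity ramp: prune every dt steps over n pruning steps,
    starting at step t0 (illustrative; defaults are placeholders)."""
    if step < t0:
        return s_initial
    t = min((step - t0) // dt, n)
    frac = 1.0 - t / n
    return s_final + (s_initial - s_final) * frac ** 3
```

Each time the schedule increases, the smallest-magnitude weights are masked to the new sparsity level and training continues, letting the network recover from each pruning step.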
Global Sparse Momentum SGD for Pruning Very Deep Neural Networks
Deep neural networks (DNNs) are powerful but computationally expensive and memory intensive, which impedes their practical use on resource-constrained front-end devices.