Model compression has been an actively pursued area of research over the last few years, with the goal of deploying state-of-the-art deep networks on low-power and resource-limited devices without a significant drop in accuracy. Parameter pruning, low-rank factorization, and weight quantization are among the methods proposed to reduce the size of deep networks.
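To make the simplest of these concrete, below is a minimal sketch of one-shot magnitude pruning in PyTorch: weights with the smallest absolute values are zeroed out. The function name, layer shape, and sparsity level are illustrative, not taken from any particular paper.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of `weight`.

    `sparsity` is the fraction of weights to remove (e.g. 0.9 keeps 10%).
    """
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    # Threshold = k-th smallest absolute value; entries below it are pruned.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

# Example: prune a layer's weight matrix to roughly 90% sparsity.
w = torch.randn(256, 512)
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"achieved sparsity: {(w_pruned == 0).float().mean():.2f}")
```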
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training.
However, top-line accuracy conceals significant differences in how individual classes and images are impacted by model compression techniques.
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: a Transformer trained on WMT 2014 English-to-German translation and a ResNet-50 trained on ImageNet.
A standard solution is to train networks with Quantization-Aware Training (QAT), where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator (STE).
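A minimal sketch of such a fake-quantization step with the STE is shown below, assuming a uniform quantization grid over the tensor's range; the bit width and grid are illustrative choices, not any specific paper's scheme.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Uniformly quantize `x` in the forward pass, pass gradients straight through.

    Forward: round onto a `num_bits` uniform grid over [x.min(), x.max()].
    Backward: the detach trick makes d(out)/d(x) == 1, i.e. the
    Straight-Through Estimator for the non-differentiable rounding.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = x.min()
    q = torch.round((x - zero_point) / scale).clamp(qmin, qmax)
    x_q = q * scale + zero_point
    # STE: the forward value is x_q, but the gradient flows as if identity.
    return x + (x_q - x).detach()
```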
Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
Deep neural networks (DNNs) continue to make significant advances, solving tasks ranging from image classification to translation and reinforcement learning.
Model compression is a critical technique for efficiently deploying neural network models on mobile devices, which have limited computational resources and tight power budgets.
To speed up inference and to cope with the problems caused by a lack of data, knowledge distillation (KD) has been proposed as a way to transfer the knowledge learned by one model to another.
We demonstrate that this objective ignores important structural knowledge of the teacher network.
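For reference, the objective in question is typically the soft-target loss of Hinton et al. (2015): a temperature-scaled KL divergence against the teacher's softened outputs, combined with the usual cross-entropy on the true labels. A minimal sketch follows; the temperature T and mixing weight alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard soft-target knowledge distillation objective."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescales gradients to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```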
Deep neural networks (DNNs) are powerful but computationally expensive and memory-intensive, which impedes their practical deployment on resource-constrained front-end devices.