As the number of parameters in language models has grown dramatically, sparsity methods have attracted increasing research attention as a means to compress and accelerate these models.
In practice, this limits the application of model compression when the model needs to be deployed on a wide range of devices.
Motivated by the need for efficient BERT inference under various resource constraints, we propose a novel approach, YOCO-BERT, to compress once and deploy everywhere.
We also provide a filter-rearrangement workflow: the weight matrix is first rearranged along the output-channel dimension to derive more influential blocks for accuracy improvement, and the same permutation is then applied to the next layer's weights along the input-channel dimension so that the convolution still computes the correct result.
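A minimal sketch of this rearrangement, assuming two consecutive convolutions `conv1` and `conv2` (hypothetical names) and a filter-importance score given by the L1 norm (an illustrative choice, not necessarily the paper's exact criterion):

```python
import torch
import torch.nn as nn

# Two consecutive convolutions; names and the L1-norm importance score
# are assumptions for illustration only.
conv1 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)

with torch.no_grad():
    # Score each output filter of conv1, e.g. by its L1 norm.
    scores = conv1.weight.abs().sum(dim=(1, 2, 3))   # shape: [32]
    perm = torch.argsort(scores, descending=True)    # output-channel permutation

    # Rearrange conv1 along the output-channel dimension.
    conv1.weight.data = conv1.weight.data[perm]
    if conv1.bias is not None:
        conv1.bias.data = conv1.bias.data[perm]

    # Apply the same permutation to conv2 along the input-channel
    # dimension so conv2(conv1(x)) computes exactly the same function.
    conv2.weight.data = conv2.weight.data[:, perm]
```

Because the identical permutation is applied to both layers, the composed function is unchanged; the rearrangement only groups influential filters into contiguous blocks for subsequent block-wise processing.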
Channel pruning and tensor decomposition have received extensive attention in convolutional neural network compression.
Specifically, most state-of-the-art super-resolution (SR) models omit batch normalization and therefore exhibit a large dynamic range of activations, which is another cause of the performance drop under quantization.
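A small illustration of why a large dynamic range hurts uniform quantization; the tensors, bit-width, and outlier below are made up for demonstration:

```python
import torch

def quantize_dequantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax   # step size grows with the dynamic range
    return torch.round(x / scale) * scale

torch.manual_seed(0)
narrow = torch.randn(10000)        # activations after batch norm: small range
wide = torch.randn(10000) * 50     # BN-free activations: large range
wide[0] = 500.0                    # a single outlier stretches the range further

for name, x in [("narrow", narrow), ("wide", wide)]:
    err = (x - quantize_dequantize(x)).pow(2).mean()
    print(f"{name}: MSE = {err:.6f}")
```

With the same bit budget, the quantization step scales with the maximum activation magnitude, so the wide-range tensor incurs a far larger round-trip error.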
The intensive computation of Automatic Speech Recognition (ASR) models prevents their deployment on mobile devices.
The operator is known to be well-posed for problems with finite states, but our analysis shows that it is also well-defined for the contractive models with infinite state spaces studied here.
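A sketch of the underlying argument, assuming the operator $T$ is a $\gamma$-contraction on a complete metric space of value functions (the notation here is generic, not the paper's):

```latex
% Generic contraction argument (notation is ours, not the paper's).
% If T is a gamma-contraction on a complete metric space (V, d):
\[
  d(Tv, Tw) \le \gamma\, d(v, w), \qquad 0 \le \gamma < 1,
\]
% then by the Banach fixed-point theorem T admits a unique fixed point
% v^*, and iterating T converges to it regardless of whether the
% underlying state space is finite or infinite:
\[
  d(T^{k} v, v^{*}) \le \gamma^{k}\, d(v, v^{*}) \longrightarrow 0
  \quad \text{as } k \to \infty .
\]
```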
More specifically, we introduce a novel architecture-controlling module in each layer that encodes the network architecture as a vector.
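A minimal sketch of what such a per-layer module might look like in PyTorch; the class name, the binary width encoding, and the channel-masking mechanism are all assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ArchControl(nn.Module):
    """Hypothetical per-layer module: encodes the chosen sub-network
    width as a vector and masks the layer's channels accordingly."""

    def __init__(self, max_channels: int):
        super().__init__()
        self.max_channels = max_channels

    def forward(self, x: torch.Tensor, width: int) -> torch.Tensor:
        # Architecture vector: 1 for kept channels, 0 for disabled ones.
        arch_vec = torch.zeros(self.max_channels, device=x.device)
        arch_vec[:width] = 1.0
        # Broadcast over (batch, channel, H, W) to zero out unused channels.
        return x * arch_vec.view(1, -1, 1, 1)

# Usage: the same backbone layer can emulate different widths.
layer = nn.Conv2d(3, 64, 3, padding=1)
ctrl = ArchControl(max_channels=64)
x = torch.randn(2, 3, 32, 32)
y = ctrl(layer(x), width=48)   # behave as if the layer had 48 filters
```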
In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up computation and reduce the memory overhead of CNNs, which is well supported by various off-the-shelf deep learning libraries.
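A hedged sketch of structured sparsity regularization in the spirit of SSR, using a group-Lasso penalty over whole filters (one common instantiation; the paper's exact regularizer may differ):

```python
import torch
import torch.nn as nn

def filter_group_lasso(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """Group-Lasso penalty: the L2 norm of each conv filter, summed.
    Driving a whole group to zero removes an entire filter, which is
    what makes the resulting sparsity structured and library-friendly."""
    penalty = torch.zeros(())
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight: [out_ch, in_ch, kH, kW]; one group per output filter
            penalty = penalty + m.weight.flatten(1).norm(dim=1).sum()
    return lam * penalty

# Training-step sketch: add the structured penalty to the task loss.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
x, target = torch.randn(4, 3, 32, 32), torch.randn(4, 32, 28, 28)
loss = nn.functional.mse_loss(model(x), target) + filter_group_lasso(model)
loss.backward()
```

Because entire filters are zeroed rather than scattered weights, the pruned model reduces to smaller dense convolutions that standard libraries execute efficiently without sparse kernels.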
The relationship between input feature maps and 2D kernels is revealed in a theoretical framework, based on which a kernel sparsity and entropy (KSE) indicator is proposed to quantify feature-map importance in a feature-agnostic manner and guide model compression.
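A simplified sketch of a KSE-style indicator: here sparsity is the summed L1 norm of the 2D kernels attached to an input channel, entropy is estimated from a histogram of those kernels' norms (a crude stand-in for a density-based estimate), and the combination below is illustrative rather than the paper's exact formula:

```python
import torch

def kse_scores(weight: torch.Tensor, bins: int = 10, alpha: float = 1.0):
    """Sketch of a KSE-style channel score for a conv weight of shape
    [out_ch, in_ch, kH, kW]. Both the entropy estimator and the final
    combination are assumptions for illustration."""
    out_ch, in_ch = weight.shape[:2]
    scores = []
    for c in range(in_ch):
        kernels = weight[:, c]                        # [out_ch, kH, kW]
        sparsity = kernels.abs().sum()                # kernel sparsity term
        norms = kernels.flatten(1).norm(dim=1)
        hist = torch.histc(norms, bins=bins)
        p = hist / hist.sum()
        entropy = -(p[p > 0] * p[p > 0].log()).sum()  # kernel entropy term
        scores.append(torch.sqrt(sparsity / (1.0 + alpha * entropy)))
    return torch.stack(scores)                        # one score per input channel

w = torch.randn(64, 32, 3, 3)
print(kse_scores(w).shape)  # torch.Size([32]): importance of each input channel
```

Note that the score depends only on the weights, not on any input data, which is what makes the indicator feature-agnostic.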