HoloFormer: Deep Compression of Pre-Trained Transformers via Unified Optimization of N:M Sparsity and Integer Quantization

29 Sep 2021 · Minjia Zhang, Connor Holmes, Yuxiong He, Bo Wu

In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many Natural Language Processing (NLP) tasks. However, the huge size of these models brings significant challenges to fine-tuning and online deployment due to latency and cost constraints. Recently, hardware manufacturers have released new architectures that support efficient N:M sparsity and low-precision integer computation for fast inference. In contrast to unstructured sparsity, N:M sparsity requires that in each group of M contiguous weight parameters, exactly N parameters are non-zero. Moreover, these architectures also support processing data at reduced precision, such as INT8. Prior work typically induces N:M sparsity and integer quantization in isolation, or treats them as independent stages of a compression pipeline. However, a systematic investigation of how N:M sparsity and integer quantization can be effectively combined to exploit the maximum degree of redundancy and enable even faster acceleration of pre-trained Transformer networks is still lacking. In this work, we propose a unified, systematic approach to learning N:M sparsity and integer quantization for pre-trained Transformers using the Alternating Direction Method of Multipliers (ADMM). We show that N:M sparsity, integer quantization, and their combination can all be framed as non-convex constrained optimization problems and solved in a unified manner. When evaluated across the GLUE suite of NLP benchmarks, our approach outperforms baselines that consider each of these problems independently, retaining 99.4% of the dense baseline's accuracy while executing efficiently on newly released hardware.
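To make the combined constraint concrete, below is a minimal sketch (not the authors' implementation) of the two projection steps that an ADMM-style compression loop would alternate with gradient updates on the task loss: an N:M projection that keeps the N largest-magnitude values in every block of M contiguous weights, followed by symmetric per-tensor INT8 fake quantization. The function names project_nm_sparse and project_int8, and the per-tensor scaling choice, are illustrative assumptions rather than details taken from the paper.

import torch

def project_nm_sparse(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    # Keep the n largest-magnitude values in every block of m contiguous weights.
    rows, cols = weight.shape
    assert cols % m == 0, "each row must split evenly into blocks of m"
    blocks = weight.reshape(-1, m)
    # Zero out the (m - n) smallest-magnitude entries in each block.
    _, drop_idx = torch.topk(blocks.abs(), m - n, dim=1, largest=False)
    mask = torch.ones_like(blocks)
    mask.scatter_(1, drop_idx, 0.0)
    return (blocks * mask).reshape(rows, cols)

def project_int8(weight: torch.Tensor) -> torch.Tensor:
    # Snap weights onto a symmetric INT8 grid (fake quantization, per-tensor scale).
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    return torch.clamp(torch.round(weight / scale), -127, 127) * scale

# Joint projection: enforce the N:M pattern first, then quantize the surviving weights.
w = torch.randn(8, 16)
z = project_int8(project_nm_sparse(w, n=2, m=4))

In an ADMM formulation, projections like these would produce the auxiliary variable that the relaxed weights are pulled toward at each iteration, so the sparsity and quantization constraints are enforced jointly rather than in separate pipeline stages.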
