HoloFormer: Deep Compression of Pre-Trained Transformers via Unified Optimization of N:M Sparsity and Integer Quantization

29 Sep 2021 · Minjia Zhang, Connor Holmes, Yuxiong He, Bo Wu

In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many Natural Language Processing (NLP) tasks. However, the huge size of these models brings significant challenges to fine-tuning and online deployment due to latency and cost constraints. Recently, hardware manufacturers have released new architectures that support efficient N:M sparsity and low-precision integer computation for fast inference. In contrast to unstructured sparsity, N:M sparsity requires that in each group of M contiguous weight parameters, exactly N parameters are non-zero. Moreover, these architectures also support processing data at reduced precision, such as INT8. Prior work typically induces N:M sparsity and integer quantization in isolation, or treats them as independent stages of a compression pipeline. However, a systematic investigation of how N:M sparsity and integer quantization can be effectively combined to exploit the maximum degree of redundancy and enable even faster acceleration of pre-trained Transformer networks is still lacking. In this work, we propose a unified, systematic approach to learning N:M sparsity and integer quantization for pre-trained Transformers using the Alternating Direction Method of Multipliers (ADMM). We show that N:M sparsity, integer quantization, and their combination can all be framed as non-convex constrained optimization problems and solved in a unified manner. When evaluated across the GLUE suite of NLP benchmarks, our approach outperforms baselines that consider each of these problems independently, retaining 99.4% of the dense baseline's accuracy while executing efficiently on newly released hardware.
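To make the combined constraint concrete, below is a minimal sketch (not the authors' implementation) of the two projection steps that an ADMM-style compression loop would alternate with gradient updates on the task loss: an N:M projection that keeps the N largest-magnitude values in every block of M contiguous weights, followed by symmetric per-tensor INT8 fake quantization. The function names project_nm_sparse and project_int8, and the per-tensor scaling choice, are illustrative assumptions rather than details taken from the paper.

import torch

def project_nm_sparse(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    # Keep the n largest-magnitude values in every block of m contiguous weights.
    rows, cols = weight.shape
    assert cols % m == 0, "each row must split evenly into blocks of m"
    blocks = weight.reshape(-1, m)
    # Zero out the (m - n) smallest-magnitude entries in each block.
    _, drop_idx = torch.topk(blocks.abs(), m - n, dim=1, largest=False)
    mask = torch.ones_like(blocks)
    mask.scatter_(1, drop_idx, 0.0)
    return (blocks * mask).reshape(rows, cols)

def project_int8(weight: torch.Tensor) -> torch.Tensor:
    # Snap weights onto a symmetric INT8 grid (fake quantization, per-tensor scale).
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    return torch.clamp(torch.round(weight / scale), -127, 127) * scale

# Joint projection: enforce the N:M pattern first, then quantize the surviving weights.
w = torch.randn(8, 16)
z = project_int8(project_nm_sparse(w, n=2, m=4))

In an ADMM formulation, projections like these would produce the auxiliary variable that the relaxed weights are pulled toward at each iteration, so the sparsity and quantization constraints are enforced jointly rather than in separate pipeline stages.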
