TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	C4	Zeropoint LLM.int8 13B (vector-wise + decomp)	Perplexity	12.45	# 2
Language Modelling	C4	LLM.float32 6.7B	Perplexity	13.3	# 5
Language Modelling	C4	LLM.float32 2.7B	Perplexity	14.43	# 6
Language Modelling	C4	LLM.float32 1.3B	Perplexity	15.91	# 9
Linguistic Acceptability	CoLA	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Accuracy	68.6%	# 17
Semantic Textual Similarity	MRPC	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Accuracy	91.0%	# 7
Natural Language Inference	MultiNLI	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Matched	90.2	# 10
Natural Language Inference	QNLI	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Accuracy	94.7%	# 13
Natural Language Inference	RTE	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Accuracy	85.4%	# 25
Sentiment Analysis	SST-2 Binary classification	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Accuracy	96.4	# 16
Semantic Textual Similarity	STS Benchmark	RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned)	Pearson Correlation	0.919	# 9

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/language-modelling-on-c4)](https://paperswithcode.com/sota/language-modelling-on-c4?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/semantic-textual-similarity-on-mrpc)](https://paperswithcode.com/sota/semantic-textual-similarity-on-mrpc?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/semantic-textual-similarity-on-sts-benchmark)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts-benchmark?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/natural-language-inference-on-multinli)](https://paperswithcode.com/sota/natural-language-inference-on-multinli?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/natural-language-inference-on-qnli)](https://paperswithcode.com/sota/natural-language-inference-on-qnli?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/sentiment-analysis-on-sst-2-binary)](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/linguistic-acceptability-on-cola)](https://paperswithcode.com/sota/linguistic-acceptability-on-cola?p=llm-int8-8-bit-matrix-multiplication-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llm-int8-8-bit-matrix-multiplication-for/natural-language-inference-on-rte)](https://paperswithcode.com/sota/natural-language-inference-on-rte?p=llm-int8-8-bit-matrix-multiplication-for)`

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

15 Aug 2022 · Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of values are still multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.
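The two-part procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a plain-Python toy, assuming absmax scaling to the range [-127, 127], an outlier magnitude threshold of 6.0 (an assumed default, not stated on this page), and ordinary Python floats standing in for the 16-bit path. The names `absmax_quantize` and `mixed_precision_matmul_sketch` are hypothetical.

```python
def absmax_quantize(vec):
    """Scale a vector into the int8 range [-127, 127]; return (ints, scale)."""
    # An all-zero or empty vector gets scale 1.0 to avoid dividing by zero.
    scale = max((abs(v) for v in vec), default=0.0) / 127.0 or 1.0
    return [round(v / scale) for v in vec], scale


def mixed_precision_matmul_sketch(X, W, threshold=6.0):
    """Approximate X @ W, where X is (rows x k) and W is (k x cols).

    Feature dimensions of X containing any value with magnitude >= threshold
    are treated as outliers and multiplied in higher precision; all other
    dimensions go through vector-wise 8-bit quantization.
    """
    k, n_cols = len(W), len(W[0])
    is_outlier = [any(abs(row[j]) >= threshold for row in X) for j in range(k)]
    regular = [j for j in range(k) if not is_outlier[j]]
    outliers = [j for j in range(k) if is_outlier[j]]

    # Vector-wise quantization: one scale per row of X and one per column
    # of W, i.e. a separate pair of constants for each inner product.
    Xq = [absmax_quantize([row[j] for j in regular]) for row in X]
    Wq = [absmax_quantize([W[j][c] for j in regular]) for c in range(n_cols)]

    out = []
    for r, row in enumerate(X):
        xq, sx = Xq[r]
        out_row = []
        for c in range(n_cols):
            wq, sw = Wq[c]
            int_acc = sum(a * b for a, b in zip(xq, wq))   # integer accumulation
            high = sum(row[j] * W[j][c] for j in outliers)  # higher-precision path
            out_row.append(int_acc * sx * sw + high)        # dequantize and merge
        out.append(out_row)
    return out


if __name__ == "__main__":
    X = [[0.3, -0.7, 20.0, 0.5]]          # third feature dimension is an outlier
    W = [[0.2], [0.4], [0.1], [-0.9]]
    print("with decomposition:   ", mixed_precision_matmul_sketch(X, W)[0][0])
    print("without decomposition:",
          mixed_precision_matmul_sketch(X, W, threshold=float("inf"))[0][0])
```

In this toy input the exact product is 1.33. Setting the threshold to infinity disables the decomposition, so the single large value inflates the row's scale and the small values lose most of their resolution; with the decomposition enabled, the result lands much closer to the exact value.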

Code

huggingface/transformers-bloom-infe…	547
kohjingyu/fromage	455
alextmallen/adaptive-retrieval	142

Tasks

Language Modelling

Linguistic Acceptability

Natural Language Inference

Quantization

Semantic Textual Similarity

Sentiment Analysis

Datasets

GLUE

SST

MultiNLI

SST-2

QNLI

MRPC

CoLA

RTE

STS Benchmark

Results from the Paper

Ranked #2 on Language Modelling on C4

