TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Question Answering	PIQA	MT-NLG 530B (0-shot)	Accuracy	82.0	# 18
Reading Comprehension	RACE	Megatron-BERT	Accuracy	89.5	# 4
Reading Comprehension	RACE	Megatron-BERT	Accuracy (High)	88.6	# 3
Reading Comprehension	RACE	Megatron-BERT	Accuracy (Middle)	91.8	# 2
Reading Comprehension	RACE	Megatron-BERT (ensemble)	Accuracy	90.9	# 2
Reading Comprehension	RACE	Megatron-BERT (ensemble)	Accuracy (High)	90.0	# 2
Reading Comprehension	RACE	Megatron-BERT (ensemble)	Accuracy (Middle)	93.1	# 1
Language Modelling	WikiText-103	Megatron-LM	Test perplexity	10.81	# 3
Language Modelling	WikiText-103	Megatron-LM	Number of params	8300M	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/megatron-lm-training-multi-billion-parameter/reading-comprehension-on-race)](https://paperswithcode.com/sota/reading-comprehension-on-race?p=megatron-lm-training-multi-billion-parameter)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/megatron-lm-training-multi-billion-parameter/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=megatron-lm-training-multi-billion-parameter)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/megatron-lm-training-multi-billion-parameter/question-answering-on-piqa)](https://paperswithcode.com/sota/question-answering-on-piqa?p=megatron-lm-training-multi-billion-parameter)`

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

17 Sep 2019 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer-based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single-GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText-103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
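
The intra-layer model parallel idea in the abstract can be illustrated with a small, single-process sketch. The abstract does not spell out the partitioning, so this example assumes the split commonly described for the transformer MLP block: the first weight matrix is divided by columns, the second by rows, and the per-device partial outputs are summed (the role an all-reduce would play across GPUs). The function names, the `world_size` parameter, and the use of plain Python lists instead of PyTorch tensors are all illustrative choices, not the authors' code.

```python
import math


def gelu(x):
    # Tanh approximation of GELU; applied elementwise.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(a, b):
    # Plain-Python matrix product of a (n x k) and b (k x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mlp_serial(x, w1, w2):
    # Reference MLP block: Z = GELU(X @ W1) @ W2 on a single device.
    h = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(h, w2)


def mlp_tensor_parallel(x, w1, w2, world_size):
    # Hypothetical simulation of `world_size` devices in one process.
    hidden = len(w2)
    assert hidden % world_size == 0, "hidden size must divide evenly"
    shard = hidden // world_size

    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w1_cols = [row[lo:hi] for row in w1]  # column slice of W1
        w2_rows = w2[lo:hi]                   # matching row slice of W2
        # GELU is elementwise, so each rank can apply it to its own
        # columns without talking to the other ranks.
        h = [[gelu(v) for v in row] for row in matmul(x, w1_cols)]
        partials.append(matmul(h, w2_rows))

    # Stand-in for the all-reduce: sum the partial outputs of all ranks.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
    w1 = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8], [-0.9, 1.0, 1.1, -1.2]]
    w2 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6], [0.7, -0.8]]
    print(mlp_serial(x, w1, w2))
    print(mlp_tensor_parallel(x, w1, w2, world_size=2))
```

Under these assumptions the forward pass needs only one summation across devices per MLP block, which is consistent with the abstract's claim that the approach requires inserting just a few communication operations.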

Code

Repository	Stars
NVIDIA/Megatron-LM (official)	8,472
PaddlePaddle/PaddleNLP	11,370
kingoflolz/mesh-transformer-jax	6,213
facebookresearch/fairscale	2,863
nvidia/transformerengine	1,409

Showing 5 of 10 implementations.

Tasks

LAMBADA

Language Modelling

Question Answering

Reading Comprehension

Datasets

GLUE

SQuAD

WikiText-2

WikiText-103

RACE

PIQA

LAMBADA

CC-Stories

Results from the Paper

Ranked #2 on Reading Comprehension on RACE

Methods

Absolute Position Encodings • Adam • Attention Dropout • BERT • BPE • Cosine Annealing • Dense Connections • Discriminative Fine-Tuning • Dropout • GELU • GPT-2 • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Linear Warmup With Linear Decay • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Weight Decay • WordPiece
