Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

17 Sep 2019 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. The approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate the approach by converging transformer-based models up to 8.3 billion parameters on 512 GPUs and achieve state-of-the-art results on language modeling benchmarks, including WikiText-103.
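To make the intra-layer (tensor) model parallelism mentioned above concrete, the following is a minimal sketch of how a transformer MLP block can be split across GPUs: the first linear layer is partitioned column-wise, the second row-wise, so the forward pass needs only a single all-reduce. The class and shapes here are illustrative assumptions, not the paper's actual implementation, and the autograd handling of the communication op is deliberately simplified (a real implementation wraps the all-reduce in an autograd-aware function).

```python
# Minimal sketch of intra-layer model parallelism for a transformer MLP block.
# Assumes torch.distributed has already been initialized with `world_size` ranks
# and that each rank holds only its shard of the feed-forward weights.
import torch
import torch.nn as nn
import torch.distributed as dist


class ParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0, "FFN width must divide evenly across ranks"
        shard = ffn_size // world_size
        # Column-parallel: each rank owns a slice of the FFN output features.
        self.fc1 = nn.Linear(hidden_size, shard)
        # Row-parallel: each rank owns the matching slice of the input features.
        # Bias is omitted here so the summed partial outputs do not double-count it.
        self.fc2 = nn.Linear(shard, hidden_size, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes a partial output from its shard of the weights...
        partial = self.fc2(self.act(self.fc1(x)))
        # ...and one all-reduce sums the partial outputs across ranks: this is the
        # kind of "few communication operations" the abstract refers to.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

The self-attention block is split analogously in the same spirit: the query/key/value projections are partitioned along the head dimension (column-parallel) and the output projection row-wise, so each transformer layer again needs only one all-reduce per forward pass.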


Results from the Paper


 SOTA for Language Modelling on WikiText-103 (using extra training data)

Task                Dataset        Model        Metric Name        Metric Value   Global Rank   Uses Extra Training Data
Language Modelling  WikiText-103   Megatron-LM  Test perplexity    10.81          #1            Yes
Language Modelling  WikiText-103   Megatron-LM  Number of params   8300M          #1            Yes