Efficient Language Modeling with Sparse all-MLP

14 Mar 2022 · Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2$\times$ improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
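The claim that sparse activation increases capacity while keeping compute constant can be illustrated with a toy example. The sketch below is a generic top-1 token-routing layer written for illustration only; it is not the paper's sMLP and does not implement either of the two routing strategies the abstract mentions. All names (`make_expert`, `top1_route`, `moe_forward`, `param_count`) are hypothetical, and the multiply-add count deliberately ignores the small cost of the gate itself.

```python
import random

def make_expert(d_in, d_out, rng):
    """Return a random d_out x d_in weight matrix (list of rows)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def top1_route(gate_w, x):
    """Assumed gating rule: pick the expert with the highest linear gate score."""
    scores = matvec(gate_w, x)
    return max(range(len(scores)), key=scores.__getitem__)

def moe_forward(experts, gate_w, tokens):
    """Apply exactly one expert per token.

    Returns the outputs and the number of expert multiply-adds
    (gate cost excluded).
    """
    outputs, mult_adds = [], 0
    for x in tokens:
        w = experts[top1_route(gate_w, x)]
        outputs.append(matvec(w, x))
        mult_adds += len(w) * len(x)  # only the chosen expert does work
    return outputs, mult_adds

def param_count(experts):
    """Total number of expert weights."""
    return sum(len(w) * len(w[0]) for w in experts)

rng = random.Random(0)
d = 4
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(8)]
for num_experts in (1, 4, 16):
    experts = [make_expert(d, d, rng) for _ in range(num_experts)]
    gate_w = make_expert(d, num_experts, rng)  # one gate row per expert
    _, mult_adds = moe_forward(experts, gate_w, tokens)
    print(num_experts, param_count(experts), mult_adds)
```

In this toy setting the expert parameter count grows linearly with the number of experts, while the expert multiply-adds per batch stay the same, because each token visits only one expert.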


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | COPA | Base Layers 10B (0-shot) | Accuracy | 63 | #55 |
| Question Answering | COPA | HASH Layers 10B (0-shot) | Accuracy | 64 | #54 |
| Question Answering | COPA | sMLP – deterministic 9.4B (0-shot) | Accuracy | 79 | #38 |
| Question Answering | COPA | Switch Transformer 9B | Accuracy | 75 | #45 |
| Question Answering | COPA | GShard 9B | Accuracy | 76 | #44 |
| Sentence Completion | HellaSwag | HASH Layers 10B (0-shot) | Accuracy | 33 | #79 |
| Sentence Completion | HellaSwag | GShard 9B | Accuracy | 38 | #75 |
| Sentence Completion | HellaSwag | Switch Transformer 9B | Accuracy | 52.5 | #60 |
| Sentence Completion | HellaSwag | sMLP – deterministic 9.4B (0-shot) | Accuracy | 54.5 | #59 |
| Sentence Completion | HellaSwag | Base Layers 10B (0-shot) | Accuracy | 30.2 | #83 |
| Question Answering | PIQA | sMLP – deterministic 9.4B (0-shot) | Accuracy | 73 | #46 |
| Question Answering | PIQA | GShard 9B | Accuracy | 68.1 | #54 |
| Question Answering | PIQA | HASH Layers 10B (0-shot) | Accuracy | 63.8 | #58 |
| Question Answering | PIQA | Base Layers 10B (0-shot) | Accuracy | 63.8 | #58 |
| Common Sense Reasoning | ReCoRD | Base Layers 10B (0-shot) | EM | 60.7 | #30 |
| Common Sense Reasoning | ReCoRD | GShard 9B | EM | 72.4 | #24 |
| Common Sense Reasoning | ReCoRD | sMLP – deterministic 9.4B (0-shot) | EM | 73.4 | #22 |
| Common Sense Reasoning | ReCoRD | Switch Transformer 9B | EM | 79.9 | #19 |
| Common Sense Reasoning | ReCoRD | HASH Layers 10B (0-shot) | EM | 67.2 | #28 |
| Question Answering | StoryCloze | Switch Transformer 9B | Accuracy | 73.3 | #18 |
| Question Answering | StoryCloze | sMLP – deterministic 9.4B (0-shot) | Accuracy | 74.7 | #17 |
| Question Answering | StoryCloze | Base Layers 10B (0-shot) | Accuracy | 61.4 | #22 |
| Question Answering | StoryCloze | HASH Layers 10B (0-shot) | Accuracy | 64.7 | #21 |
| Question Answering | StoryCloze | GShard 9B | Accuracy | 67.9 | #20 |
| Common Sense Reasoning | WinoGrande | Base Layers 10B (0-shot) | Accuracy | 51 | #71 |
| Common Sense Reasoning | WinoGrande | Switch Transformer 9B (0-shot) | Accuracy | 53.4 | #65 |
| Common Sense Reasoning | WinoGrande | GShard 9B (0-shot) | Accuracy | 51.1 | #70 |
| Common Sense Reasoning | WinoGrande | sMLP – deterministic 9.4B (0-shot) | Accuracy | 54.3 | #64 |
| Common Sense Reasoning | WinoGrande | HASH Layers 10B (0-shot) | Accuracy | 51.7 | #69 |

Methods