Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with \textit{few-shot in-context} learning. In this work, we present our practice of training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1 TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.
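The abstract names five parallelism dimensions composed through MindSpore Auto-parallel. The sketch below shows, in general terms, how such a composition can be expressed through MindSpore's auto-parallel context; it is a minimal illustration, not the authors' released training script, and the stage count and cluster size here are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of composing parallelism dimensions in MindSpore.
# NOT the paper's actual configuration; values are illustrative.
from mindspore import context

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

context.set_auto_parallel_context(
    parallel_mode=context.ParallelMode.SEMI_AUTO_PARALLEL,
    device_num=2048,                 # total Ascend 910 processors in the cluster
    pipeline_stages=16,              # pipeline model parallelism (illustrative value)
    enable_parallel_optimizer=True,  # optimizer model parallelism (shards optimizer states)
    full_batch=True,                 # dataset is loaded as a full global batch
)

# In semi-auto mode, data parallelism and op-level model parallelism are
# expressed by per-operator shard() strategies on the network's ops (not shown).
# Rematerialization (recompute) trades compute for memory and is enabled
# per layer, e.g. transformer_block.recompute() on an nn.Cell.
```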

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Common Sense Reasoning (One-Shot) | C3 | PanGu-α 2.6B | Accuracy | 52.82 | # 1 |
| Common Sense Reasoning (Zero-Shot) | C3 | PanGu-α 2.6B | Accuracy | 53.42 | # 1 |
| Common Sense Reasoning (Few-Shot) | C3 | PanGu-α 2.6B | Accuracy | 53.64 | # 1 |
| Cloze (multi-choices) (Few-Shot) | ChID | PanGu-α 2.6B | Accuracy | 66.56 | # 1 |
| Cloze (multi-choices) (Zero-Shot) | ChID | PanGu-α 2.6B | Accuracy | 68.73 | # 1 |
| Cloze (multi-choices) (One-Shot) | ChID | PanGu-α 2.6B | Accuracy | 68.16 | # 1 |
| Cloze (multi-choices) (One-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 38.0 | # 1 |
| Cloze (multi-choices) (Zero-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 37.83 | # 1 |
| Cloze (multi-choices) (Few-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 36.33 | # 1 |
| Reading Comprehension (One-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 18.57 | # 1 |
| Reading Comprehension (One-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 2.49 | # 1 |
| Reading Comprehension (Few-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 23.22 | # 1 |
| Reading Comprehension (Few-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 5.68 | # 1 |
| Reading Comprehension (Zero-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 16.647 | # 1 |
| Reading Comprehension (Zero-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 1.21 | # 1 |
| Cloze (multi-choices) (Few-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 62.42 | # 1 |
| Cloze (multi-choices) (One-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 61.54 | # 1 |
| Cloze (multi-choices) (Zero-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 61.93 | # 1 |
| Reading Comprehension (Few-Shot) | DRCD | PanGu-α 2.6B | EM | 5.31 | # 1 |
| Reading Comprehension (Few-Shot) | DRCD | PanGu-α 2.6B | F1 | 18.29 | # 1 |
| Reading Comprehension (One-Shot) | DRCD | PanGu-α 2.6B | EM | 2.47 | # 1 |
| Reading Comprehension (One-Shot) | DRCD | PanGu-α 2.6B | F1 | 12.58 | # 1 |
| Reading Comprehension (Zero-Shot) | DRCD | PanGu-α 2.6B | EM | 0.8 | # 1 |
| Reading Comprehension (Zero-Shot) | DRCD | PanGu-α 2.6B | F1 | 9.99 | # 1 |
| Reading Comprehension (One-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 20.18 | # 1 |
| Reading Comprehension (Zero-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 21.07 | # 1 |
| Reading Comprehension (Few-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 21.43 | # 1 |
| Natural Language Inference (One-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 44.0 | # 1 |
| Natural Language Inference (Few-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 46.78 | # 1 |
| Natural Language Inference (Zero-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 42.61 | # 1 |
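The (Zero-Shot), (One-Shot), and (Few-Shot) labels in the table above refer to how many in-context demonstrations are prepended to each query; the model's weights are never updated. The sketch below illustrates that distinction in general terms; the prompt template and the `model.generate` call are hypothetical, not the paper's exact evaluation protocol.

```python
# Minimal sketch of zero-/one-/few-shot prompt construction for in-context
# learning. Template and generation interface are hypothetical placeholders.
train_examples = [
    ("天空是什么颜色?", "蓝色"),       # "What color is the sky?" -> "Blue"
    ("一年有几个月?", "十二个月"),     # "How many months in a year?" -> "Twelve"
]
test_question = "水的化学式是什么?"    # "What is the chemical formula of water?"

def build_prompt(examples, query, k):
    """Prepend k demonstrations (k=0: zero-shot, k=1: one-shot, k>1: few-shot)."""
    demos = "".join(f"问题:{q}\n答案:{a}\n" for q, a in examples[:k])
    return f"{demos}问题:{query}\n答案:"

zero_shot_prompt = build_prompt(train_examples, test_question, k=0)
one_shot_prompt = build_prompt(train_examples, test_question, k=1)
few_shot_prompt = build_prompt(train_examples, test_question, k=2)
# completion = model.generate(few_shot_prompt)  # hypothetical generation call
```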
