PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

26 Apr 2021  ·  Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, Yonghong Tian

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice of training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.
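
The sketch below illustrates how the five parallelism dimensions named in the abstract map onto MindSpore's auto-parallel API. It is a minimal illustration, not the authors' actual training script: the pipeline stage count, shard strategy, and the stand-in layer are assumptions made for the example.

```python
# Illustrative sketch (not the paper's training code) of composing the five
# parallelism dimensions with MindSpore's auto-parallel context.
from mindspore import context, nn, ops
from mindspore.context import ParallelMode

# Data, op-level, pipeline, and optimizer parallelism are configured through
# the auto-parallel context; MindSpore composes them across all devices.
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,  # enables op-level sharding
    device_num=2048,                  # cluster size from the paper
    pipeline_stages=16,               # pipeline model parallelism (assumed value)
    enable_parallel_optimizer=True,   # optimizer model parallelism
    full_batch=True,                  # global batch split for data parallelism
)

# Op-level model parallelism: shard a single operator across devices.
# Here the second matmul input is split along its last dimension over 8 devices
# (an illustrative strategy, not the one used for PanGu-alpha).
matmul = ops.MatMul().shard(((1, 1), (1, 8)))

# Rematerialization: recompute this cell's activations during the backward
# pass instead of storing them, trading compute for memory.
block = nn.Dense(1024, 1024)  # stand-in for a transformer layer
block.recompute()
```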

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning (Zero-Shot) | C3 | PanGu-α 2.6B | Accuracy | 53.42 | #1 |
| Common Sense Reasoning (One-Shot) | C3 | PanGu-α 2.6B | Accuracy | 52.82 | #1 |
| Common Sense Reasoning (Few-Shot) | C3 | PanGu-α 2.6B | Accuracy | 53.64 | #1 |
| Cloze (multi-choices) (Zero-Shot) | ChID | PanGu-α 2.6B | Accuracy | 68.73 | #1 |
| Cloze (multi-choices) (One-Shot) | ChID | PanGu-α 2.6B | Accuracy | 68.16 | #1 |
| Cloze (multi-choices) (Few-Shot) | ChID | PanGu-α 2.6B | Accuracy | 66.56 | #1 |
| Cloze (multi-choices) (Zero-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 37.83 | #1 |
| Cloze (multi-choices) (One-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 38.0 | #1 |
| Cloze (multi-choices) (Few-Shot) | CMRC 2017 | PanGu-α 2.6B | Accuracy | 36.33 | #1 |
| Reading Comprehension (Zero-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 16.647 | #1 |
| Reading Comprehension (Zero-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 1.21 | #1 |
| Reading Comprehension (One-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 18.57 | #1 |
| Reading Comprehension (One-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 2.49 | #1 |
| Reading Comprehension (Few-Shot) | CMRC 2018 | PanGu-α 2.6B | F1 | 23.22 | #1 |
| Reading Comprehension (Few-Shot) | CMRC 2018 | PanGu-α 2.6B | EM | 5.68 | #1 |
| Cloze (multi-choices) (Zero-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 61.93 | #1 |
| Cloze (multi-choices) (One-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 61.54 | #1 |
| Cloze (multi-choices) (Few-Shot) | CMRC 2019 | PanGu-α 2.6B | Accuracy | 62.42 | #1 |
| Reading Comprehension (Zero-Shot) | DRCD | PanGu-α 2.6B | EM | 0.8 | #1 |
| Reading Comprehension (Zero-Shot) | DRCD | PanGu-α 2.6B | F1 | 9.99 | #1 |
| Reading Comprehension (One-Shot) | DRCD | PanGu-α 2.6B | EM | 2.47 | #1 |
| Reading Comprehension (One-Shot) | DRCD | PanGu-α 2.6B | F1 | 12.58 | #1 |
| Reading Comprehension (Few-Shot) | DRCD | PanGu-α 2.6B | EM | 5.31 | #1 |
| Reading Comprehension (Few-Shot) | DRCD | PanGu-α 2.6B | F1 | 18.29 | #1 |
| Reading Comprehension (Zero-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 21.07 | #1 |
| Reading Comprehension (One-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 20.18 | #1 |
| Reading Comprehension (Few-Shot) | DuReader | PanGu-α 2.6B | ROUGE-1 | 21.43 | #1 |
| Natural Language Inference (Zero-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 42.61 | #1 |
| Natural Language Inference (One-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 44.0 | #1 |
| Natural Language Inference (Few-Shot) | OCNLI | PanGu-α 2.6B | Accuracy | 46.78 | #1 |
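
The Zero-, One-, and Few-Shot labels above refer to the number of in-context demonstrations prepended to the test query: the model sees k labeled examples in its prompt and generates the answer without any gradient updates. The sketch below shows the general pattern of prompt construction; the function, prompt template, and example data are hypothetical illustrations, not the paper's evaluation harness.

```python
# Hypothetical sketch of k-shot in-context prompt construction for a
# Chinese QA-style task; no parameters are updated, only the prompt changes.
def build_prompt(demonstrations, query, k):
    """Concatenate k in-context demonstrations with the test query."""
    parts = [f"问题:{q}\n答案:{a}" for q, a in demonstrations[:k]]
    parts.append(f"问题:{query}\n答案:")
    return "\n\n".join(parts)

demos = [("天空是什么颜色?", "蓝色"), ("一年有几个季节?", "四个")]
print(build_prompt(demos, "水的化学式是什么?", k=0))  # zero-shot: query only
print(build_prompt(demos, "水的化学式是什么?", k=1))  # one-shot: 1 demonstration
print(build_prompt(demos, "水的化学式是什么?", k=2))  # few-shot: k demonstrations
```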
