GLM-130B: An Open Bilingual Pre-trained Model

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and training divergence. In this paper, we describe the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B on a wide range of popular English benchmarks, an advantage not observed in OPT-175B or BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization, without quantization-aware training and with almost no performance loss, making it the first among 100B-scale models to do so. More importantly, this property enables effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs ever required for running 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B .
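The INT4 result above is post-training (no quantization-aware training): weights are simply mapped to 4-bit integers with a per-row scale. The paper's actual scheme lives in the linked repository; the sketch below is only an illustrative symmetric round-to-nearest INT4 quantizer on synthetic data, with all names and values our own.

```python
import numpy as np

def quantize_int4(w):
    """Quantize a weight matrix to INT4 with per-output-row absmax scaling.

    Each row is scaled so its largest-magnitude entry maps to +/-7,
    the symmetric INT4 range used here for illustration.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # stored 4-bit codes
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float matrix from codes and scales."""
    return q.astype(np.float32) * scale

# Synthetic weights stand in for a real layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print(q.min(), q.max(), float(np.abs(w - w_hat).max()))
```

With round-to-nearest, the per-element reconstruction error is bounded by half the row scale; the paper's observation is that for GLM-130B this level of weight error costs almost no benchmark accuracy.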


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Language Modelling | BIG-bench-lite | GLM-130B (0-shot) | Accuracy | 13.31 | # 3 |
| Language Modelling | BIG-bench-lite | GLM-130B (1-shot) | Accuracy | 14.91 | # 2 |
| Language Modelling | BIG-bench-lite | GLM-130B (3-shot) | Accuracy | 15.12 | # 1 |
| Language Modelling | CLUE (AFQMC) | GLM-130B | Accuracy | 71.2 | # 1 |
| Language Modelling | CLUE (AFQMC) | ERNIE 3.0 Titan-260B | Accuracy | 69.0 | # 2 |
| Language Modelling | CLUE (C3) | ERNIE 3.0 Titan-260B | Accuracy | 54.9 | # 2 |
| Language Modelling | CLUE (C3) | GLM-130B | Accuracy | 77.5 | # 1 |
| Language Modelling | CLUE (CMNLI) | ERNIE 3.0 Titan-260B | Accuracy | 51.7 | # 2 |
| Language Modelling | CLUE (CMNLI) | GLM-130B | Accuracy | 77.0 | # 1 |
| Language Modelling | CLUE (CMRC2018) | GLM-130B | Accuracy | 55.7 | # 1 |
| Language Modelling | CLUE (CMRC2018) | ERNIE 3.0 Titan-260B | Accuracy | 16.6 | # 2 |
| Language Modelling | CLUE (DRCD) | GLM-130B | Accuracy | 77.1 | # 1 |
| Language Modelling | CLUE (DRCD) | ERNIE 3.0 Titan-260B | Accuracy | 29.5 | # 2 |
| Language Modelling | CLUE (OCNLI_50K) | GLM-130B | Accuracy | 74.7 | # 1 |
| Language Modelling | CLUE (OCNLI_50K) | ERNIE 3.0 Titan-260B | Accuracy | 44.6 | # 2 |
| Language Modelling | CLUE (WSC1.1) | GLM-130B | Accuracy | 83.9 | # 1 |
| Language Modelling | CLUE (WSC1.1) | ERNIE 3.0 Titan-260B | Accuracy | 81.1 | # 2 |
| Language Modelling | FewCLUE (BUSTM) | ERNIE 3.0 Titan-260B | Accuracy | 64.4 | # 2 |
| Language Modelling | FewCLUE (BUSTM) | GLM-130B | Accuracy | 77.5 | # 1 |
| Language Modelling | FewCLUE (CHID-FC) | ERNIE 3.0 Titan-260B | Accuracy | 87.1 | # 2 |
| Language Modelling | FewCLUE (CHID-FC) | GLM-130B | Accuracy | 90.1 | # 1 |
| Language Modelling | FewCLUE (CLUEWSC-FC) | GLM-130B | Accuracy | 77.4 | # 1 |
| Language Modelling | FewCLUE (CLUEWSC-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.5 | # 2 |
| Language Modelling | FewCLUE (EPRSTMT) | GLM-130B | Accuracy | 92.5 | # 1 |
| Language Modelling | FewCLUE (EPRSTMT) | ERNIE 3.0 Titan-260B | Accuracy | 88.8 | # 2 |
| Language Modelling | FewCLUE (OCNLI-FC) | GLM-130B | Accuracy | 73.8 | # 1 |
| Language Modelling | FewCLUE (OCNLI-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.8 | # 2 |
| Language Modelling | LAMBADA | GLM-130B (bidirectional attention) | Accuracy | 80.2 | # 7 |
| Multi-task Language Understanding | MMLU | GLM-130B | Average (%) | 44.8 | # 26 |
| Language Modelling | The Pile | GLM-130B | Bits per byte | 0.634 | # 1 |
| Language Modelling | The Pile | GPT-3 | Bits per byte | 0.742 | # 4 |
| Language Modelling | The Pile | Jurassic-1 | Bits per byte | 0.65 | # 2 |
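The Pile rows report bits per byte (BPB), a tokenizer-independent perplexity measure: total cross-entropy in bits divided by the UTF-8 byte count of the evaluated text. A minimal sketch of the conversion, with purely illustrative numbers (the real token-to-byte ratio depends on the tokenizer and corpus):

```python
import math

def bits_per_byte(loss_nats_per_token, num_tokens, num_utf8_bytes):
    """Convert average loss in nats/token to bits per byte.

    BPB = (total nats / ln 2) / total bytes.
    """
    total_bits = loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_utf8_bytes

# Illustrative example: 2.0 nats/token, ~4 bytes of text per token.
bpb = bits_per_byte(2.0, num_tokens=1_000_000, num_utf8_bytes=4_000_000)
print(round(bpb, 3))  # → 0.721
```

Lower is better, which is why GLM-130B's 0.634 ranks above GPT-3's 0.742 despite being a smaller model.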
