CPM: A Large-scale Generative Chinese Pre-trained Language Model

Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB of training data, has drawn considerable attention for its capacity for few-shot (even zero-shot) learning...
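
In the few-shot setting, the task is specified entirely in the prompt: a handful of labeled examples are concatenated with an unlabeled query, and the generative model completes the pattern with no gradient updates. Below is a minimal sketch of this prompting pattern using the Hugging Face transformers API, with GPT-2 standing in for a large generative model such as GPT-3 or CPM; the model name, prompt text, and decoding settings are illustrative assumptions, not taken from the paper.

```python
# Minimal few-shot prompting sketch (illustrative; GPT-2 is a stand-in model).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: two labeled examples followed by an unlabeled query.
prompt = (
    "Review: The movie was wonderful. Sentiment: positive\n"
    "Review: I wasted two hours of my life. Sentiment: negative\n"
    "Review: A touching story with great acting. Sentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=3,                      # only a one-word completion is needed
    do_sample=False,                       # greedy decoding for determinism
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens after the prompt.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print(completion.strip())
```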

Methods used in the Paper


METHOD                               TYPE
Cosine Annealing                     Learning Rate Schedules
Layer Normalization                  Normalization
GELU                                 Activation Functions
Dropout                              Regularization
Linear Warmup With Cosine Annealing  Learning Rate Schedules
Strided Attention                    Attention Patterns
Residual Connection                  Skip Connections
Attention Dropout                    Regularization
Weight Decay                         Regularization
Fixed Factorized Attention           Attention Patterns
Multi-Head Attention                 Attention Modules
Dense Connections                    Feedforward Networks
Scaled Dot-Product Attention         Attention Mechanisms
Adam                                 Stochastic Optimization
Softmax                              Output Functions
BPE                                  Subword Segmentation
GPT-3                                Transformers
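
For illustration, here is a minimal NumPy sketch of two of the listed components: Scaled Dot-Product Attention (with Softmax applied to the attention scores) and Linear Warmup With Cosine Annealing. The function names, tensor shapes, and hyperparameters are assumptions made for the sketch, not values taken from the CPM implementation.

```python
# Illustrative sketch of two components from the methods table above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)     # (..., seq_q, seq_k)
    weights = softmax(scores, axis=-1)
    return weights @ V

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine annealing toward zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + np.cos(np.pi * progress))

# Tiny usage example with random tensors (batch=1, seq=4, d_k=8).
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4, 8))
K = rng.normal(size=(1, 4, 8))
V = rng.normal(size=(1, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)     # (1, 4, 8)
print(warmup_cosine_lr(step=100, total_steps=10000, warmup_steps=500, peak_lr=1e-4))
```

Scaling the dot products by sqrt(d_k) keeps the scores from growing with the key dimension, and subtracting the row-wise maximum before exponentiating keeps the softmax numerically stable.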