GPT-2 is a Transformer-based architecture that was notable for its size (1.5 billion parameters) at the time of its release. The model is pretrained on the WebText dataset, text scraped from 45 million web links. It largely follows the previous GPT architecture with some modifications (a code sketch follows the list):

- Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.
- A modified initialization that accounts for the accumulation on the residual path with model depth is used: the weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers.
- The vocabulary is expanded to 50,257 tokens, the context size is expanded from 512 to 1024 tokens, and a larger batch size of 512 is used.
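To make these modifications concrete, here is a minimal PyTorch sketch of a GPT-2-style pre-LayerNorm decoder block with the scaled residual initialization and the expanded vocabulary/context sizes. This is an illustrative approximation, not the reference implementation: the class and hyperparameter names (`PreLNBlock`, `GPT2Sketch`, `n_embd`, `n_layer`, etc.) are our own, and counting two residual layers per block (attention and MLP) for $N$ is one common reading of the paper.

```python
import math
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Decoder block with layer normalization at the *input* of each sub-block (pre-LN)."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x, attn_mask=None):
        # LayerNorm is applied before each sub-block, pre-activation style.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x


class GPT2Sketch(nn.Module):
    def __init__(self, vocab_size=50257, n_ctx=1024, n_embd=768, n_head=12, n_layer=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # expanded 50,257-token vocabulary
        self.pos_emb = nn.Embedding(n_ctx, n_embd)        # context size expanded to 1024
        self.blocks = nn.ModuleList(PreLNBlock(n_embd, n_head) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(n_embd)                  # additional layer norm after the final block
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

        # Scale residual-path projection weights by 1/sqrt(N), where N is the
        # number of residual layers (assumed here: 2 per block, attention + MLP).
        n_residual = 2 * n_layer
        for block in self.blocks:
            block.attn.out_proj.weight.data.mul_(1.0 / math.sqrt(n_residual))
            block.mlp[2].weight.data.mul_(1.0 / math.sqrt(n_residual))

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(t, t, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x = block(x, attn_mask=mask)
        return self.head(self.ln_f(x))
```

A forward pass on a batch of token ids of shape `(batch, t)` with `t <= 1024` returns logits over the 50,257-token vocabulary.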
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 159 | 19.25% |
| Text Generation | 83 | 10.05% |
| Sentence | 34 | 4.12% |
| Decoder | 31 | 3.75% |
| Question Answering | 24 | 2.91% |
| Retrieval | 18 | 2.18% |
| Large Language Model | 17 | 2.06% |
| In-Context Learning | 13 | 1.57% |
| Decision Making | 10 | 1.21% |