GPT-2

Introduced by Radford et al. in Language Models are Unsupervised Multitask Learners

GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) at the time of its release. The model is pretrained on the WebText dataset, which consists of text scraped from 45 million web links. It largely follows the previous GPT architecture, with some modifications:

  • Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block (see the sketch after this list).

  • A modified initialization is used that accounts for the accumulation on the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers.

  • The vocabulary is expanded to 50,257 tokens, the context size is expanded from 512 to 1024 tokens, and a larger batch size of 512 is used.
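The pre-norm block structure and the $1/\sqrt{N}$ residual scaling can be made concrete with a short sketch. This is a minimal PyTorch illustration under stated assumptions, not the reference implementation: the names (Block, scale_residual_weights) and the use of nn.MultiheadAttention are choices made here for brevity, while the width, depth, context, and vocabulary values follow the 1.5B-parameter configuration described above.

```python
# Minimal sketch of a GPT-2-style pre-norm Transformer block (not the official code).
import math
import torch
import torch.nn as nn

N_LAYER = 48          # residual layers in the 1.5B-parameter model
N_EMBD = 1600         # hidden width of the 1.5B-parameter model
N_HEAD = 25
CTX_LEN = 1024        # context size expanded from 512 to 1024 tokens
VOCAB_SIZE = 50257    # expanded vocabulary (unused here, shown for reference)


class Block(nn.Module):
    """Pre-norm residual block: LayerNorm is applied at the input of each sub-block."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize *before* each sub-block, then add the residual.
        h = self.ln_1(x)
        causal_mask = torch.triu(
            torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x


def scale_residual_weights(model: nn.Module, n_layer: int = N_LAYER) -> None:
    """Scale the output projection of each residual sub-block by 1/sqrt(N).

    Assumption: the residual-path output projections are the attention
    out_proj and the second MLP linear (indexed 'mlp.2' in this sketch).
    """
    for name, param in model.named_parameters():
        if name.endswith("attn.out_proj.weight") or name.endswith("mlp.2.weight"):
            with torch.no_grad():
                param.mul_(1.0 / math.sqrt(n_layer))


# Two blocks only, for brevity; the full model stacks N_LAYER of them.
blocks = nn.Sequential(*[Block(N_EMBD, N_HEAD) for _ in range(2)])
ln_f = nn.LayerNorm(N_EMBD)  # additional layer normalization after the final block
scale_residual_weights(blocks)

x = torch.randn(1, CTX_LEN, N_EMBD)
print(ln_f(blocks(x)).shape)  # torch.Size([1, 1024, 1600])
```

Scaling only the residual output projections keeps the variance of the summed residual stream roughly constant as depth grows, which is the motivation the paper gives for the modified initialization.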

Source: Language Models are Unsupervised Multitask Learners

Tasks


Task Papers Share
Language Modelling 159 19.25%
Text Generation 83 10.05%
Sentence 34 4.12%
Decoder 31 3.75%
Question Answering 24 2.91%
Retrieval 18 2.18%
Large Language Model 17 2.06%
In-Context Learning 13 1.57%
Decision Making 10 1.21%
