Introduced by Brown et al. in Language Models are Few-Shot Learners

GPT-3 is an autoregressive transformer language model with 175 billion parameters. It uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with one exception: its transformer layers alternate between dense and locally banded sparse attention patterns, similar to the Sparse Transformer.

Source: Language Models are Few-Shot Learners
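
The alternating attention patterns described above can be illustrated with boolean attention masks. What follows is a minimal sketch and not the authors' implementation: the function names, the window size, and the choice of making even-numbered layers dense are all assumptions made for illustration, since the description only states that dense and locally banded sparse patterns alternate.

```python
def causal_mask(n):
    """Dense causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, window):
    """Locally banded causal mask: position i attends only to the most
    recent `window` positions, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def layer_masks(n_layers, n, window):
    """Alternate dense and banded masks across layers.

    Assumption: even-numbered layers are dense and odd-numbered layers are
    banded. The actual ordering and window size are not given here.
    """
    return [
        causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, window)
        for layer in range(n_layers)
    ]


def show(mask):
    """Render a mask as text: 'x' marks an allowed query/key pair."""
    return "\n".join("".join("x" if m else "." for m in row) for row in mask)


if __name__ == "__main__":
    masks = layer_masks(n_layers=4, n=6, window=3)
    print("layer 0 (dense):")
    print(show(masks[0]))
    print("layer 1 (banded):")
    print(show(masks[1]))
```

In the dense layers each token can attend to its entire prefix, while in the banded layers the number of attended positions per token is capped at the window size, which reduces the attention cost for long sequences.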


Task                   Papers   Share
Language Modelling     74       9.67%
Question Answering     49       6.41%
Large Language Model   46       6.01%
Retrieval              32       4.18%
In-Context Learning    32       4.18%
Code Generation        28       3.66%
Prompt Engineering     25       3.27%
Sentence               23       3.01%
Text Generation        19       2.48%