Autoregressive Transformers


Introduced by Brown et al. in Language Models are Few-Shot Learners

GPT-3 is an autoregressive transformer language model with 175 billion parameters. It uses the same architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization. The one exception is that the transformer layers in GPT-3 alternate between dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
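The alternation between dense and locally banded attention can be illustrated with boolean attention masks. The following is a minimal sketch, not the GPT-3 implementation: the function names, the band width (`window`), and the choice that even-numbered layers are dense while odd-numbered layers are banded are all assumptions made for illustration. The description above states only that the two patterns alternate.

```python
def dense_causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (any j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, window):
    """Causal mask limited to the `window` most recent positions, including i itself."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def layer_masks(num_layers, n, window):
    """Assumed schedule: even layers use dense attention, odd layers use banded attention."""
    return [
        dense_causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, window)
        for layer in range(num_layers)
    ]


def render(mask):
    """Draw a mask as text: 'x' marks an allowed query/key pair, '.' a blocked one."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # Hypothetical toy sizes chosen only to keep the printout small.
    for index, mask in enumerate(layer_masks(num_layers=2, n=5, window=2)):
        print(f"layer {index}")
        print(render(mask))
```

In both masks every position attends only to itself and earlier positions, which is what makes the model autoregressive. The banded mask additionally drops distant positions, so the number of allowed pairs per row is bounded by `window` rather than growing with the sequence length.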

Source: Language Models are Few-Shot Learners




Task                          Papers    Share
Language Modelling            137       17.50%
Question Answering            58        7.41%
Retrieval                     37        4.73%
Text Generation               36        4.60%
Few-Shot Learning             22        2.81%
Prompt Engineering            20        2.55%
Common Sense Reasoning        14        1.79%
Zero-Shot Learning            13        1.66%
Natural Language Inference    13        1.66%