GPT-3 is an autoregressive transformer model with 175 billion parameters. It uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
Source: Language Models are Few-Shot Learners
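To make "alternating dense and locally banded sparse attention patterns" concrete, here is a minimal sketch of how per-layer causal attention masks could alternate between a dense pattern and a banded (local) pattern. The band width, layer count, and function names below are illustrative assumptions, not GPT-3's actual configuration or code.

```python
import torch

def causal_dense_mask(seq_len: int) -> torch.Tensor:
    # Dense causal mask: each position may attend to every earlier position.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return j <= i

def causal_banded_mask(seq_len: int, band: int) -> torch.Tensor:
    # Locally banded causal mask: each position attends only to itself
    # and the previous `band - 1` positions.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < band)

def alternating_layer_masks(num_layers: int, seq_len: int, band: int):
    # Alternate dense and locally banded sparse attention across layers,
    # in the spirit of the Sparse Transformer; the ordering and band width
    # here are assumptions for illustration only.
    return [
        causal_dense_mask(seq_len) if layer % 2 == 0
        else causal_banded_mask(seq_len, band)
        for layer in range(num_layers)
    ]

masks = alternating_layer_masks(num_layers=4, seq_len=8, band=3)
```

Each mask would be applied to that layer's attention logits (disallowed positions set to negative infinity before the softmax), so even-indexed layers see the full causal context while odd-indexed layers attend only within a local band.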
Task | Papers | Share
---|---|---
Language Modelling | 53 | 19.56% |
Few-Shot Learning | 26 | 9.59% |
Question Answering | 24 | 8.86% |
Text Generation | 16 | 5.90% |
Pretrained Language Models | 13 | 4.80% |
Natural Language Inference | 11 | 4.06% |
Zero-Shot Learning | 10 | 3.69% |
Natural Language Understanding | 9 | 3.32% |
Semantic Parsing | 6 | 2.21% |