GPT-3 is an autoregressive transformer model with 175 billion parameters. It uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
Source: Language Models are Few-Shot Learners
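The alternating attention pattern mentioned above can be illustrated with a small sketch. The snippet below is a minimal illustration, not GPT-3's released code; the function names, the even/odd layer schedule, and the window size are assumptions made for clarity. It builds per-layer causal attention masks that alternate between a dense pattern and a locally banded one, in the spirit of the Sparse Transformer.

```python
# Minimal sketch (assumed names and schedule, not GPT-3's actual implementation):
# per-layer causal attention masks alternating between dense and locally banded patterns.
import torch

def dense_mask(seq_len: int) -> torch.Tensor:
    # Causal mask: each position may attend to itself and all earlier positions.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def banded_mask(seq_len: int, window: int) -> torch.Tensor:
    # Causal mask restricted to the `window` most recent positions (including the current one).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (j <= i) & (j > i - window)

def layer_masks(num_layers: int, seq_len: int, window: int):
    # Alternate dense and locally banded masks across transformer layers.
    return [
        dense_mask(seq_len) if layer % 2 == 0 else banded_mask(seq_len, window)
        for layer in range(num_layers)
    ]

if __name__ == "__main__":
    for idx, mask in enumerate(layer_masks(num_layers=4, seq_len=8, window=3)):
        kind = "dense" if idx % 2 == 0 else "banded"
        print(f"layer {idx} ({kind}):\n{mask.int()}\n")
```

In a scheme like this, the banded layers cost roughly O(n·w) per attention map instead of O(n²), while the interleaved dense layers keep information flowing across the full context.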
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 137 | 17.50% |
| Question Answering | 58 | 7.41% |
| Retrieval | 37 | 4.73% |
| Text Generation | 36 | 4.60% |
| Few-Shot Learning | 22 | 2.81% |
| Prompt Engineering | 20 | 2.55% |
| Common Sense Reasoning | 14 | 1.79% |
| Zero-Shot Learning | 13 | 1.66% |
| Natural Language Inference | 13 | 1.66% |